Shubhanshu Shubhanshu - 1 year ago 103
Python Question

How to filter a dataframe based on the values present in the list in the rows of a column in Python?

I have a dataframe which looks like:

business_id stars categories
0 abcd 4.0 ['Nightlife']
1 abcd1 3.5 ['Pizza', 'Restaurants']
2 abcd2 4.5 ['Groceries', 'Food']

I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400 000 rows and I only want the rows having categories 'Food' or 'Restaurants' in them.

I tried a lot of methods, including:

def foodie(x):
for row in x.itertuples():
if 'Food' in row[3] or 'Restaurant' in row[3]:
return x

df = df.apply(foodie, axis=1)

But this is obviously very very bad method since, I am using itertuples on 400 000 rows and my system goes on processing for infinite amount of time.

I also tried using list comprehension in
. But couldn't, since they all are filtering like
. And even all the
methods I saw, were being implemented for columns having single value in their columns.

So, how can I subset my dataframe using a fairly fast implementation of iterating over my rows and at the same time, select only those rows which have 'Food' or 'Restaurants' in their category?

Answer Source

You can use the apply method on the categories column and check if each element contains the Food or Restaurants based on which create a logic index array for subsetting:

df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]

#     business_id             categories      stars
# 1         abcd1   [Pizza, Restaurants]        3.5
# 2         abcd2      [Groceries, Food]        4.5
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download