Shubhanshu Shubhanshu - 9 months ago 35
Python Question

How to filter a dataframe based on the values present in the list in the rows of a column in Python?

I have a dataframe which looks like:

business_id stars categories
0 abcd 4.0 ['Nightlife']
1 abcd1 3.5 ['Pizza', 'Restaurants']
2 abcd2 4.5 ['Groceries', 'Food']

I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400 000 rows and I only want the rows having categories 'Food' or 'Restaurants' in them.

I tried a lot of methods, including:

def foodie(x):
for row in x.itertuples():
if 'Food' in row[3] or 'Restaurant' in row[3]:
return x

df = df.apply(foodie, axis=1)

But this is obviously very very bad method since, I am using itertuples on 400 000 rows and my system goes on processing for infinite amount of time.

I also tried using list comprehension in
. But couldn't, since they all are filtering like
. And even all the
methods I saw, were being implemented for columns having single value in their columns.

So, how can I subset my dataframe using a fairly fast implementation of iterating over my rows and at the same time, select only those rows which have 'Food' or 'Restaurants' in their category?


You can use the apply method on the categories column and check if each element contains the Food or Restaurants based on which create a logic index array for subsetting:

df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]

#     business_id             categories      stars
# 1         abcd1   [Pizza, Restaurants]        3.5
# 2         abcd2      [Groceries, Food]        4.5