James Eaves James Eaves - 6 months ago 135
Python Question

Convert rows of pandas Dataframe into an iterable list of strings

Imagine I have a dataframe, df, that has 2 columns, a USER_ID and a PRODUCT they bought.

df
USER_ID | PRODUCT
1 a
1 b
1 c
2 d
2 a
2 k


I want to convert this DataFrame into a new one, df2, where each row is one user and the products are aggregated into a list of string.

df2
USER_ID | PRODUCT
1 [a,b,c]
2 [d,a,k]


Finally, I’d like to be able to find the intersection between two user’s PRODUCT lists.

I’m able to create the second dataframe, but the method I’m using results in a list that is somehow not iterable.

Specifically i do: 

df2 = df1.groupby(‘USER_ID)[‘PRODUCT’].agg(lambda x: x.tolist())

which gives me a series that I convert back to a dataframe.

df2 = df2.to_frame()


which gives me the df2 I’m looking for but each list of products has length = 1, thus I can’t compare 1 to another to find the intersection of products. For instance, when I execute:

s1 = df2.PRODUCT[df2.USER_ID == 1]
s2 = df2.PRODUCT[df2.USER_ID == 2]

common_elements = list(set(s1).intersection(set(s2)))
common_elements


The result is an empty list rather than [a]. What am I doing wrong?

Answer

is that what you want?

In [7]: pd.Series(np.intersect1d(df.loc[df.USER_ID == 1, 'PRODUCT'], df.loc[df.USER_ID == 2, 'PRODUCT']))
Out[7]:
0    a
dtype: object

or using index.intersection():

In [18]: (df.set_index('PRODUCT').query('USER_ID == 1').index
   ....:    .intersection(df.set_index('PRODUCT').query('USER_ID == 2').index)
   ....:    .to_series()
   ....: )
Out[18]:
PRODUCT
a    a
Name: PRODUCT, dtype: object

PS i wouldn't convert your df into df2 as you most probably will have a lot of difficulties with this data model (i mean having lists in columns)