lnNoam lnNoam - 28 days ago 6
Python Question

Sum amount associated with item in a column of lists

I have a dataframe as follows:

country letter keywords amount
c y ['fruits', 'apples', "banana"] 700
c y ["music", "dance", "banana"] 150
c y ['loud', "dance", "apples"] 350


I'd like to sum the amount associated with each keyword.
Note:
country
and
letters
are not always the same, as in the contrived data above. Also, the list of
keywords
vary in size.

I've tried several solutions. I've attached my fastest one below. I've also tried solutions with
apply
and
defaultdicts
...

keywords_list = []
for i in zip(*[df[c] for c in df.columns]):
data = list(i[0:2])
for k in i[2]:
row = [k] + data + [i[-1]]
keywords_list(row)

df_expanded = pd.DataFrame(keywords_list)
df_expanded.groupby(list(range(3)))[3].sum().reset_index()





Goal

country letter keywords amount
0 c y apples 1050
1 c y banana 850
2 c y dance 500
3 c y fruits 700
4 c y loud 350
5 c y music 150


Edit: corrected error in example of goal




Data

country = list("ccc")
letters = list("yyy")
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]]
amount = [700, 150, 350]

df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount})
df = df[['country', 'letter', 'keywords', 'amount']]

Answer

You can use:

df1 = pd.DataFrame(df.keywords.values.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('keywords')
print (df1)
0    fruits
0    apples
0    banana
1     music
1     dance
1    banana
2      loud
2     dance
2    apples
Name: keywords, dtype: object

print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True))
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Another solution:

df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \
        .stack() \
        .reset_index(name='keywords') \
        .drop('level_3',axis=1)
print (df1)
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Then need groupby with aggregating sum:

print (df.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150

Timings:

In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3',1))
1 loop, best of 3: 4.55 s per loop

In [48]: %timeit (jez1(df3))
10 loops, best of 3: 24.8 ms per loop

In [49]: %timeit (jez2(df3))
10 loops, best of 3: 29.7 ms per loop

Code for timings:

df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()                

def jez1(df):
    df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords')
    return df.drop('keywords', axis=1).join(df1).reset_index(drop=True)

def jez2(df):
    df = df.set_index(['country','letter','amount'])
    df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)
    return df1