lnNoam - 1 year ago 66
Python Question

Sum amount associated with item in a column of lists

I have a dataframe as follows:

``````country   letter             keywords                 amount
c         y     ['fruits', 'apples', "banana"]       700
c         y     ["music", "dance", "banana"]         150
c         y     ['loud', "dance", "apples"]          350
``````

I'd like to sum the amount associated with each keyword.
Note:
`country`
and
`letters`
are not always the same, as in the contrived data above. Also, the list of
`keywords`
vary in size.

I've tried several solutions. I've attached my fastest one below. I've also tried solutions with
`apply`
and
`defaultdicts`
...

``````keywords_list = []
for i in zip(*[df[c] for c in df.columns]):
data = list(i[0:2])
for k in i[2]:
row = [k] + data + [i[-1]]
keywords_list(row)

df_expanded = pd.DataFrame(keywords_list)
df_expanded.groupby(list(range(3)))[3].sum().reset_index()
``````

Goal

``````  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150
``````

Edit: corrected error in example of goal

Data

``````country = list("ccc")
letters = list("yyy")
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]]
amount = [700, 150, 350]

df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount})
df = df[['country', 'letter', 'keywords', 'amount']]
``````

You can use:

``````df1 = pd.DataFrame(df.keywords.values.tolist())
.stack()
.reset_index(level=1, drop=True)
.rename('keywords')
print (df1)
0    fruits
0    apples
0    banana
1     music
1     dance
1    banana
2      loud
2     dance
2    apples
Name: keywords, dtype: object

print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True))
country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples
``````

Another solution:

``````df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \
.stack() \
.reset_index(name='keywords') \
.drop('level_3',axis=1)
print (df1)
country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples
``````

Then need `groupby` with aggregating `sum`:

``````print (df.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150
``````

Timings:

``````In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3',1))
1 loop, best of 3: 4.55 s per loop

In [48]: %timeit (jez1(df3))
10 loops, best of 3: 24.8 ms per loop

In [49]: %timeit (jez2(df3))
10 loops, best of 3: 29.7 ms per loop
``````

Code for timings:

``````df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()

def jez1(df):
df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords')
return df.drop('keywords', axis=1).join(df1).reset_index(drop=True)

def jez2(df):
df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)
return df1
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download