moustachio moustachio - 1 month ago 6
Python Question

Pandas concatenate list columns with groupby when grouping on multiple columns

I am trying to perform a simple groupby operation on a Pandas dataframe with list columns (with the goal of concatenating the lists corresponding to each group). It works fine when grouping on a single column, but for reasons I can't explain fails when grouping on two columns. A simplified example:

x = pd.DataFrame({'a':[1,1,2,2],'b':['a','a','a','b'],'c':[[1,2],[3,4],[5,6],[7,8]]})

a b c
0 1 a [1, 2]
1 1 a [3, 4]
2 2 a [5, 6]
3 2 b [7, 8]


Now, grouping on either
a
or
b
works as expected:

x.groupby('b')['c'].sum()

b
a [1, 2, 3, 4, 5, 6]
b [7, 8]
dtype: object

x.groupby('a')['c'].sum()

a
1 [1, 2, 3, 4]
2 [5, 6, 7, 8]
dtype: object


But if I try to group on
a
AND
b
(i.e.
x.groupby(['a','b'])['c'].sum()
), it invariably fails with
ValueError: Function does not reduce
.

On the surface I can't see why this should happen, as either way we're just concatenating lists, but I imagine it has something to do with Pandas internals...

Any workarounds or explanations?

Answer

I think it may be a bug, where sum fails when some rows can't be summed, the last two for example will remain split with the double grouping. The workaround is apply:

import pandas as pd
x = pd.DataFrame({'a':[1,1,2,2],'b':['a','a','a','b'],'c':[[1,2],[3,4],[5,6],[7,8]]})
print x
   a  b       c
0  1  a  [1, 2]
1  1  a  [3, 4]
2  2  a  [5, 6]
3  2  b  [7, 8]
print  x.groupby(('a'))['c'].apply(sum)
a
1    [1, 2, 3, 4]
2    [5, 6, 7, 8]
Name: c, dtype: object
print x.groupby(('a'))['c'].sum()
a
1    [1, 2, 3, 4]
2    [5, 6, 7, 8]
dtype: object
print x.groupby(('a','b'))['c'].apply(sum)
a  b
1  a    [1, 2, 3, 4]
2  a          [5, 6]
   b          [7, 8]
Name: c, dtype: object

I think you should submit this to the pandas team as well.