Hillary Sanders Hillary Sanders - 10 months ago 319
Python Question

How to use groupby transform across multiple columns

I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).

Here's an example dataframe:

foo_function = lambda x: np.sum(x.a+x.b)

df = pd.DataFrame({'a':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],

# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a','b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a','b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object

apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? e.g. I could use
and then create
by using
, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2) which seems really hacky / would take a fair amount of code.

--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?


The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22