Hillary Sanders Hillary Sanders - 6 months ago 125
Python Question

How to use groupby transform across multiple columns

I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).

Here's an example dataframe:

foo_function = lambda x: np.sum(x.a+x.b)

df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})

# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a','b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a','b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object


But
transform
apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? e.g. I could use
apply
and then create
df['new_col']
by using
pd.match
, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2) which seems really hacky / would take a fair amount of code.

--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?

Answer

The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]: 
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22