Amelio Vazquez-Reina - 10 days ago
Python Question

Apply vs transform on a group object

Consider the following dataframe:

A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922


The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())


but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object


Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)


In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})

Answer

I was similarly confused by .transform versus .apply, and I found a few answers that shed some light on the issue. This answer, for example, was very helpful.

My takeaway so far is that .transform works on Series (columns) in isolation from each other. This means that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns at once, but it never 'sees' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a Series (or group of Series) of the same length as the input. When the function reduces a column to a scalar, that scalar is repeated len(input_column) times.

So that scalar is the result of a reduction function applied to an input Series (and to only ONE Series/column at a time).
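If what you are after is the per-group mean of C - D aligned to the original rows, one possible workaround (a sketch, not the only way to do it) is to combine the columns before grouping, so that transform only ever has a single Series to work on. The literal values are copied from the dataframe printed in the question; the names diff and group_mean_diff are made up for illustration.

```python
import pandas as pd

# Values copied from the dataframe printed in the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [0.162003, -1.156319, 0.833892, -2.026673,
          0.411452, 0.765878, -0.654890, -1.789842],
    'D': [0.087469, -1.526272, -1.666304, -0.322057,
          -0.954371, -0.095968, 0.678091, -1.130922],
})

# Combine the two columns first, so the grouped object is a single Series.
diff = df['C'] - df['D']

# Per-group mean of (C - D), repeated for every row of its group.
group_mean_diff = diff.groupby(df['A']).transform('mean')
print(group_mean_diff)
```

The result has the same index as df, so it can be assigned straight back as a new column.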

Consider this example (on your dataframe):

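# Note that zscore references nothing outside of 'x', and for transform 'x' is one column.
zscore = lambda x: (x - x.mean()) / x.std()
# Select the numeric columns explicitly; recent pandas versions no longer silently drop the string column 'B'.
df.groupby('A')[['C', 'D']].transform(zscore)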

will yield:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

This is exactly the same as applying it to one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509
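To check the column-by-column claim yourself, something like the following should do. It is only a sketch: it uses the values printed in the question rather than the random ones behind the outputs in this answer, and the names both, only_c and only_d are mine.

```python
import numpy as np
import pandas as pd

# Values copied from the dataframe printed in the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [0.162003, -1.156319, 0.833892, -2.026673,
          0.411452, 0.765878, -0.654890, -1.789842],
    'D': [0.087469, -1.526272, -1.666304, -0.322057,
          -0.954371, -0.095968, 0.678091, -1.130922],
})

zscore = lambda x: (x - x.mean()) / x.std()

# Transform both numeric columns in one call ...
both = df.groupby('A')[['C', 'D']].transform(zscore)

# ... and each column on its own.
only_c = df.groupby('A')['C'].transform(zscore)
only_d = df.groupby('A')['D'].transform(zscore)

# Each column of the joint result matches its single-column counterpart.
print(np.allclose(both['C'], only_c), np.allclose(both['D'], only_d))
```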

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is assigning the result of a reduction function back to the original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform('sum')
df.sort_values('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series indexed by the group keys ('bar', 'foo') rather than by the original row labels, so the assignment finds no matching index values to align on:

df.groupby('A')['C'].apply(sum)

giving:

A
bar    3.973
foo    4.373
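Here is a small sketch of that difference, again with the values printed in the question. The column names sum_C_reduced, sum_C_transform and sum_C_map are made up for the example, and I use .sum() on the group instead of .apply(sum), which produces the same reduced Series.

```python
import numpy as np
import pandas as pd

# Values copied from the dataframe printed in the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [0.162003, -1.156319, 0.833892, -2.026673,
          0.411452, 0.765878, -0.654890, -1.789842],
})

# Reduced result: one value per group, indexed by the group keys 'bar' and 'foo'.
group_sums = df.groupby('A')['C'].sum()

out = df.copy()

# Assignment aligns on index labels; 'bar'/'foo' never match 0..7, so every row is NaN.
out['sum_C_reduced'] = group_sums

# transform broadcasts each group's sum back onto the original row index.
out['sum_C_transform'] = df.groupby('A')['C'].transform('sum')

# An alternative way to broadcast the reduced Series: look it up by group key.
out['sum_C_map'] = df['A'].map(group_sums)

print(out)
```

The map version is handy when you already have the reduced Series for some other reason; otherwise transform does the broadcasting in one step.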

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179
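The same kind of filtering can, as far as I can tell, also be written with groupby(...).filter. The sketch below compares the two on the values printed in the question (so the surviving rows differ from the output above, which came from different random data); via_transform and via_filter are names I picked for illustration.

```python
import pandas as pd

# Values copied from the dataframe printed in the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [0.162003, -1.156319, 0.833892, -2.026673,
          0.411452, 0.765878, -0.654890, -1.789842],
    'D': [0.087469, -1.526272, -1.666304, -0.322057,
          -0.954371, -0.095968, 0.678091, -1.130922],
})

# Boolean mask with one entry per original row: True where the row's B-group sum of D is below -1.
mask = df.groupby('B')['D'].transform('sum') < -1
via_transform = df[mask]

# Same selection expressed as a group-level predicate.
via_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

# With these values the 'two' and 'three' groups are kept and 'one' is dropped.
print(via_transform)
```

The transform version gives you the mask itself, which is convenient if you want to combine it with other row-level conditions.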

I hope this adds a bit more clarity.