Ivan Ivan - 4 months ago 42
Python Question

Pandas cumsum on groupby not behaving as expected

I have a dataframe like this:

df = pd.DataFrame({'prob':np.random.uniform(0,1,size), 'target':np.random.randint(0,2, size=size),
'pred':np.random.randint(0,2, size=size)})


That I want to compute
cumsum
of a
groupby
of a
qcut
:

df['box'] = pd.qcut(df['prob'], 10)


My expectation would be to calculate the cumulative function for each group, in order, but instead is calculating a sum for each element:

df['target_1'] = 1- df['target']
ch_curve = df.groupby('box').target.cumsum()/float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum()/float(df.target_1.sum())


with the answer

0 0.000000
1 0.018182
2 0.018182
3 0.018182
4 0.000000
5 0.018182
6 0.018182
7 0.018182
8 0.036364
9 0.018182
10 0.000000
11 0.018182
12 0.018182
13 0.036364
14 0.000000
15 0.036364
16 0.036364
17 0.036364
18 0.054545
19 0.000000
20 0.000000
21 0.018182
22 0.018182
23 0.05454


instead of

'(0.0, 0.1)' 0.04
'(0.1, 0.2)' 0.12 #(0.08 + previous 0.04 )
'(0.2, 0.3)' 0.17 #(0.05 + previous 0.12 )

Answer

You want to calculate the percentage for each group and then take the cumsum.

In your original code df.groupby('box').target.cumsum() will take the cumsum of each group - so you will have one element for each of the elements in the grouped DataFrame. Then the division will be broadcast across all of these elements.

Instead you want to get one summary statistic for each group and then take the cumsum.

Here is the code that will do what you are after:

ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
nch_curve = (df.groupby('box').target_1.sum() / df.target_1.sum()).cumsum()
Comments