Ivan - 1 year ago 215
Python Question

# Pandas cumsum on groupby not behaving as expected

I have a dataframe like this:

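``````import numpy as np
import pandas as pd
size = 100  # arbitrary sample size; the original value is not shown
df = pd.DataFrame({'prob':np.random.uniform(0,1,size), 'target':np.random.randint(0,2, size=size),
'pred':np.random.randint(0,2, size=size)})
``````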

I want to compute the `cumsum` of a `groupby` over a `qcut`:

``````df['box'] = pd.qcut(df['prob'], 10)
``````

I expected this to compute the cumulative sum across the groups, in bin order, but instead it computes a running sum for each row:

``````df['target_1'] = 1- df['target']
ch_curve = df.groupby('box').target.cumsum()/float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum()/float(df.target_1.sum())
``````

``````0     0.000000
1     0.018182
2     0.018182
3     0.018182
4     0.000000
5     0.018182
6     0.018182
7     0.018182
8     0.036364
9     0.018182
10    0.000000
11    0.018182
12    0.018182
13    0.036364
14    0.000000
15    0.036364
16    0.036364
17    0.036364
18    0.054545
19    0.000000
20    0.000000
21    0.018182
22    0.018182
23    0.05454
``````

What I expected instead is one cumulative value per bin, something like:

``````'(0.0, 0.1)'    0.04
'(0.1, 0.2)'    0.12 #(0.08 + previous 0.04 )
'(0.2, 0.3)'    0.17 #(0.05 + previous 0.12 )
``````

In your original code, `df.groupby('box').target.cumsum()` computes a running sum within each group, so the result has one value per row of the original DataFrame rather than one per bin. The division is then broadcast across all of those values. Instead, sum within each group first and then take the cumulative sum across groups.
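
To make the difference concrete, here is a minimal sketch on a small hand-made frame. The frame `toy` and its values are made up for illustration, and plain string labels stand in for the `qcut` bins; only the shape of the two results matters.

```python
import pandas as pd

# Hypothetical toy data: two bins ("a" and "b"), three rows each.
toy = pd.DataFrame({
    "box": ["a", "b", "a", "b", "a", "b"],
    "target": [1, 0, 1, 1, 0, 1],
})

# groupby(...).cumsum() returns one value per original row;
# the running total restarts inside each group.
per_row = toy.groupby("box")["target"].cumsum()

# groupby(...).sum() reduces to one value per group; the trailing
# cumsum() then accumulates across groups in sorted group order.
per_group = toy.groupby("box")["target"].sum()
curve = (per_group / toy["target"].sum()).cumsum()

print(per_row.tolist())  # [1, 0, 2, 1, 2, 2]
print(curve.to_dict())   # {'a': 0.5, 'b': 1.0}
```

With a categorical key such as the output of `pd.qcut`, the groups come out in bin order, so the running total follows the bins from lowest to highest.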
``````ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
``````