Ivan - 1 year ago 86

Python Question

I have a dataframe like this:

```
df = pd.DataFrame({'prob': np.random.uniform(0, 1, size),
                   'target': np.random.randint(0, 2, size=size),
                   'pred': np.random.randint(0, 2, size=size)})
```

I want to compute a `cumsum` over a `groupby` on the bins produced by `qcut`:

`df['box'] = pd.qcut(df['prob'], 10)`

I expected this to give the cumulative sum across the groups, in order, but instead it computes a running sum for each element:

```
df['target_1'] = 1 - df['target']
ch_curve = df.groupby('box').target.cumsum() / float(df.target.sum())
nch_curve = df.groupby('box').target_1.cumsum() / float(df.target_1.sum())
```

which gives this output:

```
0     0.000000
1     0.018182
2     0.018182
3     0.018182
4     0.000000
5     0.018182
6     0.018182
7     0.018182
8     0.036364
9     0.018182
10    0.000000
11    0.018182
12    0.018182
13    0.036364
14    0.000000
15    0.036364
16    0.036364
17    0.036364
18    0.054545
19    0.000000
20    0.000000
21    0.018182
22    0.018182
23    0.05454
```

instead of

```
'(0.0, 0.1)' 0.04
'(0.1, 0.2)' 0.12  # (0.08 + previous 0.04)
'(0.2, 0.3)' 0.17  # (0.05 + previous 0.12)
```

Answer

You want to calculate the percentage for each group and **then** take the cumsum.

In your original code, `df.groupby('box').target.cumsum()` takes the cumulative sum within each group, so the result has one element for every row of the original DataFrame rather than one per group. The division is then broadcast across all of these elements.

Instead you want to get one summary statistic for each group and then take the cumsum.
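To make the difference in shape concrete, here is a minimal sketch on a small hand-made frame. The six-row `toy` frame and its `'a'`/`'b'` bin labels are made up for illustration and are not taken from the question:

```python
import pandas as pd

# Hypothetical toy data: two bins with three rows each.
toy = pd.DataFrame({
    'box':    ['a', 'a', 'a', 'b', 'b', 'b'],
    'target': [1, 0, 1, 0, 1, 1],
})

# cumsum within each group: one value per original row,
# and the running total restarts in every bin.
per_row = toy.groupby('box')['target'].cumsum()
print(per_row.tolist())   # [1, 1, 2, 0, 1, 2]

# sum per group: one value per bin.
per_bin = toy.groupby('box')['target'].sum()
print(per_bin.tolist())   # [2, 2]

# Share of the total per bin, accumulated across bins.
cum_share = (per_bin / toy['target'].sum()).cumsum()
print(cum_share.tolist())  # [0.5, 1.0]
```

The first result has six entries (one per row), while the last has two (one per bin), which is the shape the question expects.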

Here is the code that will do what you are after:

```
ch_curve = (df.groupby('box').target.sum() / df.target.sum()).cumsum()
nch_curve = (df.groupby('box').target_1.sum() / df.target_1.sum()).cumsum()
```