piRSquared - 1 year ago 34

Python Question

I'll start with 3 simple examples:

`pd.DataFrame([[True]]).sum()`

0 1

dtype: int64

`pd.DataFrame([True]).sum()`

0 1

dtype: int64

`pd.Series([True]).sum()`

1

All of these are as expected. Here is a more complicated example.

`df = pd.DataFrame([`

['a', 'A', True],

['a', 'B', False],

['a', 'C', True],

['b', 'A', True],

['b', 'B', True],

['b', 'C', False],

], columns=list('XYZ'))

df.Z.sum()

4

Also as expected. However, if I

`groupby(['X', 'Y']).sum()`

I expected it to look like:

I'm thinking bug. Is there another explanation?

Per @unutbu's answer

pandas is trying to recast as original dtypes. I had thought that maybe the group by I'd performed didn't really groupby anything. So I tried this example to test out the idea.

`df = pd.DataFrame([`

['a', 'A', False],

['a', 'B', False],

['a', 'C', True],

['b', 'A', False],

['b', 'B', False],

['b', 'C', False],

], columns=list('XYZ'))

I'll

`groupby('X')`

`sum`

`1`

`0`

`bool`

`bool`

`df.groupby('X').sum()`

Sure enough...

`bool`

But if the process is the same but the values are slightly different.

`df = pd.DataFrame([`

['a', 'A', True],

['a', 'B', False],

['a', 'C', True],

['b', 'A', False],

['b', 'B', False],

['b', 'C', False],

], columns=list('XYZ'))

df.groupby('X').sum()

lesson learned. Always use

`astype(int)`

`df.groupby('X').sum().astype(int)`

gives consistent results for either scenario.

Answer

This occurs because `_cython_agg_blocks`

calls `_try_coerce_and_cast_result`

which calls `_try_cast_result`

which tries to return a result *of the same dtype* as the original values (in this case, `bool`

).

This returns something a little peculiar when `Z`

has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since `_try_cast_result`

does not convert 2.0 back to a boolean.

`_try_cast_result`

does something more useful when `Z`

has dtype `int`

: Internally, the Cython aggregator used by
`df.groupby(['X', 'Y']).sum()`

returns a `result`

of dtype `float`

. Here then, `_try_cast_result`

returns the result to dtype `int`

.

Source (Stackoverflow)