Khris - 3 months ago 18

Pandas groupby-median function fills empty bins with random numbers

While learning pandas, I stumbled over some odd behaviour of the `median` function for groupby objects when it is used on binned data.

Example Code:

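``````
import pandas as pd

# Sample values, stored in a single column named 0
d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24,23])

# Bin edges for the intervals (0, 5], (5, 10], ..., (50, 55]
b = [0,5,10,15,20,25,30,35,40,45,50,55]

# Aggregate the values per bin
print(d.groupby(pd.cut(d[0],b)).count())
print(d.groupby(pd.cut(d[0],b)).mean())
print(d.groupby(pd.cut(d[0],b)).median())
``````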

Output for count:

``````
(0, 5]    6
(5, 10]   8
(10, 15]  1
(15, 20]  0
(20, 25]  3
(25, 30]  0
(30, 35]  0
(35, 40]  1
(40, 45]  0
(45, 50]  0
(50, 55]  0
``````

Output for mean:

``````
(0, 5]     3.333333
(5, 10]    7.500000
(10, 15]  11.000000
(15, 20]        NaN
(20, 25]  24.000000
(25, 30]        NaN
(30, 35]        NaN
(35, 40]  36.000000
(40, 45]        NaN
(45, 50]        NaN
(50, 55]        NaN
``````

Output for median:

``````
(0, 5]     3.5
(5, 10]    7.5
(10, 15]  11.0
(15, 20]  18.0
(20, 25]  24.0
(25, 30]  30.5
(30, 35]  30.5
(35, 40]  36.0
(40, 45]  18.0
(45, 50]  18.0
(50, 55]  18.0
``````

All empty bins are filled with the numbers 18 and 30.5, which make no sense here.

Also, the last three numbers changed seemingly at random when I changed one number in the original list. For example, I got output like this:

``````
(0, 5]     3.500000e+00
(5, 10]    7.500000e+00
(10, 15]   1.100000e+01
(15, 20]   1.800000e+01
(20, 25]   2.450000e+01
(25, 30]   3.050000e+01
(30, 35]   3.050000e+01
(35, 40]   3.600000e+01
(40, 45]  3.814316e+228
(45, 50]  3.814316e+228
(50, 55]  3.814316e+228
``````

Changing another number in the list would give me output with the number 18 at the end again.

Is that just a bug?

Are there valid reasons for this behaviour?

Am I doing or interpreting something wrong here?

Right now I have to use the NaN output of `mean` to filter out the empty bins in the `median` result, but I think `median` should treat empty bins the same way `mean` does.
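A minimal sketch of that kind of filtering, assuming a pandas version whose `groupby` accepts the `observed` argument. It masks with the per-bin count instead of the mean, which expresses the same idea more directly. The names `values`, `edges`, and `safe_medians` are illustrative and not part of the original code, and a `Series` is used instead of the one-column `DataFrame` for brevity.

```python
import pandas as pd

# Same numbers and bin edges as in the example above
values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24, 23])
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

# observed=False keeps the empty bins in the result
grouped = values.groupby(pd.cut(values, edges), observed=False)

counts = grouped.count()
medians = grouped.median()

# Keep a median only where the bin actually contains data; other bins become NaN
safe_medians = medians.where(counts > 0)
print(safe_medians)
```

With this mask, whatever `median` reports for an empty bin is discarded, so the result no longer depends on how a particular pandas version handles empty groups.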

I'm pretty sure it's a bug:

Consider:

``````
gb = d.groupby(pd.cut(d[0],b))

gb.median()
``````

but:

``````
gb.get_group('(0, 5]').median()

0    3.5
dtype: float64
``````

and:

``````
gb.get_group('(15, 20]').median()
``````
``````
KeyError                                  Traceback (most recent call last)
<ipython-input-314-e1f4657d9a2d> in <module>()
----> 1 gb.get_group('(15, 20]').median()

/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in get_group(self, name, obj)
585         inds = self._get_index(name)
586         if not len(inds):
--> 587             raise KeyError(name)
588
589         return obj.take(inds, axis=self.axis, convert=False)

KeyError: '(15, 20]'
``````

So `median` on the `groupby` object reports a value for a bin even though `get_group` shows that this group doesn't exist.
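One way to cross-check which medians are trustworthy is to compute them per bin without going through `groupby` at all. The following is only a sketch under the assumption that the bins are right-closed, as `pd.cut` produces by default. The names `values`, `edges`, and `reference_medians` are made up for this illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24, 23], dtype=float)
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

# Right-closed intervals (0, 5], (5, 10], ..., matching the default of pd.cut
intervals = pd.IntervalIndex.from_breaks(edges, closed="right")

reference = []
for interval in intervals:
    # Select the values that fall into this bin
    in_bin = values[(values > interval.left) & (values <= interval.right)]
    # An empty bin has no median, so record NaN explicitly
    reference.append(np.median(in_bin) if in_bin.size else np.nan)

reference_medians = pd.Series(reference, index=intervals)
print(reference_medians)
```

Any bin where the grouped `median` shows a number but this reference shows NaN is one of the non-existent groups.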