Khris Khris - 4 months ago 26

Pandas groupby-median function fills empty bins with random numbers

I am learning the different aspects of Python Pandas and I stumbled over some odd behaviour of the median function for groupby-objects when it's used on binned data.

Example Code:

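import pandas as pd

# Sample values and bin edges (pd.cut builds right-closed intervals)
d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24,23])
b = [0,5,10,15,20,25,30,35,40,45,50,55]

# Group the values by bin, then aggregate each bin
print(d.groupby(pd.cut(d[0],b)).count())
print(d.groupby(pd.cut(d[0],b)).mean())
print(d.groupby(pd.cut(d[0],b)).median())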


Output for count:

(0, 5] 6
(5, 10] 8
(10, 15] 1
(15, 20] 0
(20, 25] 3
(25, 30] 0
(30, 35] 0
(35, 40] 1
(40, 45] 0
(45, 50] 0
(50, 55] 0


Output for mean:

(0, 5] 3.333333
(5, 10] 7.500000
(10, 15] 11.000000
(15, 20] NaN
(20, 25] 24.000000
(25, 30] NaN
(30, 35] NaN
(35, 40] 36.000000
(40, 45] NaN
(45, 50] NaN
(50, 55] NaN


Output for median:

(0, 5] 3.5
(5, 10] 7.5
(10, 15] 11.0
(15, 20] 18.0
(20, 25] 24.0
(25, 30] 30.5
(30, 35] 30.5
(35, 40] 36.0
(40, 45] 18.0
(45, 50] 18.0
(50, 55] 18.0


Every empty bin is filled with either 18 or 30.5, neither of which makes sense here.

When I changed a single number in the original list, the last three values changed seemingly at random, and I got output like this:

(0, 5] 3.500000e+00
(5, 10] 7.500000e+00
(10, 15] 1.100000e+01
(15, 20] 1.800000e+01
(20, 25] 2.450000e+01
(25, 30] 3.050000e+01
(30, 35] 3.050000e+01
(35, 40] 3.600000e+01
(40, 45] 3.814316e+228
(45, 50] 3.814316e+228
(50, 55] 3.814316e+228


Changing another number in the list brought the value 18 back into the last three bins.

Is that just a bug?

Are there valid reasons for this behaviour?

Am I doing or interpreting something wrong here?

For now I use the NaN output of mean() to filter out the empty bins in the median result, but I think median() should treat empty bins the same way mean() does.
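A minimal sketch of that kind of workaround, using the bin counts rather than the mean as the mask. The names `gb`, `counts`, `medians` and `safe_medians` are only illustrative, and the `observed=False` argument is an assumption about the pandas version: it exists only in more recent releases, where it keeps the empty bins in the result.

```python
import pandas as pd

d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24,23])
b = [0,5,10,15,20,25,30,35,40,45,50,55]

# Assumption: a pandas version that accepts `observed`;
# observed=False keeps the empty bins in the aggregated result.
gb = d.groupby(pd.cut(d[0], b), observed=False)

counts = gb.count()
medians = gb.median()

# Keep a median only where the bin actually holds data;
# every other bin becomes NaN, whatever median() reported for it.
safe_medians = medians.where(counts > 0)
print(safe_medians)
```

Masking on the count does not depend on what median() returns for an empty bin, so it should give the same result whether or not the installed version shows the behaviour described above.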

Answer

I'm pretty sure it's a bug:

Consider:

gb = d.groupby(pd.cut(d[0],b))

gb.median()


but:

gb.get_group('(0, 5]').median()

0    3.5
dtype: float64

and:

gb.get_group('(15, 20]').median()
KeyError                                  Traceback (most recent call last)
<ipython-input-314-e1f4657d9a2d> in <module>()
----> 1 gb.get_group('(15, 20]').median()

/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in get_group(self, name, obj)
    585         inds = self._get_index(name)
    586         if not len(inds):
--> 587             raise KeyError(name)
    588 
    589         return obj.take(inds, axis=self.axis, convert=False)

KeyError: '(15, 20]'

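So gb.median() reports a median for a group that does not even exist: get_group raises a KeyError for the empty bin '(15, 20]', yet the aggregated result contains a value for it.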
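One way to cross-check the aggregated result is to compute the median group by group and fill in NaN for every bin that has no group. This is only a sketch: `per_group` and `expected` are illustrative names, and `observed=True` assumes a more recent pandas version in which the groupby can be restricted to bins that actually contain data.

```python
import pandas as pd

d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24,23])
b = [0,5,10,15,20,25,30,35,40,45,50,55]
bins = pd.cut(d[0], b)

# Assumption: a pandas version that accepts `observed`;
# observed=True iterates only over bins that contain at least one value.
per_group = {
    name: group[0].median()
    for name, group in d.groupby(bins, observed=True)
}

# Walk over every bin; bins without a group get NaN explicitly.
all_bins = list(bins.cat.categories)
expected = pd.Series(
    [per_group.get(interval, float("nan")) for interval in all_bins],
    index=[str(interval) for interval in all_bins],
)
print(expected)
```

Any bin where gb.median() shows a number but this series shows NaN is a bin whose median was computed without any data behind it.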