eljusticiero67 - 2 months ago
Python Question

Binning by value, except last bin

I am trying to bin data as follows:

pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)


But I would like to ensure that any data greater than 1 is also included in the last bin. I can do this in a couple of lines, but I am wondering whether anyone knows a one-liner or a more Pythonic way of doing this.

PS: I am not looking to do a qcut. I need the bins to be separated by their values, not by the count of records.

Answer

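Solution 1: prepare the labels from the original bins (cutting only the first 5 rows of the DF is enough, because the categories depend on the bin edges, not on the data), then replace the upper edge 1 with np.inf in the bins parameter: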

In [67]: df
Out[67]:
          a         b         c
0  1.698479  0.337989  0.002482
1  0.903344  1.830499  0.095253
2  0.152001  0.439870  0.270818
3  0.621822  0.124322  0.471747
4  0.534484  0.051634  0.854997
5  0.980915  1.065050  0.211227
6  0.809973  0.894893  0.093497
7  0.677761  0.333985  0.349353
8  1.491537  0.622429  1.456846
9  0.294025  1.286364  0.384152

In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories

In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
Out[69]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
Out[72]: array([ 0. ,  0.2,  0.4,  0.6,  0.8,  inf])

In [73]: labels
Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
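
As a self-contained variant of Solution 1, the sketch below writes the five labels out by hand instead of deriving them from a first pd.cut call. This is an assumption made for illustration, since newer pandas releases may print interval categories differently from the output shown above. The series a simply repeats column a of the example DataFrame.

```python
import numpy as np
import pandas as pd

# Column "a" from the example DataFrame above.
a = pd.Series(
    [1.698479, 0.903344, 0.152001, 0.621822, 0.534484,
     0.980915, 0.809973, 0.677761, 1.491537, 0.294025],
    name="a",
)

# Finite edges 0, 0.2, ..., 0.8, followed by an open-ended upper edge.
edges = np.append(np.arange(0, 1, 0.2), np.inf)

# Hand-written labels (illustrative assumption): one label per bin,
# so there must be len(edges) - 1 of them.
labels = ["[0, 0.2]", "(0.2, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1]"]

# Every value above 0.8, including those above 1, lands in the last bin.
binned = pd.cut(a, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

Because the last edge is np.inf, no value can fall above the top bin, so the result contains no NaN for large inputs. The label "(0.8, 1]" is then only a display name: the bin really covers everything greater than 0.8.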

Solution 2: clip all values greater than 1 down to 1 before binning, so that they fall into the last bin:

In [70]: pd.cut(df.a.clip(upper=1), np.arange(0,1.2, 0.2),include_lowest=True)
Out[70]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [75]: df.a
Out[75]:
0    1.698479
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.491537
9    0.294025
Name: a, dtype: float64

In [76]: df.a.clip(upper=1)
Out[76]:
0    1.000000
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.000000
9    0.294025
Name: a, dtype: float64
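
To see why the clip is needed, the following sketch bins the same column with and without clipping and compares the results. It checks integer bin codes rather than printed labels, since the label formatting can differ between pandas versions. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Column "a" from the example DataFrame above.
a = pd.Series(
    [1.698479, 0.903344, 0.152001, 0.621822, 0.534484,
     0.980915, 0.809973, 0.677761, 1.491537, 0.294025],
    name="a",
)

# Edges 0, 0.2, ..., 1.0 as in the question.
edges = np.arange(0, 1.2, 0.2)

# Without clipping, values above the last edge fall outside every bin
# and come back as NaN.
unclipped = pd.cut(a, bins=edges, include_lowest=True)
print(unclipped.isna().sum())

# With clipping, those values become exactly 1 and land in the last bin.
binned = pd.cut(a.clip(upper=1), bins=edges, include_lowest=True)

# Integer position of each value's bin (0 = first bin, 4 = last bin).
codes = binned.cat.codes
print(codes.tolist())
```

Clipping changes the values passed to pd.cut but not df.a itself, so the original data stays intact. Unlike the np.inf approach, it does not require supplying labels separately.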