user791411 user791411 - 26 days ago 14
Python Question

Lookup value in dictionary between range pandas

I have a "lookup" table formatted as such:

Min | Max | Val
1 | 99 | "Principal"
100 | 199 | "Partner"
... | ... | ...


There is a
CURRENT_POINTS
series in my dataframe that is between Min or Max (inclusive).

Question: how do I create a
VAL
column that is based on the above lookup table? My initial thought was to use
df.lookup
, but there are 800K rows in
df
and so the two tables are not equally sized.

Thanks in advance for your help!

Any thoughts?

Answer

I would use cut() method.

Assuming you have the following DFs:

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

In [188]: df
Out[188]:
   CURRENT_POINTS
0              55
1              10
2              20
3             144
4             194
5             143
6             397
7             233
8             128
9             215

Using cut() method we can produce a new column of a category dtype, which might save a lot of memory:

In [189]: df['Val'] = pd.cut(df.CURRENT_POINTS,
     ...:                    bins=[0] + lkp[['Min','Max']].stack()[1::2].tolist(),
     ...:                    labels=lkp.Val.tolist())
     ...:

In [190]: df
Out[190]:
   CURRENT_POINTS  Val
0              55  AAA
1              10  AAA
2              20  AAA
3             144  BBB
4             194  BBB
5             143  BBB
6             397  DDD
7             233  CCC
8             128  BBB
9             215  CCC

In [191]: df.dtypes
Out[191]:
CURRENT_POINTS       int32
Val               category
dtype: object

Category dtype can save a lot of memory:

In [192]: big = pd.concat([df] * 10**5, ignore_index=True)

In [193]: big.shape
Out[193]: (1000000, 2)

In [194]: big['str_col'] = 'AAA'

In [198]: big.dtypes
Out[198]:
CURRENT_POINTS       int32
Val               category
str_col             object
dtype: object

In [195]: big.memory_usage()
Out[195]:
Index                  80
CURRENT_POINTS    4000000
Val               1000032     # <--- `category` column takes 1 byte per row (plus 32 bytes overhead)
str_col           8000000

In [197]: big.head()
Out[197]:
   CURRENT_POINTS  Val str_col
0              55  AAA     AAA
1              10  AAA     AAA
2              20  AAA     AAA
3             144  BBB     AAA
4             194  BBB     AAA

NOTE: pay attention at memory usage for the category column Val and for the str_col column (dtype: object)

Explanation:

bins:

In [199]: lkp[['Min','Max']]
Out[199]:
   Min  Max
0    1   99
1  100  199
2  200  299
3  300  399

In [200]: lkp[['Min','Max']].stack()
Out[200]:
0  Min      1
   Max     99
1  Min    100
   Max    199
2  Min    200
   Max    299
3  Min    300
   Max    399
dtype: int64

In [201]: lkp[['Min','Max']].stack()[1::2].tolist()
Out[201]: [99, 199, 299, 399]

In [202]: [0] + lkp[['Min','Max']].stack()[1::2].tolist()
Out[202]: [0, 99, 199, 299, 399]

labels:

In [203]: lkp.Val.tolist()
Out[203]: ['AAA', 'BBB', 'CCC', 'DDD']

NOTE: lkp must be sorted by ['Min', 'Max'] before using it for bins and labels