D. Ross - 1 month ago
Python Question

Create a new column based on several lookup tables in Python Pandas

I have a large pandas dataframe (df_orig) and several lookup tables (also dataframes) that correspond to each of the segments in df_orig.

Here's a small subset of df_orig:

segment score1 score2
B3 0 700
B1 0 120
B1 400 950
B1 100 220
B1 200 320
B1 650 340
B5 300 400
B5 0 320
B1 0 240
B1 100 360
B1 940 700
B3 100 340


And here's a lookup table in its entirety for segment B5, called thresholds_b5 (there is a lookup table for each segment in the large dataset):

score1 score2
990 220
980 280
970 200
960 260
950 260
940 200
930 240
920 220
910 220
900 220
850 120
800 220
750 220
700 120
650 200
600 220
550 220
500 240
400 240
300 260
200 300
100 320
0 400


I want to create a new column in my large dataset that is analogous to this SQL logic:

case when segment = 'B5' then
    case when score1 = 990 and score2 >= 220 then 1
         when score1 = 980 and score2 >= 280 then 1
         .
         .
         .
         else 0 end
     when segment = 'B1' then
    .
    .
    .
     else 0 end as indicator


I was able to get the correct output using a loop based on the solution to this question:

df_b5 = df_orig[df_orig['segment'] == 'B5'].copy()
df_b5['indicator'] = 0

for i in range(len(thresholds_b5)):

    value1 = thresholds_b5.iloc[i, 0]
    value2 = thresholds_b5.iloc[i, 1]

    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1
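For reference, here is the loop run end-to-end on just the B5 rows from the sample above (the toy frames below are trimmed from the tables shown, with only the threshold rows that matter):

```python
import pandas as pd

# the two B5 rows from the sample subset
df_b5 = pd.DataFrame({'score1': [300, 0], 'score2': [400, 320]})

# a trimmed slice of thresholds_b5
thresholds_b5 = pd.DataFrame({'score1': [400, 300, 200, 100, 0],
                              'score2': [240, 260, 300, 320, 400]})

df_b5['indicator'] = 0
for i in range(len(thresholds_b5)):
    value1 = thresholds_b5.iloc[i, 0]
    value2 = thresholds_b5.iloc[i, 1]
    # flag rows whose score1 matches exactly and whose score2 meets the threshold
    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1

# (300, 400) clears the 260 threshold; (0, 320) misses the 400 threshold
```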


However, I'd need another loop to run this for each segment and then append all of the resultant dataframes back together, which is a bit messy. Furthermore, while I only have three segments (B1,B3,B5) for now, I'm going to have 20+ segments in the future.

Is there a way to do this more succinctly and preferably without loops? I've been warned that loops over dataframes tend to be slow and given the size of my dataset I think speed will matter.

Answer

If you are ok with sorting the DataFrames ahead of time, then you can replace your loop example with the new asof join in pandas 0.19:

import pandas as pd
import numpy as np

# query
df_b5 = df_orig.query('segment == "B5"')

# sort ahead of time
df_b5 = df_b5.sort_values('score2')
thresholds_b5 = thresholds_b5.sort_values('score2')

# set the default indicator as 1
thresholds_b5['indicator'] = 1

# join the tables
df = pd.merge_asof(df_b5, thresholds_b5, on='score2', by='score1')

# fill missing indicators as 0
df.indicator = df.indicator.fillna(0).astype(np.int64)

This is what I got:

  segment  score1  score2  indicator
0      B5       0     320          0
1      B5     300     400          1

If you need the original order, then save the index in a new column of df_orig and then resort the final DataFrame by that.
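A minimal sketch of that save-and-restore step (the toy frame and its index values are assumptions for illustration):

```python
import pandas as pd

# toy slice whose original row labels we want to preserve
df_b5 = pd.DataFrame({'score2': [400, 320]}, index=[6, 7])

# keep the original index as a regular column before sorting
df_b5 = df_b5.reset_index()
df_b5 = df_b5.sort_values('score2')

# ... merge_asof would happen here ...

# restore the original order afterwards
df_b5 = df_b5.sort_values('index').set_index('index')
```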

Of course, what you'd really want to do is concat all of your thresholds with the segment column set for each one, then invoke merge_asof with by=['segment', 'score1']. Unfortunately the current function doesn't allow for multiple by parameters.
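Later pandas versions do accept a list for by, so the concat approach can be sketched roughly like this (the toy frames below are assumptions, trimmed from the sample data):

```python
import pandas as pd

# toy observations across two segments
df_orig = pd.DataFrame({'segment': ['B5', 'B5', 'B3'],
                        'score1': [300, 0, 100],
                        'score2': [400, 320, 340]})

# toy per-segment lookup tables
thresholds_b5 = pd.DataFrame({'score1': [300, 0], 'score2': [260, 400]})
thresholds_b3 = pd.DataFrame({'score1': [100, 0], 'score2': [320, 400]})

# tag each lookup table with its segment, then stack them all
thresholds = pd.concat(
    [thresholds_b5.assign(segment='B5'), thresholds_b3.assign(segment='B3')],
    ignore_index=True)
thresholds['indicator'] = 1

# one asof join on score2, matched within each (segment, score1) group
df = pd.merge_asof(df_orig.sort_values('score2'),
                   thresholds.sort_values('score2'),
                   on='score2', by=['segment', 'score1'])
df['indicator'] = df['indicator'].fillna(0).astype(int)
```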
