milleractual milleractual - 4 months ago 18
Python Question

Speeding up cross-reference filtering in Pandas DB

I am working with a very large donation database of data with relevant columns for donation ID, conduit ID, amount, for example:

TRANSACTION_ID BACK_REFERENCE_TRAN_ID_NUMBER CONTRIBUTION_AMOUNT
0 VR0P4H2SEZ1 0 100
1 VR0P4H3X770 0 2700
2 VR0P4GY6QV1 0 500
3 VR0P4H3X720 0 1700
4 VR0P4GYHHA0 VR0P4GYHHA0E 200


What I need to do is to identify all of the rows where the TRANSACTION_ID corresponds to any BACK_REFERENCE_TRAN_ID_NUMBER. My current code, albeit a little clumsy, is:

is_from_conduit = df[df.BACK_REFERENCE_TRAN_ID_NUMBER != "0"].BACK_REFERENCE_TRAN_ID_NUMBER.tolist()
df['CONDUIT_FOR_OTHER_DONATION'] = 0
for row in df.index:
if df['TRANSACTION_ID'][row] in is_from_conduit:
df['CONDUIT_FOR_OTHER_DONATION'][row] = 1
else:
df['CONDUIT_FOR_OTHER_DONATION'][row] = 0


However, on very large data sets with a large number of conduit donations, this takes for ever. I know there must be a simpler way, but clearly I can't come up with how to phrase this to find out what that may be.

Answer

You can use Series.isin. It is a vectorized operation that checks if each element of the Series is in a supplied iterable.

df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique())

As @root mentioned if you prefer 0/1 (as in your example) instead of True/False, you can cast to int:

df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique()).astype(int)
Comments