user3939059 user3939059 - 3 months ago 13
Python Question

Pandas, map two dataframes, count based on condition

I have written some code to map the ids of two dataframes and if a condition matches then create a count in a specified column in the existing dataframe, I am looking for a more efficient way of calculating it.

Sample Data

import numpy as np
import pandas as pd
d = {'ID' : pd.Series([111, 222, 111, 444, 222, 111]), 'Tag' : pd.Series([1, 2, 3, 1, 2, 1])}
df1 = (pd.DataFrame(d))
print(df1)

ID Tag
0 111 1
1 222 2
2 111 3
3 444 1
4 222 2
5 111 1

d = {'ID' : pd.Series([111, 444, 666, 444, 777])}
df2 = (pd.DataFrame(d))
print(df2)
ID
0 111
1 444
2 666
3 444
4 777

df2['tag1'] = 0
df2['tag2'] = 0
df2['tag3'] = 0
‚Äč
for index, row in df2.iterrows():
for i, t in df1.iterrows():
if row['ID'] == t['ID']:
if t['Tag'] == 1:
df2.loc[index]["tag1"] += 1
elif t['Tag'] == 2:
df2.loc[index]["tag2"] += 1
elif t['Tag'] == 3:
df2.loc[index]["tag3"] += 1


Output

print(df2)
ID tag1 tag2 tag3
0 111 2 0 1
1 444 1 0 0
2 666 0 0 0
3 444 1 0 0
4 777 0 0 0


What is the most efficient way of doing this, rather than computing iteratively?

Note, df1 can contain the sample
ID
multiple times with a different value of
Tag


(df1 and df2 are large dataframes, with 50,000 rows in df1 and 15,000 in df2)

Answer

You can use crosstab with merge:

print (pd.crosstab(df1.ID, df1.Tag))   
Tag  1  2  3
ID          
111  2  0  1
222  0  2  0
444  1  0  0

print (pd.merge(df2, pd.crosstab(df1.ID, df1.Tag)
                       .add_prefix('tag')
                       .reset_index(), on='ID', how='left')
         .fillna(0)
         .astype(int))  

    ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0

Instead crosstab you can use groupby with size and unstack:

print (df1.groupby(['ID', 'Tag'])['Tag'].size().unstack())   
Tag    1    2    3
ID                
111  2.0  NaN  1.0
222  NaN  2.0  NaN
444  1.0  NaN  NaN

print (pd.merge(df2, df1.groupby(['ID', 'Tag'])['Tag'].size().unstack()
                        .add_prefix('tag')
                        .reset_index(), on='ID', how='left')
         .fillna(0)
         .astype(int))  

    ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0