user3939059 - 4 years ago 273
Python Question

# Pandas, map two dataframes, count based on condition

I have written some code to map the ids of two dataframes and if a condition matches then create a count in a specified column in the existing dataframe, I am looking for a more efficient way of calculating it.

Sample Data

``````import numpy as np
import pandas as pd
d = {'ID' : pd.Series([111, 222, 111, 444, 222, 111]), 'Tag' : pd.Series([1, 2, 3, 1, 2, 1])}
df1 = (pd.DataFrame(d))
print(df1)

ID  Tag
0  111    1
1  222    2
2  111    3
3  444    1
4  222    2
5  111    1

d = {'ID' : pd.Series([111, 444, 666, 444, 777])}
df2 = (pd.DataFrame(d))
print(df2)
ID
0  111
1  444
2  666
3  444
4  777

df2['tag1'] = 0
df2['tag2'] = 0
df2['tag3'] = 0
​
for index, row in df2.iterrows():
for i, t in df1.iterrows():
if row['ID'] == t['ID']:
if t['Tag'] == 1:
df2.loc[index]["tag1"] += 1
elif t['Tag'] == 2:
df2.loc[index]["tag2"] += 1
elif t['Tag'] == 3:
df2.loc[index]["tag3"] += 1
``````

Output

``````print(df2)
ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0
``````

What is the most efficient way of doing this, rather than computing iteratively?

Note, df1 can contain the sample
`ID`
multiple times with a different value of
`Tag`

(df1 and df2 are large dataframes, with 50,000 rows in df1 and 15,000 in df2)

You can use `crosstab` with `merge`:

``````print (pd.crosstab(df1.ID, df1.Tag))
Tag  1  2  3
ID
111  2  0  1
222  0  2  0
444  1  0  0

print (pd.merge(df2, pd.crosstab(df1.ID, df1.Tag)
.reset_index(), on='ID', how='left')
.fillna(0)
.astype(int))

ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0
``````

Instead `crosstab` you can use `groupby` with `size` and `unstack`:

``````print (df1.groupby(['ID', 'Tag'])['Tag'].size().unstack())
Tag    1    2    3
ID
111  2.0  NaN  1.0
222  NaN  2.0  NaN
444  1.0  NaN  NaN

print (pd.merge(df2, df1.groupby(['ID', 'Tag'])['Tag'].size().unstack()
.reset_index(), on='ID', how='left')
.fillna(0)
.astype(int))

ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download