cyril cyril - 3 months ago 24
Python Question

Pandas: merge dataframes and create conditional new conditional coluns

I have two large pandas dataframes (with millions of rows) that contain two columns, a group identifier and an id. I'm trying to create a combined dataframe that contains the group, id, plus a column with a 1 if the id was in the first dataframe, else a 0 and a column with a 1 if the id was in the second dataframe, else a 0.

In other words, I'm trying to merge the two dataframes and create conditional columns based on if the id was present for each original dataframe. Any suggestions on how to approach this problem?

Here is a small example:

import pandas as pd

>>> df_a = pd.DataFrame({'group': list('AAABBB'), 'id': [11,12,13,21,22,23]})
>>> df_b = pd.DataFrame({'group': list('AAABB'), 'id': [11,13,14,22,24]})
>>> df_a

group id
A 11
A 12
A 13
B 21
B 22
B 23

>>> df_b

group id
A 11
A 13
A 14
B 22
B 24


The output should look like this:

>>> df_full
group id a b
A 11 1 1
A 12 1 0
A 13 1 1
A 14 0 1
B 21 1 0
B 22 1 1
B 23 1 0
B 24 0 1

Answer

You can create two columns for each of the data frame with one before merging and fill na with zero after merging:

df_a['a'] = 1
df_b['b'] = 1

pd.merge(df_a, df_b, how = 'outer', on = ['group', 'id']).fillna(0)

# group   id      a   b
# 0   A 11.0    1.0 1.0
# 1   A 12.0    1.0 0.0
# 2   A 13.0    1.0 1.0
# 3   B 21.0    1.0 0.0
# 4   B 22.0    1.0 1.0
# 5   B 23.0    1.0 0.0
# 6   A 14.0    0.0 1.0
# 7   B 24.0    0.0 1.0