user3939059 user3939059 - 3 months ago 19
Python Question

Pandas - Merge two data frames, create new column, append values to array

I am looking to merge two data frames on the same

id
in each dataframe, but to create a new column and append any values in a specified column to an array in the new dataframe column. I would expect to see multiple matching ids in the second data frame.

Here is an example to clarify what I am looking for:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(3, size=(5, 4)), columns=('ID', 'X1', 'X2', 'X3')))

ID X1 X2 X3
0 1 1 0 2
1 0 1 0 1
2 0 1 2 2
3 1 2 2 0
4 2 1 0 0

d = {'ID' : pd.Series([1, 2, 1, 4, 5]), 'Tag' : pd.Series(['One', 'Two', 'Two', 'Four', 'Five'])}
df2 = (pd.DataFrame(d))
print(df2)

ID Tag
0 1 One
1 2 Two
2 1 Two
3 4 Four
4 5 Five


This is what I am expecting to see for the first row:

ID X1 X2 X3 Merged_Tags
0 1 1 0 2 ['One', 'Two']


I want to join on the
id
column of df1 by looking through all of df2 for matching
ids
(there will be multiple matching ids). When a matching
id
is found, the value stored in
df2['Tag']
should be appended to a column in df1, perhaps an array.

I managed this iteratively but my dataset is relativity large and so have not found it viable.

Answer

try this:

In [35]: pd.merge(df1, df2.groupby('ID').Tag.apply(list).reset_index(), on='ID', how='left')
Out[35]:
   ID  X1  X2  X3         Tag
0   2   1   1   2       [Two]
1   1   0   1   1  [One, Two]
2   0   2   1   2         NaN
3   1   0   2   2  [One, Two]
4   0   0   2   2         NaN

alternatively you can use map() method:

In [38]: df1['Merged_Tags'] = df1.ID.map(df2.groupby('ID').Tag.apply(list))

In [39]: df1
Out[39]:
   ID  X1  X2  X3 Merged_Tags
0   2   1   1   2       [Two]
1   1   0   1   1  [One, Two]
2   0   2   1   2         NaN
3   1   0   2   2  [One, Two]
4   0   0   2   2         NaN
Comments