user2242044 user2242044 - 3 years ago 126
Python Question

Merging multiple dataframes based on rows creates extra columns

I have a

MasterList
dataframe
that I merge other data sets into via a loop. Each time I merge a new column is created as with
_x
or
_y
. How can I just keep these as one column?

import pandas as pd
MasterList = pd.DataFrame(data = [['0001'],['0002'], ['0003'], ['0004']], columns = ['Order Number'])

customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])


for x in [customer_file1, customer_file2]:
MasterList = pd.merge(MasterList, x, how='left',left_on= 'Order Number',right_on= 'Order Number')


print MasterList


Output:

Order Number Day_x Day_y
0 0001 NaN T
1 0002 NaN S
2 0003 M NaN
3 0004 W NaN


Desired Output:

Order Number Day
0 0001 T
1 0002 S
2 0003 M
3 0004 W





Edit: People wanted more data as I over simplified my example:
I know year and day don't really make sense in the data set buy that is ok. Each customer file really come from a query of a different database, so I'd like to make the query from the database then merge in the data and forget about it rather than querying all the customer databases, concatenating, then merging.

import pandas as pd
MasterList = pd.DataFrame(data = [['0001', '2015'],['0002', '2015'], ['0003', '2016'], ['0004', '2015'], ['0005', '2017'], ['0006', '2018']], columns = ['Order Number', 'Year'])

customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])
customer_file3 = pd.DataFrame(data = [['0005', 'T'], ['0006', 'S']], columns = ['Order Number', 'Day'])

for x in [customer_file1, customer_file2, customer_file3]:
MasterList = pd.merge(MasterList, x, how='left', left_on='Order Number', right_on='Order Number')


print MasterList


output:

Order Number Year Day_x Day_y Day
0 0001 2015 NaN T NaN
1 0002 2015 NaN S NaN
2 0003 2016 M NaN NaN
3 0004 2015 W NaN NaN
4 0005 2017 NaN NaN T
5 0006 2018 NaN NaN S


Desired output:

Order Number Year Day
0 0001 2015 T
1 0002 2015 S
2 0003 2016 M
3 0004 2015 W
4 0005 2017 T
5 0006 2018 S

Answer Source

Option 1] Use map and combine_first

In [5044]: MasterList['Day'] = np.nan
      ...: for x in [customer_file1, customer_file2]:
      ...:     day = MasterList['Order Number'].map(x.set_index('Order Number')['Day'])
      ...:     MasterList['Day'] = MasterList['Day'].combine_first(day)
      ...:

In [5045]: MasterList
Out[5045]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

Options 2]

Use merge and append

In [5032]: MasterList.merge(customer_file1.append(customer_file2))
Out[5032]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

Or use merge and concat

In [5033]: MasterList.merge(pd.concat([customer_file1, customer_file2]))
Out[5033]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download