Lucas - 3 months ago
Python Question

How to overwrite identical column names when performing an "outer" join in Pandas?

I am trying to merge/join two csv's, based on a unique city/country/state column combination, using Pandas. However, when I try to do this with an outer join, I get extra columns, when instead I would prefer the "right" side of my join to overwrite the columns on the "left" side. Any suggestions?

Here is my attempt, with an example:

These are my csv's:

My "left" csv file:

| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+---------+---------|
| beijing | cn | 22 | 456 | 456 | 456 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 |


My "right" csv file:

| city | country | state | pop | lat | long |
|-------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |


and I want this result:

| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |


Note that mexico city and beijing are considered matches, based on their city, country, and state columns. Also note that on these matching rows, each column from my "left" csv is overwritten by the matching column from my "right" csv.

So here is my attempt using Pandas and dataframes:

import pandas as pd

left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')

result = pd.merge(left, right, on=['city', 'country', 'state'], how='outer')


Unfortunately, here is my result:

| city | country | state | pop_x | lat_x | long_x | pop_y | lat_y | long_y |
|--------------+---------+-------+----------+-----------+------------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 | 32707 | 33.219442 | -86.823907 |
| albertville | us | al | | 34.26313 | -86.21066 | | 34.26313 | -86.21066 |
| beijing | cn | 22 | 456 | 456 | 456 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 | 12294193 | 19.428 | -99.128 |
| mumbai | in | 16 | 12691836 | 19.073 | 72.883 | 12691836 | 19.073 | 72.883 |
| shanghai | cn | 23 | 22315474 | 31.222 | 121.458 | 22315474 | 31.222 | 121.458 |


As shown above, the columns that are not being used for the join, and which have the same name, are renamed with an _x suffix for the "left" dataframe and a _y suffix for the "right" dataframe.

Is there a simple way to make the columns from the "right" dataframe overwrite the columns from the "left" dataframe when matched?




Although there seem to be similar questions already out there, I still can't seem to find an answer. For example, I tried implementing the solution based on this question:

import pandas as pd

left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
left = left.set_index(['city', 'country', 'state'])
right = right.set_index(['city', 'country', 'state'])
left.update(right)


But update only performs left joins, so the resulting dataframe has only the same rows as the left dataframe; it is missing cities like adamsville and alabaster above.
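To see that behavior concretely, here is a miniature version of the two files, with the values inlined from the tables above (a sketch, not the full data):

```python
import pandas as pd

# Two-row stand-in for left.csv, keyed the same way.
left = pd.DataFrame({
    "city": ["beijing", "buenos aires"],
    "country": ["cn", "ar"],
    "state": ["22", "7"],
    "pop": [456, 13076300],
}).set_index(["city", "country", "state"])

# Two-row stand-in for right.csv: one matching key, one new key.
right = pd.DataFrame({
    "city": ["adamsville", "beijing"],
    "country": ["us", "cn"],
    "state": ["al", "22"],
    "pop": [4400, 11716620],
}).set_index(["city", "country", "state"])

left.update(right)
# beijing's pop is overwritten in place, but left still has only
# its original 2 rows: the adamsville row is never added.
print(len(left))
```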

Answer

Since the column names of both dataframes are the same, you could stack them and then do a drop_duplicates or a groupby.

For example:

result = pd.concat([left, right]).reset_index()
result.drop_duplicates(['city','country','state'], keep='first', inplace=True)

or:

df_stacked = pd.concat([left, right]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()

Calling first takes the values from the "left" df over the "right" df, because we stacked the "left" df on top of the "right" df before resetting the index.

Using groupby will allow you to perform more complex selects on the aggregated records if you don't want to just take the first or last record.
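For instance, a per-column aggregation could pick the larger population rather than always taking one side (a hypothetical variation, using a miniature stand-in for df_stacked):

```python
import pandas as pd

# Miniature stand-in for df_stacked: the same key appears once per side.
stacked = pd.DataFrame({
    "city": ["beijing", "beijing"],
    "country": ["cn", "cn"],
    "state": ["22", "22"],
    "pop": [456, 11716620],
    "lat": [456.0, 39.907],
})

# Instead of first(), aggregate column by column:
# keep the larger population but the first latitude seen.
result = stacked.groupby(["city", "country", "state"]).agg(
    {"pop": "max", "lat": "first"}
)
```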

EDIT:

Just realized you want the "right" df to overwrite the "left" df; in that case:

df_stacked = pd.concat([right, left]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()

This approach only works if the "left" and "right" dataframes contain no duplicate records to begin with.
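Put together, the right-overwrites-left variant might look like this with the question's data inlined (a sketch; the CSV reads are replaced with small literal frames):

```python
import pandas as pd

left = pd.DataFrame({
    "city": ["beijing", "buenos aires", "mexico city"],
    "country": ["cn", "ar", "mx"],
    "state": ["22", "7", "9"],
    "pop": [456, 13076300, 123],
})
right = pd.DataFrame({
    "city": ["adamsville", "alabaster", "beijing", "mexico city"],
    "country": ["us", "us", "cn", "mx"],
    "state": ["al", "al", "22", "9"],
    "pop": [4400, 32707, 11716620, 12294193],
})

# Stack "right" on top so its rows come first, then keep the first
# row per key: matched rows take right's values, unmatched rows survive.
stacked = pd.concat([right, left], ignore_index=True)
result = stacked.groupby(["city", "country", "state"]).first()
```

The result covers the union of both key sets (five rows here), with beijing and mexico city carrying the "right" values.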


And for the record, to produce the result csv shown in the example above, we can perform the following:

result = result.reset_index()
# sort by descending population; if populations are equal (or NaN), sort by ascending city name
result = result.sort_values(['pop', 'city'], ascending=[False, True])
result.drop('index', axis=1, inplace=True)
result.to_csv('result.csv', index=False)
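As a side note, pandas also ships DataFrame.combine_first, which does almost exactly this "outer join where the caller's values win" once both frames are keyed by the same index. A sketch, with small literal frames standing in for the CSVs:

```python
import pandas as pd

keys = ["city", "country", "state"]
left = pd.DataFrame({
    "city": ["beijing", "buenos aires"],
    "country": ["cn", "ar"],
    "state": ["22", "7"],
    "pop": [456, 13076300],
})
right = pd.DataFrame({
    "city": ["adamsville", "beijing"],
    "country": ["us", "cn"],
    "state": ["al", "22"],
    "pop": [4400, 11716620],
})

# right's non-null values take priority; left only fills the gaps,
# and the result covers the union of both key sets.
result = right.set_index(keys).combine_first(left.set_index(keys)).reset_index()
```

One caveat: combine_first fills cell by cell rather than row by row, so if a matched row in "right" has a NaN in some column, the value from "left" shows through for that cell -- unlike the concat/groupby recipe above, which keeps whole rows.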