Nikita Nikita - 20 days ago 8
Python Question

Pandas outer merge two versions of the same DataFrame

I want to merge two dataframes that look like this:

In[14]: test1=pd.DataFrame({'col1':[1,2,3,
6,4,5],
'col2':['First','Second','Third',
'Sixth','Fourth','Fifth']})
test1
Out[14]:

col1 col2
0 1 First
1 2 Second
2 3 Third
3 6 Sixth
4 4 Fourth
5 5 Fifth


and

In[15]: test2=pd.DataFrame({'col1':[1,7,2,
3,4,5],
'col2':['First','Seventh','Second',
'Third','Fourth','Fifth']})
test2
Out[15]:

col1 col2
0 1 First
1 7 Seventh
2 2 Second
3 3 Third
4 4 Fourth
5 5 Fifth


As you may notice, these DataFrames are almost the same, but each of them has one extra row that isn't in the other one (
3 6 Sixth
in
test1
and
1 7 Seventh
in
test2
).

I want to merge these DataFrames in such a way that any extra rows in one DataFrame are inserted into the other one as close as possible to their original location. Here's the result I'm hoping to get:

col1 col2
0 1 First
1 7 Seventh
2 2 Second
3 3 Third
4 6 Sixth
5 4 Fourth
6 5 Fifth


I tried using

In[16]: pd.merge(test1, test2, how='outer', sort=False)


This outputs

Out[16]:

col1 col2
0 1 First
1 2 Second
2 3 Third
3 6 Sixth
4 4 Fourth
5 5 Fifth
6 7 Seventh


As you can see, the second row from
test2
is now at the bottom. Calling
pd.merge(test2, test1, how='outer', sort=False)
gives a similar result, but with the fourth row of
test1
at the bottom. Sticking to the order of entries in both DataFrames is crucial to me, so this is not what I want.

I've also tried
update()
,
combine_first()
, and
replace()
, but they give an inner or left join.

How can I get
pandas
to do what I want?

Answer

You can use concat followed by drop_duplicates and sort_index:

df = pd.concat([test2, test1]).drop_duplicates().sort_index()

The resulting output:

   col1     col2
0     1    First
1     7  Seventh
2     2   Second
3     3    Third
3     6    Sixth
4     4   Fourth
5     5    Fifth

If you want the index of your new DataFrame to be unique, perform a reset_index at the end:

df = pd.concat([test2, test1]).drop_duplicates().sort_index().reset_index(drop=True)

Which gives a unique index:

   col1     col2
0     1    First
1     7  Seventh
2     2   Second
3     3    Third
3     6    Sixth
4     4   Fourth
5     5    Fifth