Dinosaurius Dinosaurius - 2 months ago 9
Python Question

Merging results are wrong

I have two pandas DataFrames:

df1:

cid day total_count
0 2 2017-06-01 1
1 2 2017-03-04 1
2 1 2017-04-07 1
3 4 2017-06-25 1
4 5 2017-03-18 2
4 3 2017-03-18 2
4 1 2017-03-18 2
4 5 2017-03-18 2

df2 = pd.DataFrame(columns=["cid","pid","lat","lon"], data=[[1,1,41.485731,3.2409], [2,2,41.49206,3.22573],[3,3,41.494026,3.22354],[4,4,41.495904,3.14504],[5,5,41.50271,3.12575]])


I want just to add two columns
lat
and
lon
to the table
df1
from table
df2
.

I tried to do it this way:

result = pd.merge(df1, df2, left_on='cid', right_index=True, how='left', sort=False)


But I get a wrong result (
result.head()
):

cid_x day total_count cid_y pid lat lon
0 2 2017-06-01 1 1.0 1.0 41.475215 3.23462
1 2 2017-03-04 1 1.0 1.0 41.501326 3.41505
2 1 2017-04-07 1 2.0 2.0 41.484948 3.34780
3 4 2017-06-25 1 5.0 5.0 41.492983 3.43865
4 5 2017-03-18 1 3.0 3.0 41.502776 3.35977


First of all I do not understand why do I get two columns
cid_x
and
cid_y
instead of
cid
? Secondly I misunderstand why the values of
cid_x
and
cid_y
are different of each row? Shouldn't the
merge
command merge rows from
df1
and
df2
based on
cid
?

UPDATE:

I tried to show the issue based on a dummy data.

Answer Source

The way you did your join is the reason. You're using the cid as the join key from your left df, while you're using the index from the right df. Hence, your pseudo join SQL would be something like: on left.cid = right.index

If you want to join on cid for both df's, then just use the simple on argument:

result = pd.merge(df1, df2, on='cid', how='left')