spiderface spiderface - 2 months ago 27
Python Question

Explain how pandas DataFrame join works

Why does inner join work so strange in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1\n'
'1,a\n'
'2,b\n'
'3,c\n'
'4,d')

t2 = ('key,col2\n'
'1,e\n'
'2,f\n'
'3,g\n'
'4,h')


df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))


Outputs:

key col1
0 1 a
1 2 b
2 3 c
3 4 d

key col2
0 1 e
1 2 f
2 3 g
3 4 h

key_l col2 key col1
0 1 e 2 b
1 2 f 3 c
2 3 g 4 d


If I don't specify
lsuffix
, it says

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')


Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows?
I expected it to output something like this:

key col1 col2
0 1 a e
1 2 b f
2 3 c g
3 4 d h

Answer

First things first:
What you wanted was merge

df1.merge(df2)

enter image description here


join defaults to merging on the index. You can specify the on parameter which only says which column from left side to match with the index of the right side.

These might help illustrate

df1.set_index('key').join(df2.set_index('key'))

enter image description here

df1.join(df2.set_index('key'), on='key')

enter image description here


Your example is matching the index of df2 which looks like [0, 1, 2, 3] with the key column of df1 which looks like [1, 2, 3, 4]
That's why you get NaN in col2 when key_l is 4

df1.join(df2, on='key', lsuffix='_l', how='outer')

enter image description here

Comments