spiderface spiderface - 3 months ago 44
Python Question

Explain how pandas DataFrame join works

Why does inner join work so strange in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1\n'

t2 = ('key,col2\n'

df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df2.join(df1, on='key', how='inner', lsuffix='_l'))


key col1
0 1 a
1 2 b
2 3 c
3 4 d

key col2
0 1 e
1 2 f
2 3 g
3 4 h

key_l col2 key col1
0 1 e 2 b
1 2 f 3 c
2 3 g 4 d

If I don't specify
, it says

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')

Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows?
I expected it to output something like this:

key col1 col2
0 1 a e
1 2 b f
2 3 c g
3 4 d h


First things first:
What you wanted was merge


enter image description here

join defaults to merging on the index. You can specify the on parameter which only says which column from left side to match with the index of the right side.

These might help illustrate


enter image description here

df1.join(df2.set_index('key'), on='key')

enter image description here

Your example is matching the index of df2 which looks like [0, 1, 2, 3] with the key column of df1 which looks like [1, 2, 3, 4]
That's why you get NaN in col2 when key_l is 4

df1.join(df2, on='key', lsuffix='_l', how='outer')

enter image description here