munk munk - 5 months ago 75
Python Question

Pandas - The difference between join and merge

Suppose I have two DataFrames like so:

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})

right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})


I want to merge them, so I try something like this:

pd.merge(left, right, left_on='key1', right_on='key2')


And I'm happy

key1 lval key2 rval
0 foo 1 foo 4
1 bar 2 bar 5


But I'm trying to use the join method, which I've been lead to believe is pretty similar.

left.join(right, on=['key1', 'key2'])


And I get this:

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:

AssertionError:


What am I missing?

Answer

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

     lval  rval
key            
foo     1     4
bar     2     5

First, join expects a single, common column. However, that doesn't seem to help here. The follow should, IMO, work. Unfortunately, it does not:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left.join(right, on=('key'), lsuffix='_l', rsuffix='_r')

  key_l  lval key_r  rval
0   foo     1   NaN   NaN
1   bar     2   NaN   NaN

Specifying an outer join is even more confusing:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left.join(right, on='key', lsuffix='_l', rsuffix='_r', how='outer')

   key key_l  lval key_r  rval
0  foo   foo     1   NaN   NaN
1  bar   bar     2   NaN   NaN
1    0   NaN   NaN   foo     4
1    1   NaN   NaN   bar     5

Look at that crazy index 0_o

My thoughts are that merge is for columns and join is for indices.

I don't know what else to say :/

Comments