ale19 ale19 - 6 months ago 33
Python Question

pandas: unexpected join behavior results in NaN

I have two dataframes that I'm trying to join in pandas (version 0.18.1).

test1 = pd.DataFrame({'id': range(1,6), 'place': ['Kent','Lenawee','Washtenaw','Berrien','Ottawa']})

id_1 place
0 1 Kent
1 2 Lenawee
2 3 Montreal
3 4 Berrien
4 5 Ottawa


test2 = pd.DataFrame({'id_2': range(6,11), 'id_parent': range(1,6)})

id_2 id_parent
0 6 1
1 7 2
2 8 3
3 9 4
4 10 5


Yet when I join the two tables, the last row doesn't join properly and, because it's a left join, results in NaN.

df = test2.join(test1,on='id_parent',how='left')

id_2 id_parent id_1 place
0 6 1 2 Lenawee
1 7 2 3 Montreal
2 8 3 4 Berrien
3 9 4 5 Ottawa
4 10 5 NaN NaN


This doesn't make sense to me-- id_parent and id_1 are the keys on which to join the two tables, and they both have the same value. Both columns have the same dtype (int64). What's going on here?

Answer

join joins primarily on indices, use merge for this:

In [18]:
test2.merge(test1,left_on='id_parent', right_on='id')

Out[18]:
   id_2  id_parent  id      place
0     6          1   1       Kent
1     7          2   2    Lenawee
2     8          3   3  Washtenaw
3     9          4   4    Berrien
4    10          5   5     Ottawa

You get the NaN because the rhs will use the rhs index and there is no entry for 0 and 5 so you get NaN