Huy Nguyen Huy Nguyen - 3 years ago 228
Python Question

extra rows in merged data frames pandas

I trying to merge 2 data frames. The frames do not share columns (except the keys). So merging should just add columns of the right to the left. However, I am also getting extra rows. I don't understand where the 2 extra rows come from.

If I use left_index and right_index then it would have worked perfect. However, I don't understand how normal merging on keys would have 2 extra rows like in my result. Thanks

dat1 = np.array([['Afghanistan', 2007, 'new_ep_m1524', 0],['Afghanistan', 2007, 'new_sn_m65', 0],
['Afghanistan', 2012, 'new_sn_f014', 1190],['Afghanistan', 2011, 'new_sn_f014', 851],
['Afghanistan', 2013, 'newrel_m014', 1705]], dtype=object)

dat2 = np.array([['ep', 'male', '15-24', 'Afghanistan', 2007],['sn', 'male', '65+', 'Afghanistan', 2007],
['sn', 'female', '0-14', 'Afghanistan', 2012],['sn', 'female', '0-14', 'Afghanistan', 2011],
['rel', 'male', '0-14', 'Afghanistan', 2013]], dtype=object)

left = pd.DataFrame(data=dat1, columns=['country', 'year', 'case_type', 'count'] )

rigt = pd.DataFrame(data=dat2, columns=['type', 'gender', 'age_group', 'country', 'year'])

display(left), display(right)
pd.merge(left,right, on=['country', 'year'], how='outer')

enter image description here

enter image description here

enter image description here

Answer Source

You have keys that are repeated in your dataset. Afghanistan 2007 has two rows in each data frame. When merging with a full outer join, it is not clear which of the two records Afghanistan 2007 should match between the two data frames. So, both are joined. This is why there are four records with Afghanistan 2007 in the merged data frame (2 from the first data frame + 2 from the second data frame).

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download