I trying to merge 2 data frames. The frames do not share columns (except the keys). So merging should just add columns of the right to the left. However, I am also getting extra rows. I don't understand where the 2 extra rows come from.
If I use left_index and right_index then it would have worked perfect. However, I don't understand how normal merging on keys would have 2 extra rows like in my result. Thanks
dat1 = np.array([['Afghanistan', 2007, 'new_ep_m1524', 0],['Afghanistan', 2007, 'new_sn_m65', 0],
['Afghanistan', 2012, 'new_sn_f014', 1190],['Afghanistan', 2011, 'new_sn_f014', 851],
['Afghanistan', 2013, 'newrel_m014', 1705]], dtype=object)
dat2 = np.array([['ep', 'male', '15-24', 'Afghanistan', 2007],['sn', 'male', '65+', 'Afghanistan', 2007],
['sn', 'female', '0-14', 'Afghanistan', 2012],['sn', 'female', '0-14', 'Afghanistan', 2011],
['rel', 'male', '0-14', 'Afghanistan', 2013]], dtype=object)
left = pd.DataFrame(data=dat1, columns=['country', 'year', 'case_type', 'count'] )
rigt = pd.DataFrame(data=dat2, columns=['type', 'gender', 'age_group', 'country', 'year'])
pd.merge(left,right, on=['country', 'year'], how='outer')
You have keys that are repeated in your dataset.
Afghanistan 2007 has two rows in each data frame. When merging with a full outer join, it is not clear which of the two records
Afghanistan 2007 should match between the two data frames. So, both are joined. This is why there are four records with
Afghanistan 2007 in the merged data frame (2 from the first data frame + 2 from the second data frame).