I trying to merge 2 data frames. The frames do not share columns (except the keys). So merging should just add columns of the right to the left. However, I am also getting extra rows. I don't understand where the 2 extra rows come from.
If I use left_index and right_index then it would have worked perfect. However, I don't understand how normal merging on keys would have 2 extra rows like in my result. Thanks
dat1 = np.array([['Afghanistan', 2007, 'new_ep_m1524', 0],['Afghanistan', 2007, 'new_sn_m65', 0],
['Afghanistan', 2012, 'new_sn_f014', 1190],['Afghanistan', 2011, 'new_sn_f014', 851],
['Afghanistan', 2013, 'newrel_m014', 1705]], dtype=object)
dat2 = np.array([['ep', 'male', '15-24', 'Afghanistan', 2007],['sn', 'male', '65+', 'Afghanistan', 2007],
['sn', 'female', '0-14', 'Afghanistan', 2012],['sn', 'female', '0-14', 'Afghanistan', 2011],
['rel', 'male', '0-14', 'Afghanistan', 2013]], dtype=object)
left = pd.DataFrame(data=dat1, columns=['country', 'year', 'case_type', 'count'] )
rigt = pd.DataFrame(data=dat2, columns=['type', 'gender', 'age_group', 'country', 'year'])
display(left), display(right)
pd.merge(left,right, on=['country', 'year'], how='outer')
You have keys that are repeated in your dataset. Afghanistan 2007
has two rows in each data frame. When merging with a full outer join, it is not clear which of the two records Afghanistan 2007
should match between the two data frames. So, both are joined. This is why there are four records with Afghanistan 2007
in the merged data frame (2 from the first data frame + 2 from the second data frame).