Anastasia Clark Anastasia Clark - 2 months ago 8
Python Question

Pandas: remove duplicate record while keeping its old value in dataframe for reference

I am rewriting a piece of the old code using pandas. My data frame looks like this:

index stop_id stop_name stop_lat stop_lon stop_id2
0 A12 Some St 40.889248 -73.898583 None
1 A14 Some St 40.889758 -73.908573 None
2 B09 Some St 40.788924 -74.846576 None
3 A22 Some St 40.889248 -73.898583 None


Note that stop_lat and stop_lon are duplicated for stop_ids 'A12' and 'A22'.

I want to remove the duplicate stop (stop_id='A22') while updating stop_d2 with the removed record's stop_id. So the data frame would look like this:

index stop_id stop_name stop_lat stop_lon stop_id2
0 A12 Some St 40.889248 -73.898583 A22
1 A14 Some St 40.889758 -73.908573 None
2 B09 Some St 40.788924 -74.846576 None


Previously I have worked this task with keeping my data in dictionary:

d={'A12':['Some St', 40.889248, -73.898583, None],'A14': ['Some St', 40.889758, -73.908573, None],'B09':['Some St, 40.788924,-74.846576, None], 'A22':['Some St', 40.889248, -73.898583, None]}

if d['A12'][1]+d['A12'][2]==d['A22'][1]+d['A22'][2]:
del d['A22']
d['A12'][-1]='A22'


I want to do similar task in pandas. I know if I just use:
df=df.drop_duplicates(['stop_lat','stop_lon'])

I will loose the duplicate record and won't retain its id.I need to keep id of the removed stop for proper metadata.

Answer
new_df = df[df.duplicated(subset = ['stop_lat', 'stop_lon'], keep='first')]

duplicates_df = df[df.duplicated(subset = ['stop_lat', 'stop_lon'], keep = 'last')][['stop_lat', 'stop_lon', 'stop_id']]

new_df.merge(duplicates_df, how='left', on=['stop_lat, 'stop_lon'])
Comments