nesi nesi - 2 months ago 17
Python Question

Pandas: Find duplicates and modify them based on date

I have a dataframe and I want to find the duplicates based on Color and Price. Then change the code with the code of the most recent(based on the Date) row. I don't want any row to be deleted.
The dataframe is this:

id Color Price Code Date
1 White 1.50 111 3-22-2017 12:00
2 Green 2.20 222 3-23-2017 09:55
3 Black 3.00 333 3-24-2017 11:45
4 White 1.50 111 3-23-2017 10:20
5 White 1.50 444 3-23-2017 08:15
6 Green 2.20 555 3-25-2017 07:05


the result should be this:

id Color Price Code Date
1 White 1.50 111 3-22-2017 12:00
2 Green 2.20 **555** 3-23-2017 09:55
3 Black 3.00 333 3-24-2017 11:45
4 White 1.50 111 3-23-2017 10:20
5 White 1.50 **111** 3-23-2017 08:15
6 Green 2.20 555 3-25-2017 07:05


I know that the answer is close to the answer here but the form of the date confuses me.

Answer Source

I think you need sort_values with GroupBy.transform and function last:

#if not datetime, convert it
df['Date'] = pd.to_datetime(df['Date'])

df['Code'] = df.sort_values('Date').groupby(['Color', 'Price'])['Code'].transform('last')
print (df)
   id  Color  Price  Code                Date
0   1  White    1.5   111 2017-03-22 12:00:00
1   2  Green    2.2   555 2017-03-23 09:55:00
2   3  Black    3.0   333 2017-03-24 11:45:00
3   4  White    1.5   111 2017-03-23 10:20:00
4   5  White    1.5   111 2017-03-23 08:15:00
5   6  Green    2.2   555 2017-03-25 07:05:00