alphanumeric alphanumeric - 1 month ago 6
Python Question

How to drop duplicates from a DataFrame, taking into account the value of another column

When I drop John as a duplicate, specifying 'name' as the column name:

import pandas as pd
data = {'name':['Bill','Steve','John','John','John'], 'age':[21,28,22,30,29]}
df = pd.DataFrame(data)
df = df.drop_duplicates('name')


pandas drops all matching rows, keeping only the first occurrence:

   age   name
0   21   Bill
1   28  Steve
2   22   John


Instead, I would like to keep the row where John's age is the highest (in this example, age 30). How can I achieve this?

Answer

Try this:

In [75]: df
Out[75]:
   age   name
0   21   Bill
1   28  Steve
2   22   John
3   30   John
4   29   John

In [76]: df.sort_values('age').drop_duplicates('name', keep='last')
Out[76]:
   age   name
0   21   Bill
1   28  Steve
3   30   John

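Or, if you just want the last occurrence of each name in the original row order (which is not necessarily the highest age), use this: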

In [77]: df.drop_duplicates('name', keep='last')
Out[77]:
   age   name
0   21   Bill
1   28  Steve
4   29   John
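
For reference, here is the first approach as a standalone script, next to one possible alternative. This is only a sketch: the groupby/idxmax variant is not part of the answer above, the variable names by_sort and by_idxmax are made up for illustration, and the trailing sort_index() is an optional extra that restores the original row order.

```python
import pandas as pd

data = {'name': ['Bill', 'Steve', 'John', 'John', 'John'],
        'age': [21, 28, 22, 30, 29]}
df = pd.DataFrame(data)

# Approach from the answer: sort by age so the highest age comes last
# within each name, keep that last row, then restore the original order.
by_sort = (df.sort_values('age')
             .drop_duplicates('name', keep='last')
             .sort_index())

# Alternative (an assumption, not from the answer): find the index label
# of the maximum age for each name, then select those rows.
by_idxmax = df.loc[df.groupby('name')['age'].idxmax()].sort_index()

print(by_sort)
print(by_idxmax)
```

For this data, both results contain the rows with index 0, 1 and 3. If two rows for the same name share the maximum age, the two approaches may keep different rows, so check which one matches your needs.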