Ryan - 2 months ago
Python Question

Dropping duplicates in a dataframe?

Consider the following dataframe snippet, which has been sorted by Winner_Count.

      Year         Award  Winner          Name  Winner_Count  Winner_Pct
9347  2011  Best Actress     1.0  Meryl Streep            19    0.010144
9098  2009  Best Actress     0.0  Meryl Streep            19    0.010144
7483  1995  Best Actress     0.0  Meryl Streep            19    0.010144
6389  1985  Best Actress     0.0  Meryl Streep            19    0.010144
7835  1998  Best Actress     0.0  Meryl Streep            19    0.010144


All I want to do is group by Name, so that I don't see the same actor 19 times in a row (e.g., seeing Meryl and her Winner_Count only once would be fine), while the sorted order is preserved. So far, I've gotten various error messages and, on one occasion, an object reference, but I have yet to see a table. Some of the posts I've seen here suggest that getting a groupby object to display requires considerably more work than what is shown in, for example, Wes McKinney's video, which is strange.

Why is this not a simple
df_new = df.groupby('Name')
? And why won't the object appear automatically when/if a reference appears? I seem to be missing something fundamental about the groupby object and need a correction. Thoughts?

Edit:

The desired data set would look like this: one row for each actor, whereas in the original data set, there would be several.

      Year         Award  Winner               Name  Winner_Count  Winner_Pct
9347  2011  Best Actress     1.0       Meryl Streep            19    0.010144
5953  1981  Best Actress     1.0  Katharine Hepburn            12    0.006407
657   1938  Best Actress     1.0        Bette Davis            10    0.005339

Answer

Based on your edit, I think you need df.drop_duplicates:

In [352]: df_revised = df.drop_duplicates(subset='Name'); df_revised
Out[352]: 
   Year         Award  Winner          Name  Winner_Count  Winner_Pct
0  2011  Best Actress     1.0  Meryl Streep            19    0.010144

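By default, drop_duplicates keeps the first row it sees for each Name and drops the later duplicates, leaving the surviving rows in their existing order. This works fine if the row you want to keep for each actor already comes first, for example when your data is sorted by year.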
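As a rough sketch of that behaviour, the small frame below is rebuilt by hand from a few rows shown in the question, so it is only an assumption about what the real data looks like. It also shows why df.groupby('Name') on its own prints an object reference: the groupby object is lazy and only yields a table once an aggregation such as .first() is applied.

```python
import pandas as pd

# A few rows copied by hand from the question; the index values are the
# original row labels. This is illustrative sample data, not the full set.
df = pd.DataFrame(
    {
        "Year": [2011, 2009, 1995, 1981, 1938],
        "Award": ["Best Actress"] * 5,
        "Winner": [1.0, 0.0, 0.0, 1.0, 1.0],
        "Name": [
            "Meryl Streep",
            "Meryl Streep",
            "Meryl Streep",
            "Katharine Hepburn",
            "Bette Davis",
        ],
        "Winner_Count": [19, 19, 19, 12, 10],
        "Winner_Pct": [0.010144, 0.010144, 0.010144, 0.006407, 0.005339],
    },
    index=[9347, 9098, 7483, 5953, 657],
)

# keep="first" is the default: the first row seen for each Name survives,
# and the surviving rows stay in their existing order.
df_revised = df.drop_duplicates(subset="Name", keep="first")
print(df_revised)

# groupby alone only builds a lazy DataFrameGroupBy object.
grouped = df.groupby("Name", sort=False)
print(type(grouped).__name__)

# Applying an aggregation turns it into a table. sort=False keeps the groups
# in order of first appearance instead of sorting them by Name.
df_grouped = grouped.first()
print(df_grouped)
```

One difference worth noting: drop_duplicates keeps the original row labels and leaves Name as a column, while groupby(...).first() moves Name into the index.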

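If not, sort it first with df.sort_values, then call drop_duplicates again:

In [358]: df.sort_values(by=['Name', 'Year'], ascending=False, inplace=True)

Because ascending=False applies to both columns, this orders the rows by Name from Z to A and, within each name, by Year from newest to oldest, so the row kept for each actor is the most recent one. It also replaces the Winner_Count ordering, which has to be restored afterwards if it matters.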
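If the Winner_Count ordering from the question should survive, one possible variation is to sort on Winner_Count and Year instead of Name and Year, and only then drop the duplicates. This is a sketch under that assumption, not the code from the answer above; the sample frame is again rebuilt by hand from rows in the question and deliberately shuffled.

```python
import pandas as pd

# Hand-built sample in scrambled order, using rows from the question.
df = pd.DataFrame(
    {
        "Year": [1995, 1938, 2011, 1981, 2009],
        "Name": [
            "Meryl Streep",
            "Bette Davis",
            "Meryl Streep",
            "Katharine Hepburn",
            "Meryl Streep",
        ],
        "Winner_Count": [19, 10, 19, 12, 19],
    },
    index=[7483, 657, 9347, 5953, 9098],
)

# Highest Winner_Count first; within equal counts, newest Year first.
# Without inplace=True, sort_values returns a new frame and leaves df alone.
ordered = df.sort_values(by=["Winner_Count", "Year"], ascending=False)

# The first row per Name is now that actor's most recent entry.
latest_per_actor = ordered.drop_duplicates(subset="Name")
print(latest_per_actor)
```

If two actors share the same Winner_Count, their rows interleave by Year before deduplication. drop_duplicates still returns one row per actor in that case.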