Richard - 3 months ago
Python Question

List most common members in Pandas group?

I have a dataframe with columns like this:

id        lead_sponsor                           lead_sponsor_class
02837692  Janssen Research & Development, LLC    Industry
02837679  Aarhus University Hospital             Other
02837666  Universidad Autonoma de Ciudad Juarez  Other
02837653  Universidad Autonoma de Madrid         Other
02837640  Beirut Eye Specialist Hospital         Other


I want to find the most common lead sponsors. I can list the size of each group using:

df.groupby(['lead_sponsor', 'lead_sponsor_class']).size()


which gives me this:

lead_sponsor                 lead_sponsor_class
307 Hospital of PLA          Other                 1
3E Therapeutics Corporation  Industry              1
3M                           Industry              4
4SC AG                       Industry              8
5 Santé                      Other                 1


But how do I find the top 10 most common groups? If I do:

df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().sort_values(ascending=False).head(10)


Then I get an error:


AttributeError: 'Series' object has no attribute 'sort_values'

Answer

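You can use Series.nlargest, which returns the n largest values directly. The AttributeError most likely means your pandas version predates 0.17.0, the release that introduced Series.sort_values (earlier versions used Series.order instead):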

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().nlargest(10))

The Notes section of the Series.nlargest docs says:

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
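If upgrading pandas is not an option, a small feature check can pick whichever method the installed version offers. This is only a minimal sketch: the helper name top_n is hypothetical, the counts are the ones shown in the question's output, and the Series.order branch is an assumption about very old installations that cannot be exercised on current pandas.

```python
import pandas as pd


def top_n(counts, n=10):
    """Return the n largest values of a Series, largest first.

    Hypothetical helper: tries the methods in order of preference.
    """
    if hasattr(counts, "nlargest"):
        return counts.nlargest(n)
    if hasattr(counts, "sort_values"):
        return counts.sort_values(ascending=False).head(n)
    # Assumed fallback for legacy pandas, where Series.order was the
    # sorting method; this branch does not run on modern versions.
    return counts.order(ascending=False).head(n)


# Group sizes copied from the question's example output.
counts = pd.Series(
    [1, 1, 4, 8, 1],
    index=[
        "307 Hospital of PLA",
        "3E Therapeutics Corporation",
        "3M",
        "4SC AG",
        "5 Santé",
    ],
)

top_two = top_n(counts, 2)
print(top_two)
```

On any pandas that has nlargest, the first branch is taken and the other two are never reached.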

Sample:

import pandas as pd

df = pd.DataFrame({'id': {0: 2837692, 1: 2837679, 2: 2837666, 3: 2837653, 4: 2837640}, 
                   'lead_sponsor': {0: 'a', 1: 'a', 2: 'a', 3: 's', 4: 's'}, 
                   'lead_sponsor_class': {0: 'Industry', 1: 'Other', 2: 'Other', 3: 'Other', 4: 'Other'}})

print (df)
        id lead_sponsor lead_sponsor_class
0  2837692            a           Industry
1  2837679            a              Other
2  2837666            a              Other
3  2837653            s              Other
4  2837640            s              Other

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size())
lead_sponsor  lead_sponsor_class
a             Industry              1
              Other                 2
s             Other                 2
dtype: int64

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().sort_values(ascending=False).head(2))
lead_sponsor  lead_sponsor_class
s             Other                 2
a             Other                 2
dtype: int64

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().nlargest(2))
lead_sponsor  lead_sponsor_class
a             Other                 2
s             Other                 2
dtype: int64
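
Note that the two approaches above list the tied groups in a different order, so ties deserve some care when cutting the result off at n. The following is a minimal sketch on the same sample data, assuming pandas 0.24 or later, where nlargest accepts keep='all'. The variable names are illustrative only.

```python
import pandas as pd

# Same sample data as above, written with plain lists.
df = pd.DataFrame({
    "id": [2837692, 2837679, 2837666, 2837653, 2837640],
    "lead_sponsor": ["a", "a", "a", "s", "s"],
    "lead_sponsor_class": ["Industry", "Other", "Other", "Other", "Other"],
})

counts = df.groupby(["lead_sponsor", "lead_sponsor_class"]).size()

# Default behaviour: exactly one row, even though two groups share the top count.
top_first = counts.nlargest(1)

# keep="all" returns every group tied with the n-th largest value,
# so the result can be longer than n.
top_all = counts.nlargest(1, keep="all")

# Turn the MultiIndex Series into an ordinary table with a named count column.
flat = top_all.reset_index(name="count")
print(flat)
```

With keep='all' the result may contain more than n rows, which is usually what you want when reporting a "top 10" that should not silently drop tied groups.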