gaatjeniksaan gaatjeniksaan - 9 months ago 71
Python Question

Returning nlargest for nested sorted dataframe using Pandas

I am struggling with a dataframe that lists a large amount of data that has several filtering levels if that makes sense. So my data is now sorted per state name alphabetically. Second filtering level applies to the population data per county. The filtering i have used was

. I now need to extract the highest three population county counts per state and I am at a loss how to pursue this.

So my data looks something like this (I omitted the row index here):

State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5

Now what I want from my dataframe is to get the highest populated counties per state. Not having worked with pandas much my initial guess would be to do a for. Something like this (obviously erronous):

list_l = []
for i in df['State'].unique(): #Get a series with every unique state listed once
list_l.append(df[(df['State'] == i])).nlargest(3)
return list_l

Even when typing this I know it doesn't fully make sense but I am not sure how to explain it better. Hope someone can help.


Answer Source

I think you need groupby with nlargest:

s = df.groupby('State')['Population'].nlargest(3)
print (s)
Alabama  0    100
         1     50
         2     40
Wyoming  5    180
         6    150
         7     56
Name: Population, dtype: int64

Then get_level_values from second level and select by ix:

print (s.index.get_level_values(1))
Int64Index([0, 1, 2, 5, 6, 7], dtype='int64')

print (df.ix[s.index.get_level_values(1)])
     State County  Population
0  Alabama      a         100
1  Alabama      b          50
2  Alabama      c          40
5  Wyoming   a.51         180
6  Wyoming   b.51         150
7  Wyoming   c.51          56