horcle_buzz horcle_buzz - 5 months ago 23
Python Question

Pandas: Filtering on describe output (count)

I am trying to limit the output returned by the describe output to a subset of only those records with a count great than or equal to any given number.

My dataframe is a subset of a larger one, and is defined as:

df = evaluations[['score','garden_id']]


When I run describe on this,

df.groupby('garden_id').describe()


I would like to filter the returned data to those records where count > a specified number.

My output looks like:

Out[39]:
score
garden_id
37254 count 6
mean 20
std 0
min 20
25% 20
50% 20
75% 20
max 20
37273 count 1
mean 17
std NaN
min 17
25% 17
50% 17
75% 17
max 17
37284 count 1
mean 19
std NaN
min 19
25% 19
50% 19
75% 19
max 19
37288 count 1
mean 11
std NaN
min 11
25% 11
50% 11
...


I was going to try modifying something like: modify-output-from-python-pandas-describe, but I only get this:

Out[40]:
Empty DataFrame
Columns: [score]
Index: []


when I run
df.groupby('garden_id').describe().loc[['count']]


I did try
newframe = df.groupby('garden_id').describe().count() > 3
, but I get a mask showing which records are true and which false, so not really sure how to use this, either.

Is there a way I can filter out desired records directly using the
describe
method without having to deal with data masking, etc.?

Answer

I think you can use slicers for selecting and then boolean indexing for finding index idx1, where mask is True:

import pandas as pd

df = pd.DataFrame({'score':[1,2,3,3,1,2],
                   'garden_id':[1,1,1,1,2,2]})

print (df)
   garden_id  score
0          1      1
1          1      2
2          1      3
3          1      3
4          2      1
5          2      2

newframe = df.groupby('garden_id').describe()
print (newframe)
                    score
garden_id                
1         count  4.000000
          mean   2.250000
          std    0.957427
          min    1.000000
          25%    1.750000
          50%    2.500000
          75%    3.000000
          max    3.000000
2         count  2.000000
          mean   1.500000
          std    0.707107
          min    1.000000
          25%    1.250000
          50%    1.500000
          75%    1.750000
          max    2.000000
idx = pd.IndexSlice
mask = newframe.loc[idx[:,'count'],:] > 3
print (mask)
                 score
garden_id             
1         count   True
2         count  False

idx1 = mask[mask.values].index.get_level_values('garden_id')
print (idx1)
Int64Index([1], dtype='int64', name='garden_id')

print (newframe.loc[idx[idx1,:],:])
                    score
garden_id                
1         count  4.000000
          mean   2.250000
          std    0.957427
          min    1.000000
          25%    1.750000
          50%    2.500000
          75%    3.000000
          max    3.000000
Comments