RDJ RDJ - 7 months ago 30
Python Question

Pandas: return a Dataframe with multiple aggregate values conditioned on another value

I'm trying to do something well beyond my Pandas level and have spent far too much time getting this wrong. In this example I need to return individual Dataframes for each of the teams. The dataframes would show the mean cost, mean area, and sum of size, for each grade.

Because I need to produce separate tables, I probably need to pass single team names into a function over and over. To be clear, I'm happy to pass the team names into a function (or similar) manually to produce each table.

team grade cost area size
0 man utd 1 52300 5 1045
1 chelsea 3 52000 42 957
2 arsenal 2 25000 20 1099
3 man utd 1 61600 20 1400
4 man utd 2 43000 43 1592
5 arsenal 2 23400 78 1006
6 man utd 2 52300 89 987
7 chelsea 4 62000 30 849
8 arsenal 1 62000 46 973
9 arsenal 2 73000 78 1005


The man utd dataframe would look like this for example:

grade mean_cost mean_area size
1 56590 12.5 2445
2 47650 66 2579

Answer

Use groupby/agg to group by both the team and grade, and the aggregate the cost, area and size columns. Note that agg can accept a dict whose keys are column names and whose values are aggreation functions (such as mean or sum). Thus you can specify aggregation functions on a per-column basis.

In [120]: df.groupby(['team', 'grade']).agg({'cost':'mean', 'area':'mean', 'size':'sum'}).rename(columns={'cost':'mean_cost', 'area':'mean_area'})
Out[120]: 
               size     mean_cost  mean_area
team    grade                               
arsenal 1       973  62000.000000  46.000000
        2      3110  40466.666667  58.666667
chelsea 3       957  52000.000000  42.000000
        4       849  62000.000000  30.000000
man utd 1      2445  56950.000000  12.500000
        2      2579  47650.000000  66.000000

groupby returns an iterable. Therefore, to make a dict mapping team names to DataFrames you could use:

dfs = {team:grp for team, grp in result.reset_index().groupby('team')}

For example,

import pandas as pd

df = pd.DataFrame(
    {'area': [5, 42, 20, 20, 43, 78, 89, 30, 46, 78],
     'cost': [52300, 52000, 25000, 61600, 43000, 23400, 52300, 62000, 62000, 73000], 
     'grade': [1, 3, 2, 1, 2, 2, 2, 4, 1, 2], 'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005], 
     'team': ['man utd', 'chelsea', 'arsenal', 'man utd', 'man utd', 'arsenal', 'man utd', 'chelsea', 'arsenal', 'arsenal']})

result =  df.groupby(['team', 'grade']).agg({'cost':'mean', 'area':'mean', 'size':'sum'}).rename(columns={'cost':'mean_cost', 'area':'mean_area'})

dfs = {team:grp.drop('team', axis=1) 
       for team, grp in result.reset_index().groupby('team')}

for team, grp in dfs.items():
    print('{}:\n{}\n'.format(team, grp))

yields

chelsea:
   grade  mean_cost  mean_area  size
2      3      52000         42   957
3      4      62000         30   849

arsenal:
   grade     mean_cost  mean_area  size
0      1  62000.000000  46.000000   973
1      2  40466.666667  58.666667  3110

man utd:
   grade  mean_cost  mean_area  size
4      1      56950       12.5  2445
5      2      47650       66.0  2579

Beware that for better performance try to avoid breaking up DataFrames into smaller DataFrames, because once you use a dict or a list, you are forced to use Python loops instead of the faster implicit C-compiled loops used by Pandas/NumPy methods.

So for computation try to stick with the result DataFrame. Use the dfs dict only if you have to do something like print the DataFrames separately.