David Chouinard David Chouinard - 19 days ago 6
Python Question

Naming returned columns in Pandas aggregate function?

I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't see to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.

This comes very close, but the data structure returned has nested column headings:

data.groupby("Country").agg(
{"column1": {"foo": sum()}, "column2": {"mean": np.mean, "std": np.std}})


(ie. I want to take the mean and std of column2, but return those columns as "mean" and "std")

What am I missing?

Answer

This will drop the outermost level from the hierarchical column index:

df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)

For example:

import pandas as pd
import pandas.rpy.common as com
import numpy as np

data = com.load_data('Loblolly')
print(data.head())
#     height  age Seed
# 1     4.51    3  301
# 15   10.89    5  301
# 29   28.72   10  301
# 43   41.74   15  301
# 57   52.70   20  301

df = data.groupby('Seed').agg(
    {'age':{'foo':'sum'},
     'height':{'mean':'mean', 'std':'std'}})
print(df.head())
#       age     height           
#       foo        std       mean
# Seed                           
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667

df.columns = df.columns.droplevel(0)
print(df.head())

yields

      foo        std       mean
Seed                           
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667