Sabor Sabor - 2 years ago 140
Python Question

resample and aggregate using *multiple* *named* aggregation functions on *multiple* columns

I have a dataframe like

import pandas as pd
import numpy as np
idx = pd.date_range('2015-01-01', '2015-01-05', freq='15min')  # named idx to avoid shadowing the built-in range
df = pd.DataFrame(index=idx)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['otherF'] = np.random.randint(low=2, high=42, size=len(df.index))


I can easily resample and apply a built-in such as sum():

df['speed'].resample('1D').sum()
Out[121]:
2015-01-01 2865
2015-01-02 2923
2015-01-03 2947
2015-01-04 2751


I can also apply a custom function returning multiple values:

def mu_cis(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape)
    return np.mean(x_),np.mean(x_)-CI,np.mean(x_)+CI,len(x_)

df['speed'].resample('1D').agg(mu_cis)
Out[122]:
2015-01-01 (29.84375, [28.1098628611], [31.5776371389], 96)
2015-01-02 (30.4479166667, [28.7806726396], [32.115160693...
2015-01-03 (30.6979166667, [29.0182072972], [32.377626036...
2015-01-04 (28.65625, [26.965228204], [30.347271796], 96)
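
A side note on the bracketed values in this output: they appear to come from dividing by np.sqrt(x.shape). Since x.shape is a tuple such as (96,), the square root is a one-element array, and so is CI. The sketch below illustrates this on a made-up array (the variable names are only for illustration, not part of the original code); indexing the shape with [0] gives a plain scalar instead.

```python
import numpy as np

# Hypothetical stand-in for one day of 15-minute samples (96 values).
x = np.arange(96, dtype=float)

# x.shape is the tuple (96,), so np.sqrt returns a one-element array
# and the division yields an array as well.
ci_array = np.std(x) / np.sqrt(x.shape)

# x.shape[0] is the integer 96, so the result is a scalar.
ci_scalar = np.std(x) / np.sqrt(x.shape[0])

print(type(ci_array), ci_array.shape)
print(type(ci_scalar))
```

Both expressions hold the same number; only the container differs, which is why the mean prints as a bare float while the bounds print inside brackets.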


As I have read here (pandas apply function that returns multiple values to rows in pandas dataframe), I can even return multiple values, each with a name:

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape)
    e=np.mean(x_)
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])

df['speed'].resample('1D').agg(myfunc1)


which gives

Out[124]:
2015-01-01 MU 29.8438
MU+ [31.5776371389]
MU- [28.1098628611]
2015-01-02 MU 30.4479
MU+ [32.1151606937]
MU- [28.7806726396]
2015-01-03 MU 30.6979
MU+ [32.3776260361]
MU- [29.0182072972]
2015-01-04 MU 28.6562
MU+ [30.347271796]
MU- [26.965228204]


However, when I try to apply this to all the original columns, I only get NaNs:

df.resample('1D').agg(myfunc1)
Out[127]:
speed otherF
2015-01-01 NaN NaN
2015-01-02 NaN NaN
2015-01-03 NaN NaN
2015-01-04 NaN NaN
2015-01-05 NaN NaN


Results do not change whether I use agg or apply after the resample().

What is the right way to do this?

Answer

The problem is in myfunc1. It always tries to return a pd.Series, but when you resample the whole frame it is handed a pd.DataFrame. The following seems to work just fine.

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape[0]) #x.shape is a tuple; use the row count so every column is divided by sqrt(n)
    e=np.mean(x_)
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    try:
        return pd.DataFrame([e,f,g], index=['MU', 'MU+', 'MU-'], columns = x.columns)
    except AttributeError: #will still raise errors of other nature
        return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])

Alternatively:

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape[0]) #x.shape is a tuple; use the row count so every column is divided by sqrt(n)
    e=np.mean(x_)
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    if x.ndim > 1: #Equivalent to if len(x.shape) > 1
        return pd.DataFrame([e,f,g], index=['MU', 'MU+', 'MU-'], columns = x.columns)
    return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])    
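
If a custom function is not strictly required, the same three statistics can probably be assembled from the resampler's built-in mean, std and count, which avoids the Series-versus-DataFrame question altogether. This is only a sketch under some assumptions: the seeded generator and the names idx, daily, ci and result are mine, and ddof=0 is passed so that the standard deviation matches np.std. The built-ins skip NaN, so the count here is the number of non-missing samples per day.

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question's frame (seeded for repeatability).
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2015-01-05", freq="15min")
df = pd.DataFrame(
    {
        "speed": rng.integers(0, 60, size=len(idx)),
        "otherF": rng.integers(2, 42, size=len(idx)),
    },
    index=idx,
)

daily = df.resample("1D")
mean = daily.mean()
# ddof=0 matches np.std; count() is the number of non-NaN samples per day.
ci = daily.std(ddof=0) / np.sqrt(daily.count())

# The dict keys become the outer column level: ('MU', 'speed'), ('MU', 'otherF'), ...
result = pd.concat({"MU": mean, "MU+": mean + ci, "MU-": mean - ci}, axis=1)
print(result)
```

To group by original column rather than by statistic, result.swaplevel(axis=1).sort_index(axis=1) reorders the column levels.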