Andreas Andreas - 29 days ago 12
Python Question

Combining groupby and apply in multiIndex DataFrames

I am working with a multiIndex DataFrame and want to do some groupby / apply() operations. I am struggling with how to combine groupby and apply.

I would like to extract the values of two indices of my DataFrame and compare those values in an apply function.

For those occurrences where the apply function is true, I would like to do a groupby / sum over the values of my DataFrame.

Is there a good way to do this without using for loops?




# Index specifier
ix = pd.MultiIndex.from_product(
[['2015', '2016', '2017', '2018'],
['2016', '2017', '2018', '2019', '2020'],
['A', 'B', 'C']],
names=['SimulationStart', 'ProjectionPeriod', 'Group']
)

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()

# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
.apply(lambda x: x['ProjectionPeriod'] - \
x['SimulationStart'] <= 1).sum()

Answer

You could calculate the difference, ProjectionPeriod - SimulationStart, before performing the groupby/sum operation.

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()

import numpy as np
import pandas as pd
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018], 
     [2016, 2017, 2018, 2019, 2020], ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)

yields

                          Input
SimulationStart Group          
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105
Comments