Duffau Duffau - 20 days ago 6
Python Question

Pandas DataFrame: How to calculate the rolling mean of both grouped and shifted data in version >0.19

I have the followig

pd.DataFrame
:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'x1': [1, 2, 3, 4, 1, 2, 3, 4],
'x2': [4, 3, 2, 1, 4, 3, 2, 1]
})

> df
name x1 x2
0 a 1 4
1 a 2 3
2 a 3 2
3 a 4 1
4 b 1 4
5 b 2 3
6 b 3 2
7 b 4 1


I would like to calculate a rolling mean of
x1
and
x2
with
window
-size of 2 and
min_periods
of 1. The mean should be grouped by the
name
and the input to the
mean
-function should be shifted by one row, that is, the resulting row with index 2, should be calculated from rows (0,1). So for
x1
the rolling mean in row 2 should be (1+2)/2 = 1.5.

In Pandas version <= 0.18 I would do this:

> df.groupby('name').apply(lambda x: pd.rolling_mean(x.shift(1), window=2, min_periods=1))
x1 x2
0 NaN NaN
1 1.0 4.0
2 1.5 3.5
3 2.5 2.5
4 NaN NaN
5 1.0 4.0
6 1.5 3.5
7 2.5 2.5


Which is perfect, since row 0 and row 4 do not a any data, within each name group, of length 1, and the result should be
np.nan
.

In Pandas 0.19 and later the
rolling_mean
-function and functions alike, are throwing:

FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
DataFrame.rolling(min_periods=1,center=False,window=2).mean()


So in Pandas version >= 0.19 this is the best approach I could come up with:

df_shifted = df.groupby('name').apply(lambda x: x.shift(1))
> df_shifted.groupby('name').rolling(window=2, min_periods=1).mean()
name x1 x2
name
a 1 a 1.0 4.0
2 a 1.5 3.5
3 a 2.5 2.5
b 5 b 1.0 4.0
6 b 1.5 3.5
7 b 2.5 2.5


But this removes the
nan
-rows which I would like to keep for array-dimension reasons and returns a DataFrame with
MultiIndex
.

Is there a nice one-line-kind-of-way of solving this while keeping the
nan
-rows and returning a DataFrame with a flat index?

EDIT
The method should handle nan's like the 0.18-method. So if
x1 = [np.nan, 2, 3, 4, 1, 2, 3, 4]
the rolling mean at index 1 should return
np.nan
, but the rolling mean at index 2 should return
2.0
, since
(np.nan + 2)/1 -> 2.0
that is number of non-nan's is less or equal to
min_periods
.

Answer

To avoid the Deprecation warnings, starting with version 0.19.1, you can rewrite the syntax as shown:

  1. shift the DF by 1 level
  2. compute rolling mean

df.groupby('name').apply(lambda x: x.shift().rolling(window=2, min_periods=1).mean()
#                                  DataFrame.rolling(*args, **kwargs).mean()

enter image description here