nitin nitin - 10 days ago 5
Python Question

Vectorizing standard deviation calculations for pandas dataseries

I have a pandas dataseries, like so,

import pandas as pd

data = [1, 2, 3, 2, 4, 5, 6, 3, 5]
ds = pd.Series(data)
print(ds)

0    1
1    2
2    3
3    2
4    4
5    5
6    6
7    3
8    5


I am interested in getting the standard deviation for each index. For example, when I am at index 5, I want to calculate the standard deviation of all values up to and including that index, i.e. ds.loc[0:5].

I have done this with the following code,

df = pd.DataFrame(columns=['data', 'avreturns', 'sd'])
df['data'] = data

for i in df.index:
    # .loc slices by label and includes the end label i
    dataslice = df.loc[0:i]
    df.loc[i, 'avreturns'] = dataslice['data'].mean()
    df.loc[i, 'sd'] = dataslice['data'].std()
print(df)

   data avreturns         sd
0     1         1        NaN
1     2       1.5  0.7071068
2     3         2          1
3     2         2  0.8164966
4     4       2.4   1.140175
5     5  2.833333    1.47196
6     6  3.285714   1.799471
7     3      3.25   1.669046
8     5  3.444444   1.666667


This works, but I am using a loop and it is slow. Is there a way to vectorize this?

I was able to vectorize the mean calculations by using the cumsum() function:

df.data.cumsum()/(df.index+1)


Is there a way to vectorize the standard deviation calculations?

Answer

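You might be interested in pd.expanding_std, which calculates the cumulative standard deviation for you. In pandas 0.18 and later the same operation is written ds.expanding().std(), and pd.expanding_std was removed in later releases:

>>> ds.expanding().std()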
0         NaN
1    0.707107
2    1.000000
3    0.816497
4    1.140175
5    1.471960
6    1.799471
7    1.669046
8    1.666667
dtype: float64
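
To fill both columns of the DataFrame from the question without a Python-level loop, something like the following should work. This is a minimal sketch, assuming a pandas version that provides the expanding() method on Series (0.18 or later); the column names are the ones used in the question.

```python
import pandas as pd

data = [1, 2, 3, 2, 4, 5, 6, 3, 5]
df = pd.DataFrame({'data': data})

# Expanding window: row i is computed from rows 0..i inclusive.
df['avreturns'] = df['data'].expanding().mean()
# std() defaults to ddof=1 (sample standard deviation),
# so the first row, which has a single observation, is NaN.
df['sd'] = df['data'].expanding().std()

print(df)
```

Unlike the loop in the question, the result columns here have dtype float64 rather than object, so the printed formatting will differ slightly.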

For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
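
That said, if a pure cumsum()-style expression is wanted, in the spirit of the mean calculation in the question, the sample variance can be written in terms of the running sum and the running sum of squares. The following is only a sketch of that textbook identity, not how pandas computes it, and the variable names (n, s1, s2) are made up for illustration. The formula subtracts two potentially large numbers, so it can lose precision when the values are large relative to their spread.

```python
import numpy as np
import pandas as pd

ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5], dtype=float)

# Number of observations seen up to and including each row.
n = pd.Series(np.arange(1, len(ds) + 1), index=ds.index, dtype=float)
s1 = ds.cumsum()           # running sum of x
s2 = (ds ** 2).cumsum()    # running sum of x squared

# Sample variance (ddof=1): (sum(x^2) - sum(x)^2 / n) / (n - 1).
# where() replaces the n == 1 denominator with NaN to avoid dividing by zero.
var = (s2 - s1 ** 2 / n) / (n - 1).where(n > 1)

# Rounding can push a near-zero variance slightly below zero; clip before sqrt.
sd = np.sqrt(var.clip(lower=0))
print(sd)
```

On the data from the question this agrees with the expanding standard deviation shown above, with NaN in the first row.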