nitin - 1 year ago 102
Python Question

# Vectorizing standard deviation calculations for pandas dataseries

I have a pandas dataseries, like so,

```
import pandas as pd

data = [1, 2, 3, 2, 4, 5, 6, 3, 5]
ds = pd.Series(data)
print(ds)

0    1
1    2
2    3
3    2
4    4
5    5
6    6
7    3
8    5
```

I am interested in getting the cumulative standard deviation at each index. For example, when I am at index 5, I want the standard deviation of the values at indices 0 through 5, i.e. `ds[0:6]`.

I have done this with the following code,

```
df = pd.DataFrame(columns=['data', 'avreturns', 'sd'])
df['data'] = data

for i in df.index:
    # .loc slices by label and includes the end label i
    # (.ix, used originally, was removed in pandas 1.0)
    dataslice = df.loc[0:i]
    df.loc[i, 'avreturns'] = dataslice['data'].mean()
    df.loc[i, 'sd'] = dataslice['data'].std()
print(df)

   data avreturns         sd
0     1         1        NaN
1     2       1.5  0.7071068
2     3         2          1
3     2         2  0.8164966
4     4       2.4   1.140175
5     5  2.833333    1.47196
6     6  3.285714   1.799471
7     3      3.25   1.669046
8     5  3.444444   1.666667
```

This works, but I am using a loop and it is slow. Is there a way to vectorize this?

I was able to vectorize the mean calculations by using the `cumsum()` function:

```
df.data.cumsum() / (df.index + 1)
```

Is there a way to vectorize the standard deviation calculations?

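You might be interested in `pd.expanding_std`, which calculates the cumulative (expanding-window) standard deviation for you. Note that this top-level function was deprecated in pandas 0.18 and later removed; in newer versions the same operation is written `ds.expanding().std()`.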

```
>>> pd.expanding_std(ds)
0         NaN
1    0.707107
2    1.000000
3    0.816497
4    1.140175
5    1.471960
6    1.799471
7    1.669046
8    1.666667
dtype: float64
```
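On recent pandas releases the expanding-window functions are exposed as methods on the Series. The following is a minimal sketch, assuming a pandas version that provides `Series.expanding()`; it rebuilds the `avreturns` and `sd` columns from the question without an explicit loop. The name `result` is only illustrative.

```python
import pandas as pd

ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5])

# Expanding (cumulative) window: row i is computed from ds[0..i] inclusive.
result = pd.DataFrame({
    "data": ds,
    "avreturns": ds.expanding().mean(),
    # Sample standard deviation (ddof=1), so the first row is NaN.
    "sd": ds.expanding().std(),
})
print(result)
```

The `sd` column should agree with the loop-based values in the question, since `Series.std()` also uses `ddof=1` by default.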

For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
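If a pure array-based version is still wanted, one possible approach (not what pandas does internally) extends the `cumsum()` trick from the question by also keeping a running sum of squares. The sketch below assumes the sample standard deviation (`ddof=1`) is the target; the variable names are illustrative. Be aware that this textbook formula can lose precision when the values are large relative to their spread, which is one reason the built-in expanding functions are usually the safer choice.

```python
import numpy as np
import pandas as pd

ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5], dtype="float64")

n = np.arange(1, len(ds) + 1)   # number of observations up to each row
s1 = ds.cumsum()                # running sum of x
s2 = (ds ** 2).cumsum()         # running sum of x**2

# Sample variance: (sum(x**2) - sum(x)**2 / n) / (n - 1).
# It is undefined for a single observation, so use NaN as the divisor there.
denom = np.where(n > 1, n - 1, np.nan)
var = (s2 - s1 ** 2 / n) / denom

# Clip tiny negative values caused by floating-point rounding before the sqrt.
cum_sd = np.sqrt(var.clip(lower=0))
print(cum_sd)
```

For the sample data this should match the output of the expanding standard deviation shown above, with NaN in the first row.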
