nitin - 2 months ago 20

Python Question

I have a pandas dataseries, like so,

```
import pandas as pd

data = [1,2,3,2,4,5,6,3,5]
ds = pd.Series(data)
print(ds)
```

```
0    1
1    2
2    3
3    2
4    4
5    5
6    6
7    3
8    5
```

I am interested in getting the cumulative standard deviation at each index. For example, when I am at index 5, I want the standard deviation of `ds[0:6]`, i.e. all values up to and including index 5.

I have done this with the following code,

```
df = pd.DataFrame(columns=['data', 'avreturns', 'sd'])
df['data'] = data

for i in df.index:
    # .loc slices by label and includes the end label i (.ix has been removed from pandas)
    dataslice = df.loc[0:i]
    df.loc[i, 'avreturns'] = dataslice['data'].mean()
    df.loc[i, 'sd'] = dataslice['data'].std()

print(df)
```

```
   data avreturns         sd
0     1         1        NaN
1     2       1.5  0.7071068
2     3         2          1
3     2         2  0.8164966
4     4       2.4   1.140175
5     5  2.833333    1.47196
6     6  3.285714   1.799471
7     3      3.25   1.669046
8     5  3.444444   1.666667
```

This works, but I am using a loop and it is slow. Is there a way to vectorize this?

I was able to vectorize the mean calculation by using `cumsum()`:

`df.data.cumsum()/(df.index+1)`

Is there a way to vectorize the standard deviation calculations?

Answer

You might be interested in `Series.expanding().std()` (spelled `pd.expanding_std` in older pandas versions), which calculates the cumulative standard deviation for you:

```
>>> ds.expanding().std()  # formerly pd.expanding_std(ds)
0 NaN
1 0.707107
2 1.000000
3 0.816497
4 1.140175
5 1.471960
6 1.799471
7 1.669046
8 1.666667
dtype: float64
```
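
Applied to the DataFrame from the question, both columns can be filled without the loop. This is a minimal sketch, assuming a pandas version that provides the `.expanding()` accessor (the method form of `pd.expanding_std`). The column names are taken from the question.

```python
import pandas as pd

data = [1, 2, 3, 2, 4, 5, 6, 3, 5]
df = pd.DataFrame({'data': data})

# An expanding window at row i covers rows 0..i inclusive.
window = df['data'].expanding()
df['avreturns'] = window.mean()
df['sd'] = window.std()  # sample standard deviation (ddof=1); NaN for the first row

print(df)
```

The first `sd` value is `NaN` because a sample standard deviation is undefined for a single observation, which matches the loop output in the question.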

For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
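
That said, the `cumsum()` trick from the question does extend to the standard deviation if you also keep a running sum of squares. The sketch below assumes the textbook identity for the sample variance, `s^2 = (sum(x^2) - sum(x)^2 / n) / (n - 1)`. It is not how pandas computes the result, and it can lose precision when the values are large relative to their spread.

```python
import numpy as np
import pandas as pd

ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5], dtype=float)

n = np.arange(1, len(ds) + 1)   # number of observations seen so far
s1 = ds.cumsum()                # running sum of x
s2 = (ds ** 2).cumsum()         # running sum of x**2

# Sample variance (ddof=1). At the first row n - 1 == 0, so 0/0 gives NaN,
# which matches what Series.std() returns for a single value.
with np.errstate(divide='ignore', invalid='ignore'):
    var = (s2 - s1 ** 2 / n) / (n - 1)

# Clip tiny negative values caused by floating-point round-off before the sqrt.
sd = np.sqrt(var.clip(lower=0))

print(sd)
```

For small, well-scaled data like the example this agrees with the expanding standard deviation; for anything numerically delicate, the built-in expanding window is the safer choice.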