Brian - 1 year ago 80

Python Question

I have a time series with an irregularly spaced index. I want to transform the data by subtracting a mean and dividing by a standard deviation for every point. However, I only want to calculate the means and standard deviations using those data values that are a predefined time distance away. In my example below, I used regularly spaced distances but I want this to accommodate irregular ones as well.

For example:

`n = 20`

ts = pd.Series(np.random.rand(n),

pd.date_range('2014-05-01', periods=n, freq='T', name='Time'))

Lets say I want the zscore for each point relative to all points within one minute of that point.

The final result should look like the following series.

`Time`

2014-05-01 00:00:00 0.707107

2014-05-01 00:01:00 -0.752435

2014-05-01 00:02:00 0.866662

2014-05-01 00:03:00 -0.576136

2014-05-01 00:04:00 -0.580471

2014-05-01 00:05:00 -0.253403

2014-05-01 00:06:00 -0.076657

2014-05-01 00:07:00 1.054413

2014-05-01 00:08:00 0.095783

2014-05-01 00:09:00 -1.030982

2014-05-01 00:10:00 1.041127

2014-05-01 00:11:00 -1.028084

2014-05-01 00:12:00 0.198363

2014-05-01 00:13:00 0.851951

2014-05-01 00:14:00 -1.152701

2014-05-01 00:15:00 1.070238

2014-05-01 00:16:00 -0.395849

2014-05-01 00:17:00 -0.968585

2014-05-01 00:18:00 0.077004

2014-05-01 00:19:00 0.707107

Freq: T, dtype: float64

Answer Source

This is something I've been working on. Keep in mind this is related to but different than (as I suspect you know, otherwise you probably wouldn't be asking the question) pandas `rolling`

feature. For your the regularly spaced data you gave, it would tie out pretty well and we can use that to compare.

What I'll do is use `np.subtract.outer`

to compute the distances of all items in a series with itself.

Assume we have your time series `ts`

```
import pandas as pd
import numpy as np
n = 20
np.random.seed([3,1415])
data = np.random.rand(n)
tidx = pd.date_range('2014-05-01', periods=n, freq='T', name='Time')
# ^
# |
# Minute Frequency
ts = pd.Series(data, tidx, name='Bliggles')
```

Now I can use the time index to calculate distaces like so

```
distances = pd.DataFrame(np.subtract.outer(tidx, tidx), tidx, tidx).abs()
```

From here, I test what is less than a desired distance. Say that distance is called `delta`

```
lt_delta = (distances <= delta).stack()
lt_delta = lt_delta[lt_delta]
```

Finally, I take the values from the index of `lt_delta`

and find what the corresponding values were in `ts`

```
pd.Series(ts.ix[lt_delta.index.to_series().str.get(1)].values, lt_delta.index)
```

I return a `groupby`

object so it looks and feels like calling `rolling`

. When I wrap it in a function, it looks like

```
def groupbydelta(ts, delta):
tidx = ts.index
distances = pd.DataFrame(np.subtract.outer(tidx, tidx), tidx, tidx).abs()
lt_delta = (distances <= delta).stack()
lt_delta = lt_delta[lt_delta]
closest = pd.Series(ts.ix[lt_delta.index.to_series().str.get(1)].values, lt_delta.index)
return closest.groupby(level=0)
```

Let's test it out. I'll use a `delta=pd.Timedelta(1, 'm')`

(that's one minute). For the time series I created, for every date time index, I should see that index, the minute prior, and the minute after. This should be equivalent to `ts.rolling(3, center=True)`

with the exceptions at the edges. I'll do both and compare.

```
gbdelta = groupbydelta(ts, pd.Timedelta(1, 'm')).mean()
rolling = ts.rolling(3, center=True).mean()
pd.concat([gbdelta, rolling], axis=1, keys=['Delta', 'Rolling']).head()
```

That looks great! Difference between the two being that `rolling`

has `NaN`

at the edges while `gbdelta`

doesn't require a specific number of elements, but that was by design.

What about irregular indices?

```
np.random.seed([3,1415])
n = 7200
data = np.random.rand(n)
tidx = (pd.to_datetime(['2013-02-06']) + np.random.rand(n) * pd.Timedelta(1, 'd'))
irregular_series = pd.Series(data, tidx, name='Sketch').sort_index()
```

And plot the `irregular_series`

and some filtered versions based on closest neighbors.

But you asked for zscores:

```
zd = (irregular_series - gbirr.mean()) / gbirr.std()
```

This z-scoring is a bit tricky. I had to find the grouped means and standard deviations and then use them with the original series. I'm still thinking about a smother way. But that's smooth enough.

What does it look like?

```
fig, axes = plt.subplots(1, 2, sharey=True, figsize=[10, 5])
irregular_series.plot(style='.', ax=axes[0], title='Original')
zd.plot(style='.', ax=axes[1], title='Z-Scored')
```

Finally, you asked about the z-score for your data example. To ensure I got the right answer...

```
gbd = groupbydelta(ts, pd.Timedelta(1, 'm'))
ts.sub(gbd.mean()).div(gbd.std())
Time
2014-05-01 00:00:00 0.707107
2014-05-01 00:01:00 -0.752435
2014-05-01 00:02:00 0.866662
2014-05-01 00:03:00 -0.576136
2014-05-01 00:04:00 -0.580471
2014-05-01 00:05:00 -0.253403
2014-05-01 00:06:00 -0.076657
2014-05-01 00:07:00 1.054413
2014-05-01 00:08:00 0.095783
2014-05-01 00:09:00 -1.030982
2014-05-01 00:10:00 1.041127
2014-05-01 00:11:00 -1.028084
2014-05-01 00:12:00 0.198363
2014-05-01 00:13:00 0.851951
2014-05-01 00:14:00 -1.152701
2014-05-01 00:15:00 1.070238
2014-05-01 00:16:00 -0.395849
2014-05-01 00:17:00 -0.968585
2014-05-01 00:18:00 0.077004
2014-05-01 00:19:00 0.707107
Freq: T, dtype: float64
```