Brian Brian - 4 months ago
Python Question

transform irregular timeseries into zscores relative to closest neighbors

I have a time series with an irregularly spaced index. I want to transform the data by subtracting a mean and dividing by a standard deviation for every point. However, I only want to calculate the means and standard deviations using those data values that are a predefined time distance away. In my example below, I used regularly spaced distances but I want this to accommodate irregular ones as well.

For example:

import numpy as np
import pandas as pd

n = 20
ts = pd.Series(np.random.rand(n),
               pd.date_range('2014-05-01', periods=n, freq='T', name='Time'))


Let's say I want the z-score for each point relative to all points within one minute of that point.

The final result should look like the following series.

Time
2014-05-01 00:00:00 0.707107
2014-05-01 00:01:00 -0.752435
2014-05-01 00:02:00 0.866662
2014-05-01 00:03:00 -0.576136
2014-05-01 00:04:00 -0.580471
2014-05-01 00:05:00 -0.253403
2014-05-01 00:06:00 -0.076657
2014-05-01 00:07:00 1.054413
2014-05-01 00:08:00 0.095783
2014-05-01 00:09:00 -1.030982
2014-05-01 00:10:00 1.041127
2014-05-01 00:11:00 -1.028084
2014-05-01 00:12:00 0.198363
2014-05-01 00:13:00 0.851951
2014-05-01 00:14:00 -1.152701
2014-05-01 00:15:00 1.070238
2014-05-01 00:16:00 -0.395849
2014-05-01 00:17:00 -0.968585
2014-05-01 00:18:00 0.077004
2014-05-01 00:19:00 0.707107
Freq: T, dtype: float64

Answer

This is something I've been working on. Keep in mind it's related to, but different from (as I suspect you know, otherwise you probably wouldn't be asking), pandas' rolling feature. For the regularly spaced data you gave, the two tie out pretty well, and we can use that for comparison.

What I'll do is use np.subtract.outer to compute the pairwise distances between all items of the time index.

Assume we have your time series ts

import pandas as pd
import numpy as np

n = 20
np.random.seed([3,1415])
data = np.random.rand(n)
tidx = pd.date_range('2014-05-01', periods=n, freq='T', name='Time')
#                                                   ^
#                                                   |
#                                            Minute Frequency
ts = pd.Series(data, tidx, name='Bliggles')

Now I can use the time index to calculate distances, like so:

distances = pd.DataFrame(np.subtract.outer(tidx, tidx), tidx, tidx).abs()
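As a quick sanity check on a tiny index (a minimal sketch, separate from the series above):

```python
import numpy as np
import pandas as pd

# three timestamps, one minute apart
idx = pd.date_range('2014-05-01', periods=3, freq='T')

# pairwise absolute time differences, as a square DataFrame
dist = pd.DataFrame(np.subtract.outer(idx.values, idx.values), idx, idx).abs()

print(dist.iloc[0, 1])  # 0 days 00:01:00
```

The diagonal is zero and the matrix is symmetric, which is exactly what a distance table should look like.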

From here, I test which pairs are within a desired distance. Say that distance is called delta:

lt_delta = (distances <= delta).stack()
lt_delta = lt_delta[lt_delta]
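The stack-then-filter idiom can be seen on a toy frame (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame([[0, 2], [2, 0]], index=['a', 'b'], columns=['a', 'b'])

mask = (df <= 1).stack()  # long-form boolean Series indexed by (row, col) pairs
mask = mask[mask]         # keep only the True pairs

print(list(mask.index))   # [('a', 'a'), ('b', 'b')]
```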

Finally, I take the second level of lt_delta's index (the neighbor timestamps) and look up the corresponding values in ts

pd.Series(ts.loc[lt_delta.index.get_level_values(1)].values, lt_delta.index)

I return a groupby object so the result looks and feels like calling rolling. Wrapped in a function, it looks like this:

Super Function

def groupbydelta(ts, delta):
    tidx = ts.index
    # pairwise absolute time differences between every pair of stamps
    distances = pd.DataFrame(np.subtract.outer(tidx, tidx), tidx, tidx).abs()

    # keep only the (point, neighbor) pairs within delta of each other
    lt_delta = (distances <= delta).stack()
    lt_delta = lt_delta[lt_delta]

    # neighbor values, keyed by (point, neighbor)
    closest = pd.Series(ts.loc[lt_delta.index.get_level_values(1)].values, lt_delta.index)

    return closest.groupby(level=0)

Let's test it out. I'll use delta=pd.Timedelta(1, 'm') (that's one minute). For the time series I created, each date time index should be grouped with itself, the minute prior, and the minute after. That should be equivalent to ts.rolling(3, center=True), except at the edges. I'll do both and compare.

gbdelta = groupbydelta(ts, pd.Timedelta(1, 'm')).mean()
rolling = ts.rolling(3, center=True).mean()

pd.concat([gbdelta, rolling], axis=1, keys=['Delta', 'Rolling']).head()
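In case the screenshot doesn't render, the agreement can also be checked programmatically (restating the setup so the snippet stands alone):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 20
data = np.random.rand(n)
tidx = pd.date_range('2014-05-01', periods=n, freq='T')
ts = pd.Series(data, tidx)

def groupbydelta(ts, delta):
    idx = ts.index
    # pairwise absolute time differences
    distances = pd.DataFrame(np.subtract.outer(idx.values, idx.values), idx, idx).abs()
    # keep only the (point, neighbor) pairs within delta
    lt_delta = (distances <= delta).stack()
    lt_delta = lt_delta[lt_delta]
    closest = pd.Series(ts.loc[lt_delta.index.get_level_values(1)].values, lt_delta.index)
    return closest.groupby(level=0)

gbdelta = groupbydelta(ts, pd.Timedelta(1, 'min')).mean()
rolling = ts.rolling(3, center=True).mean()

# away from the edges, the two approaches agree
assert np.allclose(gbdelta.iloc[1:-1], rolling.iloc[1:-1])
```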

[screenshot: head of the concatenated Delta and Rolling columns, with NaN in Rolling's first row]

That looks great! The difference between the two is that rolling has NaN at the edges, while gbdelta doesn't require a specific number of elements; that was by design.
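For this regularly spaced case you can also coax rolling into matching at the edges with min_periods (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
ts = pd.Series(np.random.rand(20),
               pd.date_range('2014-05-01', periods=20, freq='T', name='Time'))

# min_periods=1 lets the edge windows shrink instead of producing NaN
rolling_mp = ts.rolling(3, center=True, min_periods=1).mean()

assert rolling_mp.notna().all()
assert np.isclose(rolling_mp.iloc[0], ts.iloc[:2].mean())
assert np.isclose(rolling_mp.iloc[-1], ts.iloc[-2:].mean())
```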

What about irregular indices?

np.random.seed([3,1415])
n = 7200
data = np.random.rand(n)
tidx = pd.Timestamp('2013-02-06') + np.random.rand(n) * pd.Timedelta(1, 'd')
irregular_series = pd.Series(data, tidx, name='Sketch').sort_index()
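A quick check that the spacing really is irregular (a standalone sketch of the same construction, using pd.to_timedelta for the offsets):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 7200
tidx = pd.Timestamp('2013-02-06') + pd.to_timedelta(np.random.rand(n), unit='d')
irregular_series = pd.Series(np.random.rand(n), tidx, name='Sketch').sort_index()

# consecutive gaps take many distinct values, unlike the minute grid earlier
gaps = irregular_series.index.to_series().diff().dropna()
assert gaps.nunique() > 1
assert (gaps >= pd.Timedelta(0)).all()
```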

And plot the irregular_series and some filtered versions based on closest neighbors.

[plot: irregular_series alongside neighbor-filtered versions]

But you asked for z-scores. First build the neighbor groups for the irregular series, then combine the grouped statistics with the original values (the five-minute window below is an illustrative choice):

gbirr = groupbydelta(irregular_series, pd.Timedelta(5, 'm'))
zd = (irregular_series - gbirr.mean()) / gbirr.std()

This z-scoring is a bit tricky: I had to find the grouped means and standard deviations, then use them with the original series. I'm still thinking about a smoother way, but this is smooth enough.
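One small economy, if two groupby passes feel clumsy: agg can compute both statistics in a single pass (a sketch on a toy grouped series, not the original code):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
s = pd.Series(np.random.rand(6), index=[0, 0, 1, 1, 2, 2])
grouped = s.groupby(level=0)

# one pass for both statistics instead of separate .mean() and .std() calls
stats = grouped.agg(['mean', 'std'])

assert np.allclose(stats['mean'], grouped.mean())
assert np.allclose(stats['std'], grouped.std())
```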

What does it look like?

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, sharey=True, figsize=[10, 5])
irregular_series.plot(style='.', ax=axes[0], title='Original')
zd.plot(style='.', ax=axes[1], title='Z-Scored')

[plot: original series (left) and z-scored series (right) as side-by-side scatter panels]


Finally, you asked about the z-score for your data example. To ensure I got the right answer...

gbd = groupbydelta(ts, pd.Timedelta(1, 'm'))

ts.sub(gbd.mean()).div(gbd.std())

Time
2014-05-01 00:00:00    0.707107
2014-05-01 00:01:00   -0.752435
2014-05-01 00:02:00    0.866662
2014-05-01 00:03:00   -0.576136
2014-05-01 00:04:00   -0.580471
2014-05-01 00:05:00   -0.253403
2014-05-01 00:06:00   -0.076657
2014-05-01 00:07:00    1.054413
2014-05-01 00:08:00    0.095783
2014-05-01 00:09:00   -1.030982
2014-05-01 00:10:00    1.041127
2014-05-01 00:11:00   -1.028084
2014-05-01 00:12:00    0.198363
2014-05-01 00:13:00    0.851951
2014-05-01 00:14:00   -1.152701
2014-05-01 00:15:00    1.070238
2014-05-01 00:16:00   -0.395849
2014-05-01 00:17:00   -0.968585
2014-05-01 00:18:00    0.077004
2014-05-01 00:19:00    0.707107
Freq: T, dtype: float64
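As a sanity check on the edge values above: a point whose window contains exactly two observations always gets a sample z-score of magnitude 1/sqrt(2) ≈ 0.707107, which is exactly what the first and last rows show. A minimal sketch:

```python
import numpy as np

# any two distinct values produce z-scores of ±1/sqrt(2) with ddof=1
x = np.array([0.3, 0.9])
z = (x - x.mean()) / x.std(ddof=1)

assert np.allclose(np.abs(z), 1 / np.sqrt(2))
```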