Tim Tröndle - 2 months ago
Python Question

Converting irregularly time stamped measurements into equally spaced, time-weighted averages

I have a series of measurements that are time-stamped and irregularly spaced. Values in the series always represent changes of the measurement, i.e. without a change there is no new value. A simple example of such a series would be:

23:00:00.100 10
23:00:01.200 8
23:00:01.600 0
23:00:06.300 4


What I want to obtain is an equally spaced series of time-weighted averages. For the given example I might aim at a frequency of one second and hence a result like the following:

23:00:01 NaN ( the first 100ms are missing )
23:00:02 5.2 ( 10*0.2 + 8*0.4 + 0*0.4 )
23:00:03 0
23:00:04 0
23:00:05 0
23:00:06 2.8 ( 0*0.3 + 4*0.7 )


I am looking for a Python library that solves this problem. It seems like a standard problem to me, but so far I couldn't find this functionality in standard libraries like pandas.

The algorithm needs to take two things into account:


  • time-weighted averaging

  • considering the last value before the current interval ( which may lie several intervals back ) when forming the average
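The two requirements above can be sketched in plain Python. This is only an illustration under assumptions: time stamps are given as seconds since 23:00:00, each value holds until the next change, and `time_weighted_mean` is a hypothetical helper name, not a library function.

```python
def time_weighted_mean(times, values, start, end):
    """Average of a piecewise-constant signal over [start, end).

    times/values: sorted change points (in seconds) and the value that holds
    from each change point until the next one. Returns None when the signal
    is not yet known at the start of the interval.
    """
    if not times or times[0] > start:
        return None  # no value known at the beginning of the interval
    total = 0.0
    for i, (t, v) in enumerate(zip(times, values)):
        t_next = times[i + 1] if i + 1 < len(times) else float("inf")
        lo = max(t, start)    # part of this segment inside the interval
        hi = min(t_next, end)
        if hi > lo:
            total += v * (hi - lo)
    return total / (end - start)


# The example series, as seconds since 23:00:00 (an assumed encoding).
times = [0.1, 1.2, 1.6, 6.3]
values = [10, 8, 0, 4]

second_2 = time_weighted_mean(times, values, 1.0, 2.0)  # 10*0.2 + 8*0.4 + 0*0.4
second_1 = time_weighted_mean(times, values, 0.0, 1.0)  # first 100 ms unknown
```

The value 10 set at 23:00:00.100 still contributes to the second starting at 23:00:01, which is exactly the "value from before the interval" case.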



Using pandas



data.resample('s').mean().ffill() # forming a series of seconds


does part of the work. Providing a user-defined aggregation function allows forming time-weighted averages, but because the value in effect at the beginning of each interval is ignored, that average is incorrect too. Even worse: the holes in the series are filled with the preceding average values, so in the example above seconds 3, 4 and 5 come out non-zero.
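A small check of that behaviour on the example data, assuming a current pandas in which `resample` returns an object that is aggregated explicitly (the date is arbitrary):

```python
import pandas as pd

idx = pd.to_datetime([
    "2013-04-11 23:00:00.100",
    "2013-04-11 23:00:01.200",
    "2013-04-11 23:00:01.600",
    "2013-04-11 23:00:06.300",
])
raw = pd.Series([10.0, 8.0, 0.0, 4.0], index=idx)

# Plain (unweighted) mean per second, then pad the seconds without samples.
naive = raw.resample("s").mean().ffill()
print(naive)
```

The bin starting at 23:00:01 gets the unweighted mean of 8 and 0, and the following empty seconds inherit that 4 although the signal is 0 there.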

data = data.resample('ms').ffill() # forming a series of milliseconds
data.resample('s').mean()


does the trick up to a certain accuracy, but is, depending on that accuracy, very expensive. In my case, too expensive.

Edit: Solution



import pandas as pa
import numpy as np
from datetime import datetime
from datetime import timedelta

time_stamps = [datetime(2013, 4, 11, 23, 0, 0, 100000),
               datetime(2013, 4, 11, 23, 0, 1, 200000),
               datetime(2013, 4, 11, 23, 0, 1, 600000),
               datetime(2013, 4, 11, 23, 0, 6, 300000)]
values = [10, 8, 0, 4]
raw = pa.Series(index=time_stamps, data=values)

def round_down_to_second(dt):
    return datetime(year=dt.year, month=dt.month, day=dt.day,
                    hour=dt.hour, minute=dt.minute, second=dt.second)

def round_up_to_second(dt):
    return round_down_to_second(dt) + timedelta(seconds=1)

def time_weighted_average(data):
    # Weight each value by the time until the next sample; the last value
    # in a bin lasts until the end of its second.
    end = pa.DatetimeIndex([round_up_to_second(data.index[-1])])
    return np.average(data, weights=np.diff(data.index.append(end).asi8))

start = round_down_to_second(time_stamps[0])
end = round_down_to_second(time_stamps[-1])
seconds = pa.date_range(start, end, freq='s')
# Add every full second to the index and carry the last known value forward.
data = raw.reindex(raw.index.union(seconds))
data = data.ffill()

data = data.resample('s').apply(time_weighted_average)

Answer

Here's a go at a solution; it may need some tweaking to meet your requirements.

Add the seconds to your index and fill forwards:

# df1 holds the raw measurements: a DatetimeIndex and a single 'value' column
tees = pd.DatetimeIndex([datetime(2000, 1, 1, 23, 0, n) for n in range(8)])
df2 = df1.reindex(df1.index.union(tees))
df2['value'] = df2.value.ffill()

In [14]: df2
Out[14]:
                            value
2000-01-01 23:00:00           NaN
2000-01-01 23:00:00.100000     10
2000-01-01 23:00:01            10
2000-01-01 23:00:01.200000      8
2000-01-01 23:00:01.600000      0
2000-01-01 23:00:02             0
2000-01-01 23:00:03             0
2000-01-01 23:00:04             0
2000-01-01 23:00:05             0
2000-01-01 23:00:06             0
2000-01-01 23:00:06.300000      4
2000-01-01 23:00:07             4

Take the time difference (using shift) until the next value, and multiply (value * seconds):

df3 = df2.reset_index()  # the timestamps become the 'index' column
df3['difference'] = df3['index'].shift(-1) - df3['index']
# seconds until the next row times the value; NaT and NaN propagate as NaN
df3['tot'] = df3['difference'].dt.total_seconds() * df3['value']

In [17]: df3
Out[17]:
                        index  value      difference  tot
0         2000-01-01 23:00:00    NaN 00:00:00.100000  NaN
1  2000-01-01 23:00:00.100000     10 00:00:00.900000  9.0
2         2000-01-01 23:00:01     10 00:00:00.200000  2.0
3  2000-01-01 23:00:01.200000      8 00:00:00.400000  3.2
4  2000-01-01 23:00:01.600000      0 00:00:00.400000  0.0
5         2000-01-01 23:00:02      0        00:00:01  0.0
6         2000-01-01 23:00:03      0        00:00:01  0.0
7         2000-01-01 23:00:04      0        00:00:01  0.0
8         2000-01-01 23:00:05      0        00:00:01  0.0
9         2000-01-01 23:00:06      0 00:00:00.300000  0.0
10 2000-01-01 23:00:06.300000      4 00:00:00.700000  2.8
11        2000-01-01 23:00:07      4             NaT  NaN

Then do the resample to seconds (sum the value*seconds):

In [18]: df3.set_index('index')['tot'].resample('s').sum(min_count=1)
Out[18]:
index
2000-01-01 23:00:00    9.0
2000-01-01 23:00:01    5.2
2000-01-01 23:00:02    0.0
2000-01-01 23:00:03    0.0
2000-01-01 23:00:04    0.0
2000-01-01 23:00:05    0.0
2000-01-01 23:00:06    2.8
2000-01-01 23:00:07    NaN
Freq: S, dtype: float64

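Note: The end points need some coercing. sum is being clever and ignoring the NaN, so the first second reports 9.0 even though its first 100 ms are missing, and the trailing 23:00:07 row should be dropped.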
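The fill, weight and sum steps can also be done without adding rows, by integrating the step function once and differencing the integral at the second boundaries. The following is a sketch of that alternative, not the method above: `time_weighted_resample` is a hypothetical helper, bins are labelled by their left edge, and a bin that starts before the first sample is reported as NaN.

```python
import numpy as np
import pandas as pd


def time_weighted_resample(series, freq="s"):
    """Time-weighted mean of a step-wise series on a regular grid (sketch)."""
    edges = pd.date_range(series.index[0].floor(freq),
                          series.index[-1].ceil(freq), freq=freq)
    origin = edges[0]
    t = (series.index - origin).total_seconds().to_numpy()  # sample times
    e = (edges - origin).total_seconds().to_numpy()         # bin edges
    v = series.to_numpy(dtype=float)

    # Integral of the step function from t[0] up to each change point.
    cum = np.concatenate(([0.0], np.cumsum(v[:-1] * np.diff(t))))

    # Index of the value in force at each edge (-1: before the first sample).
    k = np.searchsorted(t, e, side="right") - 1
    known = k >= 0
    kk = np.clip(k, 0, None)
    integral = np.where(known, cum[kk] + v[kk] * (e - t[kk]), np.nan)

    # Mean over each bin = change of the integral / bin width.
    means = np.diff(integral) / np.diff(e)
    return pd.Series(means, index=edges[:-1])


idx = pd.to_datetime([
    "2000-01-01 23:00:00.100",
    "2000-01-01 23:00:01.200",
    "2000-01-01 23:00:01.600",
    "2000-01-01 23:00:06.300",
])
raw = pd.Series([10, 8, 0, 4], index=idx)
result = time_weighted_resample(raw)
print(result)
```

Because the first bin starts before any value is known, it comes out as NaN instead of 9.0, which takes care of the end point without a separate fix-up.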
