keynesiancross keynesiancross - 5 months ago 40
Python Question

Python Pandas - Moving Average with uneven period lengths

I'm trying to figure out how to deal with time series data in pandas that has uneven period lengths. The first example I'm looking at is how to calculate a moving average for the last 15 days. Here is an example of the data (time is UTC)

index date_time data
46701 1/06/2016 19:27 15.00
46702 1/06/2016 19:28 18.25
46703 1/06/2016 19:30 16.50
46704 1/06/2016 19:33 17.20
46705 1/06/2016 19:34 18.18


I'm not sure if I should just fill in data so its all even 1 minute increments, or if there is a smarter way... If anyone has suggestions it would be much appreciated

Thanks - KC

Answer

You can do something like this.

  • Resample at the frequency you want (or downsampling)
    • You have to pay attention here to the resampling strategy. It has to be consistent with the meaning of your data. Here I have arbitrary used bfill (back fill that use next valid value) but another strategy could be more appropriate like ffill (forward fill that propagates the last valid value).
  • Compute a moving average.
  • Maybe you will have to deal with the index

Note: This syntax for rolling has been introduced in pandas 0.18.0. However it is possible to do the same thing in previous version with pd.rolling_mean.

# Test data
d = {'data': [15.0, 18.25, 16.5, 17.199999999999999, 18.18],
 'date_time': ['1/06/2016 19:27',
  '1/06/2016 19:28',
  '1/06/2016 19:30',
  '1/06/2016 19:33',
  '1/06/2016 19:34'],
 'index': [46701, 46702, 46703, 46704, 46705]}

df = DataFrame(d)
df['date_time'] = pd.to_datetime(df['date_time'])

# Setting the date as the index
df.set_index('date_time', inplace=True)
# Resampling data
df = df.resample('1T').bfill()
# Performing moving average
df['moving'] = df['data'].rolling(window=3, center=True).mean()
df.plot(y=['data', 'moving'])
df
                      data  index     moving
date_time                                   
2016-01-06 19:27:00  15.00  46701        NaN
2016-01-06 19:28:00  18.25  46702  16.583333
2016-01-06 19:29:00  16.50  46703  17.083333
2016-01-06 19:30:00  16.50  46703  16.733333
2016-01-06 19:31:00  17.20  46704  16.966667
2016-01-06 19:32:00  17.20  46704  17.200000
2016-01-06 19:33:00  17.20  46704  17.526667
2016-01-06 19:34:00  18.18  46705        NaN

plot

Edit

Here is an example with missing data.

# Random data parameters
num_sample = (0, 100)
nb_sample = 1000
start_date = '2016-06-02'
freq = '2T'

random_state = np.random.RandomState(0)

# Generating random data
df = pd.DataFrame({'data': random_state.randint(num_sample[0], num_sample[1], nb_sample)},
                          index=random_state.choice(
                              pd.date_range(start=pd.to_datetime(start_date), periods=nb_sample * 3,
                                            freq=freq),
                              nb_sample))
# Removing duplicate index
df = df.groupby(df.index).first()
# Removing data for closed periods
df.loc[(df.index.hour >= 22) | (df.index.hour <= 7),'data'] = np.nan
# Resampling
df = df.resample('1T').ffill()
# Moving average by hours
df['avg'] = df['data'].rolling(window=60).mean()

ax = df.plot(kind='line', subplots=True)

enter image description here

Comments