piRSquared - 1 month ago 8
Python Question

# daily data, resample every 3 days, calculate over trailing 5 days efficiently

consider the

`df`

``````tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df
``````

I want to calculate the sum over a trailing 5 days, every 3 days.

I expect something that looks like this

this was edited

what I had was incorrect. @ivan_pozdeev and @boud noticed this was a centered window and that was not my intention. Appologies for the confusion.

everyone's solutions capture much of what I was after.

criteria

• I'm looking for smart efficient solutions that can be scaled to large data sets.

• I'll be timing solutions and also considering elegance.

• Solutions should also be generalizable for a variety of sample and look back frequencies.

• I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.

• for the sample above, the look back is
`5D`
and there may be 4 or 50 observations that fall within that look back.

• I want the timestamp to be the last observed timestamp within the look back period.

Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.

Scenario #1 : Multiple entries per date, but no missing dates

Approach #1 :

``````# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
# Extract the index names and values
vals = df.A.values
indx = df.index.values

# Extract IDs for bin based summing
search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)

# Perform bin based summing and subtract the repeated ones
IDsums = np.bincount(id_arr,vals)
allsums = IDsums[:-1] + IDsums[1:]
allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
``````

Approach #2 :

``````# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
# Extract the index names and values
indx = df.index.values

# Extract IDs for bin based summing

# Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern
# Pad with 0 and length of array at either ends as we use diff later on
shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

# Find the start of those shifting indices
# Generate ID based on shifts and do bin based summing of dataframe
shifts = np.searchsorted(date_id,search_id)
reps = shifts[1:] - shifts[:-1]
id_arr = np.repeat(np.arange(len(reps)),reps)
IDsums = np.bincount(id_arr,df.A.values)

# Sum each group of 3 elems with a stride of 2, make dataframe if needed
allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]

# Convert to pandas dataframe if needed
out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
return pd.DataFrame(allsums,index=out_index,columns=['A'])
``````

Approach #3 :

``````def vectorized_app3(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
``````

We could replace the convolution part with direct sliced summation for a modified version of it -

``````def vectorized_app3_v2(df, S=3, W=5):
dt = df.index.values
shifts = np.append(False,dt[1:] > dt[:-1])
c = np.bincount(shifts.cumsum(),df.A.values)
f = c.size+S-W
out = c[:f:S].copy()
for i in range(1,W):
out += c[i:f+i:S]
out_index = dt[np.nonzero(shifts)[0][W-2::S]]
return pd.DataFrame(out,index=out_index,columns=['A'])
``````

Scenario #2 : Multiple entries per date and missing dates

Approach #4 :

``````def vectorized_app4(df, S=3, W=5):
dt = df.index.values
indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
WL = ((indx[-1]+1)//S)
c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
freq_str = str(S)+'D'
grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
return pd.DataFrame(out,index=out_index,columns=['A'])
``````

Scenario #3 : Consecutive dates and exactly one entry per date

Approach #5 :

``````def vectorized_app5(df, S=3, W=5):
vals = df.A.values
N = (df.shape[0]-W+2*S-1)//S
n = vals.strides[0]
out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
strides=(S*n,n)).sum(1)
index_idx = (W-1)+S*np.arange(N)
out_index = df.index[index_idx]
return pd.DataFrame(out,index=out_index,columns=['A'])
``````

Suggestions for creating test-data

Scenario #1 :

``````# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])
``````

Scenario #2 :

To create setup for multiple dates and with missing dates, we could just edit the `df_data` creation step, like so -

``````df_data = np.random.randint(0,9,(len(idx0)))
``````

Scenario #3 :

``````# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
``````