Edit: I condensed this question given that it was probably too involved to begin with. The meat of the question is in bold below.
I'd like to know more about the object that is actually created when using
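def rwindows(a, window):
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    # number of windows, window length, number of columns
    shape = a.shape[0] - window + 1, window, a.shape[-1]
    # step one row between windows, then reuse the array's own strides
    strides = (a.strides[0],) + a.strides
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)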
# as_strided version
a = np.arange(5)
print(rwindows(a, 3)) # 1d input
[[0 1 2]
 [1 2 3]
 [2 3 4]]
b = np.arange(10).reshape(5,2)
print(rwindows(b, 4)) # 2d input
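For comparison, here is a minimal sketch of the same windowing built on `np.lib.stride_tricks.sliding_window_view`, assuming NumPy 1.20 or later. This is not what the snippet above uses; the helper name `rwindows_view` is hypothetical, and it only handles 1d and 2d input. The transpose moves the window axis to the middle so the layout matches the as_strided version.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rwindows_view(a, window):
    """Rolling windows along axis 0 (hypothetical helper, 1d/2d input only)."""
    a = np.asarray(a)
    # sliding_window_view appends the window axis last
    view = sliding_window_view(a, window, axis=0)
    if a.ndim == 1:
        return view                   # shape (n - window + 1, window)
    return view.transpose(0, 2, 1)    # shape (n - window + 1, window, n_cols)

a = np.arange(5)
print(rwindows_view(a, 3))            # 1d input

b = np.arange(10).reshape(5, 2)
print(rwindows_view(b, 4).shape)      # 2d input: windows of four rows each
```

Like `as_strided`, this returns a view on the original memory rather than a copy, so it should be treated as read-only.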
produce a single value from an ndarray input *args and **kwargs are
passed to the function
def prod(a, b):
    return a * b
df.rolling(3).apply(prod, args=((df + 2).rolling(3),))
TypeError: unsupported operand type(s) for *: 'float' and 'Rolling'
I suggest you look at the source code to get into the nitty-gritty of what rolling does. In particular, look at the
rolling functions in generic.py and window.py. From there you can have a look at the
Window class which is used if you specify a window type or the default
Rolling class. The last one inherits from
_Rolling_and_Expanding and ultimately
That said, I'll give my two cents: Pandas' whole rolling mechanism relies on the numpy function
apply_along_axis. In particular it is used here in pandas. It is used in conjunction with the
windows.pyx cython module. In goes your series, out comes the aggregated rolling window. For typical aggregation functions it handles them for you efficiently, but for custom ones (using
apply()) it uses a
The rolling function in pandas operates on pandas DataFrame columns independently. It is not a Python iterator, and it is lazily evaluated, meaning nothing is computed until you apply an aggregation function to it. The functions that actually apply the rolling window to the data aren't called until right before an aggregation is done.
A source of confusion might be that you're thinking of the rolling object as a dataframe. (You have named the rolling object
df in your last code snippet). It really isn't. It is an object which can produce dataframes by applying aggregations over the window logic it houses.
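A small sketch of that distinction, assuming a reasonably recent pandas: the rolling object is its own type, arithmetic on it is not defined, and only an aggregation turns it into a DataFrame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})
rol = df.rolling(3)                    # nothing is aggregated yet

print(type(rol).__name__)              # the window object's class name
print(isinstance(rol, pd.DataFrame))   # it is not a DataFrame

# Arithmetic on the rolling object itself is not defined
try:
    rol + 2
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True
print(arithmetic_failed)

result = rol.sum()                     # the aggregation produces a DataFrame
print(type(result).__name__)
```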
The lambda you are supplying is applied for each cell of your new dataframe. It takes a window backwards (along each column) in your old dataframe, and it aggregates it to one single cell in the new dataframe. The aggregation can be things like
mean, something custom you've made, etc., over some window size, say 3. Here are some examples:
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
df.rolling(3).mean().dropna()
... which can also be done by:
... and produces:
     a
2  1.0
3  2.0
4  3.0
(The first column is the index value and can be ignored here, and for the next examples.)
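As a quick check, here is a sketch comparing the built-in aggregation with the same aggregation routed through `apply`. It assumes a pandas version where `apply` accepts `raw=True` (0.23 or later, as far as I know), which hands each window to the function as a plain NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

built_in = df.rolling(3).mean().dropna()
via_apply = df.rolling(3).apply(np.mean, raw=True).dropna()

# Both routes should agree up to floating point error
print(np.allclose(built_in, via_apply))
```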
Notice how we supplied an existing numpy aggregation function. That's the idea. We're supposed to be able to supply anything we want as long as it conforms to what aggregation functions do, i.e., take a vector of values and produce a single value from it. Here is another one where we create a custom aggregation function, in this case the L2 norm of the window:
df.rolling(3).apply(lambda x: np.sqrt(x.dot(x))).dropna()
If you're not familiar with lambda functions, this is the same as:
def euclidean_dist(x):
    return np.sqrt(x.dot(x))

df.rolling(3).apply(euclidean_dist).dropna()

          a
2  2.236068
3  3.741657
4  5.385165
Just to make sure, we can manually check that
np.sqrt(0**2 + 1**2 + 2**2) is indeed 2.236068.
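The same manual check can be done for every window at once. This is a sketch that assumes `raw=True` is available, so the lambda receives a NumPy array; the hand-rolled slices are only there for verification.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})
norms = df.rolling(3).apply(lambda x: np.sqrt(x.dot(x)), raw=True).dropna()

# Recompute each window's L2 norm by hand from the original values
values = df["a"].to_numpy()
manual = [np.sqrt(np.sum(values[i:i + 3] ** 2)) for i in range(3)]

print(np.allclose(norms["a"], manual))
```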
[In your original edit, in the] last code snippet, your code is probably failing earlier than you expect. It is failing before the invocation of
df.apply(...). You are trying to add a rolling object named
df to the number 2 before it is passed to
df.apply(...). The rolling object isn't something you do operations on. The aggregation function you have supplied also doesn't conform to an aggregation function in general. The
a is an array with the values of a window, and
b would be a constant extra parameter you pass in. It can be a rolling object if you want, but that wouldn't typically be something you would want to do. To make it clearer, here is something similar to what you were doing in your original edit, but that works:
a = np.arange(8)
df = pd.DataFrame(a, columns=['a'])
n = 4
rol = df.rolling(n)

def prod(window_list, constant_rol):
    return window_list.dot(constant_rol.sum().dropna().head(n))

rol.apply(prod, args=(rol,)).dropna()
# [92.0, 140.0, 188.0, 236.0, 284.0]
It is a contrived example, but I'm showing it to make the point that you can pass in whatever you want as a constant, even the rolling object you are using itself. The dynamic part is the first argument
a in your case or
window_list in my case. All defined windows, in the form of individual lists, are passed into that function one by one.
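A less contrived sketch of the same idea: the first argument changes with every window, while anything passed through `args` stays fixed. The `weights` array and the `weighted_sum` name are hypothetical, and `raw=True` is assumed to be available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(8, dtype=float)})
weights = np.array([0.2, 0.3, 0.5])   # hypothetical constant, one weight per window slot

def weighted_sum(window, w):
    # `window` is a new array on every call; `w` is the same object each time
    return float(np.dot(window, w))

out = df["a"].rolling(3).apply(weighted_sum, args=(weights,), raw=True).dropna()
print(out.head())
```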
Based on your followup comments this might be what you're looking for:
import numpy as np
import pandas as pd

n = 3
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])

def keep(window, windows):
    windows.append(window.copy())
    return window[-1]

windows = list()
df['a'].rolling(n).apply(keep, args=(windows,))
df = df.tail(n)
df['a_window'] = windows
which adds arrays/vectors to each rolling block thus producing:
   a         a_window
2  2  [0.0, 1.0, 2.0]
3  3  [1.0, 2.0, 3.0]
4  4  [2.0, 3.0, 4.0]
Note that it only works if you do it on a column at a time. If you want to do some math on the window before you store it away in
keep that is fine too.
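On newer pandas versions the window passed to the function is a Series unless `raw=True` is given, so here is a variant of the same trick under that assumption. It also slices with `iloc[n - 1:]` instead of `tail(n)`, because `tail(n)` only lines up above since five rows and n = 3 happen to give exactly three full windows.

```python
import numpy as np
import pandas as pd

n = 3
df = pd.DataFrame({"a": np.arange(5)})

windows = []

def keep(window, store):
    # With raw=True the window is a NumPy array; copy it so the stored
    # values do not alias the buffer pandas is iterating over.
    store.append(window.copy().tolist())
    return window[-1]

df["a"].rolling(n).apply(keep, args=(windows,), raw=True)

out = df.iloc[n - 1:].copy()   # rows that have a full window behind them
out["a_window"] = windows
print(out)
```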
That said, without more input on exactly what you are trying to achieve it is hard to construct an example which suits your needs.
If your ultimate goal is to create a dataframe of lagging variables, then I'd go for real columns built with shift():
import numpy as np
import pandas as pd

a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
for i in range(1, 3):
    df['a-%s' % i] = df['a'].shift(i)
df.dropna()

   a  a-1  a-2
2  2  1.0  0.0
3  3  2.0  1.0
4  4  3.0  2.0
(There might be some more beautiful way of doing it, but it gets the job done.)
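One possible tidier form, as a sketch: build the lag columns in a dict comprehension and attach them with `assign`, which leaves the original frame untouched. The column naming simply mirrors the loop above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# One shifted copy of the column per lag, keyed by the new column name
lags = {f"a-{i}": df["a"].shift(i) for i in range(1, 3)}
lagged = df.assign(**lags).dropna()
print(lagged)
```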
Regarding your variable
b in your first code snippet, remember that DataFrames in pandas aren't typically handled as tensors of arbitrary dimensions or objects. You can probably stuff whatever you want into one, but ultimately strings, time objects, ints and floats are what is expected. That might be the reason the designers of pandas haven't bothered with allowing rolling aggregation to non-scalar values. It doesn't even seem like a simple string is allowed as the output of the aggregation function.
Anyway, I hope this answers some of your questions. If not, let me know, and I'll try to help you out in the comments or in an update.
Final note on the
_create_blocks() function of rolling objects.
The _create_blocks() function handles the reindexing and binning when you use the
freq argument of rolling.
If you use freq with, say, weeks, such that
import numpy as np
import pandas as pd

a = np.arange(50)
df = pd.DataFrame(a, columns=['a'])
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(4, freq='W')._create_blocks(how=None)
for b in blocks:
    print(b)
... then we get the binned (not rolling) original data week-by-week:
               a
a
2016-01-03   2.0
2016-01-10   9.0
2016-01-17  16.0
2016-01-24  23.0
2016-01-31  30.0
2016-02-07  37.0
2016-02-14  44.0
2016-02-21   NaN
Notice that this isn't the output of the aggregated rolling. These are simply the new blocks it works on. After this, we do an aggregation like
sum and get:
                a
a
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24   50.0
2016-01-31   78.0
2016-02-07  106.0
2016-02-14  134.0
2016-02-21    NaN
... which checks out with a test summation: 50 = 2 + 9 + 16 + 23.
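As far as I know, later pandas versions removed the freq argument from rolling, and `_create_blocks()` is a private method that may not exist in the same form. A rough sketch of the same two steps with public calls, under that assumption, is to reindex to weekly frequency with `asfreq` first and then roll. Unlike the output above, `asfreq` stops at the last Sunday inside the data, so the trailing all-NaN row does not appear.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(50)},
                  index=pd.date_range("2016-01-01", periods=50, freq="D"))

weekly = df.asfreq("W")            # keep only the rows that fall on a Sunday
rolled = weekly.rolling(4).sum()   # rolling sum over four weekly rows

print(weekly)
print(rolled)
```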
If you don't use
freq as an argument it simply returns the original data structure:
import numpy as np
import pandas as pd

a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
blocks, obj, index = df.rolling(3)._create_blocks(how=None)
for b in blocks:
    print(b)
... which produces ...
            a
a
2016-01-01  0
2016-01-02  1
2016-01-03  2
2016-01-04  3
2016-01-05  4
... and is used for rolling window aggregation.