Gilberto Gilberto - 1 month ago 8
Python Question

How to get pandas dataframe where columns are the subsequent n-elements from another column dataframe?

A very simple example just for understanding.

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df
A
0 1
1 2
2 13
3 14
4 25
5 26
6 37
8 38


Set
n = 3


First example



How to get a new dataframe
df1
(in an efficient way), like the following:

D1 D2 D3 T
0 1 2 13 14
1 2 13 14 25
2 13 14 25 26
3 14 25 26 37
4 25 26 37 38


Hint: think at the first n-columns as the data (Dx) and the last columns as the target (T). In the 1st example the target (e.g 25) depends on the preceding n-elements (2, 13, 14).

Second example



What if the target is some element ahead (e.g.+3)?

D1 D2 D3 T
0 1 2 13 26
1 2 13 14 37
2 13 14 25 38


Thank you for your help,

Gilberto

P.S. If you think that the title can be improved, please suggest me how to modify it.

Answer

Let's see how we can solve it with NumPy tools. So, let's imagine you have the column data as a NumPy array, let's call it a. For such sliding windowed operations, we have a very efficient tool in NumPy as strides, as they are views into the input array without actually making copies.

Let's directly use the methods with the sample data and start with case #1 -

In [29]: a  # Input data
Out[29]: array([ 1,  2, 13, 14, 25, 26, 37, 38])

In [30]: m = a.strides[0] # Get strides

In [31]: n = 3 # parameter

In [32]: nrows = a.size - n # Get number of rows in o/p

In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))

In [34]: a2D
Out[34]: 
array([[ 1,  2, 13, 14],
       [ 2, 13, 14, 25],
       [13, 14, 25, 26],
       [14, 25, 26, 37],
       [25, 26, 37, 38]])

In [35]: np.may_share_memory(a,a2D) 
Out[35]: True    # a2D is a view into a

Case #2 would be similar with an additional parameter for the Target column -

In [36]: n2 = 3 # Additional param

In [37]: nrows = a.size - n - n2 + 1

In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))

In [39]: part1 # These are D1, D2, D3, etc.
Out[39]: 
array([[ 1,  2, 13],
       [ 2, 13, 14],
       [13, 14, 25]])

In [43]: part2 = a[n+n2-1:] # This is target col

In [44]: part2
Out[44]: array([26, 37, 38])