Christie Haskell Christie Haskell - 3 years ago 114
Python Question

Transform sequence into array of sequences

I have a pyspark dataframe where there are 30 observations per unique ID like so:

id time features
1 0 [1,2,3]
1 1 [4,5,6]
.. .. ..
1 29 [7,8,9]
2 0 [0,1,2]
2 1 [3,4,5]
.. .. ..
2 29 [6,7,8]
.. .. ..


What I need to do is create an array of sequences to feed into a keras neural network. So, for example let's say I have the following smaller dataset for one id:

id time features
1 0 [1,2,3]
1 1 [4,5,6]
1 2 [7,8,9]


The desired data format is:

[[[1,2,3]
[0,0,0]
[0,0,0]],
[[1,2,3],
[4,5,6],
[0,0,0]],
[[1,2,3],
[4,5,6],
[7,8,9]]]


I can use the
pad_sequences
function from the keras package to add the [0,0,0] rows so what I really need to be able to do is to create the following array for all ids.

[[[1,2,3]],
[[1,2,3],
[4,5,6]],
[[1,2,3],
[4,5,6],
[7,8,9]]]


The only way I can think to do it is with loops, something like this:

x = []
for i in range(10000):
user = x_train[i]
arr = []
for j in range(30):
arr.append(user[0:j])
x.append(arr)


A loop solution isn't feasible though. I have 904 batches of 10,000 unique ids each with 30 observations. I'm collecting one batch at a time into a numpy array so a numpy solution is fine. A pyspark solution using rdds would be awesome. Something using
map
perhaps?

Answer Source

Here is a numpy solution that creates the desired output including zeros. It uses triu_indices to create the "cumulative time series structure":

import numpy as np
from timeit import timeit

def time_series(nids, nsteps, features):
    f3d = np.reshape(features, (nids, nsteps, -1))
    f4d = np.zeros((nids, nsteps, nsteps, f3d.shape[-1]), f3d.dtype)
    i, j = np.triu_indices(nsteps)
    f4d[:, j, i, :] = f3d[:, i, :]
    return f4d

nids = 2
nsteps = 4
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))

print('small example', time_series(nids, nsteps, features))

nids = 10000
nsteps = 30
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))

print('time needed for big example {:6.4f} secs'.format(
    timeit(lambda: time_series(nids, nsteps, features), number=10)/10))

output:

small example [[[[76 53 48]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [61 90 69]]]


 [[[68 32 20]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [28 73 78]]]]
time needed for big example 0.2251 secs
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download