Shiraz Amod Shiraz Amod - 3 years ago 97
R Question

Group rows into rolling sets of 3 and combine each set into a single row

I have a data.frame which currently has one record per row, but I would like to transform it to have three records per row (to provide more trend data to a machine learning algorithm).

As an example, my data.frame currently looks like this (but with many more variables than just Rank and Speed):

Date | Participant | Ctry | Rank | Speed
----- |-------------|------|------|-------
17/01 | 1 | AU | 1 | 0.9
18/01 | 1 | AU | 4 | 0.6
19/01 | 1 | AU | 2 | 0.7
20/01 | 1 | AU | 1 | 0.4
17/01 | 2 | ZA | 5 | 0.3
18/01 | 2 | ZA | 3 | 0.5
19/01 | 2 | ZA | 4 | 0.6


I want to transform it to look like this (in rolling windows of 3 for each Participant):

StartDate | Participant | Ctry | Rank_1 | Rank_2 | Rank_3 | Speed_1 | Speed_2 | Speed_3
---------- | ----------- | ---- | ------ | ------ | ------ | ------- | ------- | -------
17/01 | 1 | AU | 1 | 4 | 2 | 0.9 | 0.6 | 0.7
18/01 | 1 | AU | 4 | 2 | 1 | 0.6 | 0.7 | 0.4
17/01 | 2 | ZA | 5 | 3 | 4 | 0.3 | 0.5 | 0.6


I could use nested
for
loops to create this data structure but I'm sure there is a much more efficient way of doing this. I have looked into reshape(2) and dplyr functions but can't find something that works for rolling windows with multiple variables.

Answer Source

The OP has requested to reshape the data from long form to a special kind of wide form where each row will contain three records, finally. E.g., there will be one row for Participant 1 which contains the values of 17/01, 18/01, and 19/01, and a second row which contains the values of 18/01, 19/01, and 20/01.

Note that this operation will add redundant data as some values may appear up to three times after reshaping. Also note that the OP has requested to reshape multiple value variables simultaneously. This is a feature which has been added to the recent versions of the data.table package.

Below is a solution using shift(), melt(), dcast(), rowid(), and join from the data.table package:

library(data.table)
# define number of records per row
n_recs <- 3L
# create sequences of dates to be included per row using shift() with multiple offsets,
# keep only complete sequences, add StartDate column for later dcast()
windows <- na.omit(DT[, shift(Date, seq_len(n_recs) - 1L, type = "lead"), by = Participant])[
  , StartDate := V1]
# reshape to long form for later join, 
# rename variables for automatic creation of column names in dcast()
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date")[
    , variable := stringi::stri_replace(variable, fixed = "V", "")]
# right join with original data to create additional rows,
# reshape from long to wide form using multiple value vars,
# reorder for convenience 
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]
   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6

Data

library(data.table)
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9   
  18/01 | 1           | AU   | 4    | 0.6   
  19/01 | 1           | AU   | 2    | 0.7   
  20/01 | 1           | AU   | 1    | 0.4   
  17/01 | 2           | ZA   | 5    | 0.3   
  18/01 | 2           | ZA   | 3    | 0.5   
  19/01 | 2           | ZA   | 4    | 0.6   ",
  sep = "|"
)

Edit

I've recognized that above code relies on the implicit assumption that there are at least as many records per participants as records should be combined. OP's sample data contain 4 rows for participant 1 and 3 rows for participant 2 so this condition is met.

However, in case of only one or two rows per participant, na.omit() will remove these participants completely from the final result. Perhaps, this might be desirable for OP's objectives. If not, the code needs to be modified as follows:

# create new sample data including cases with less than 3 records per participant
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9   
  18/01 | 1           | AU   | 4    | 0.6   
  19/01 | 1           | AU   | 2    | 0.7   
  20/01 | 1           | AU   | 1    | 0.4   
  17/01 | 2           | ZA   | 5    | 0.3   
  18/01 | 2           | ZA   | 3    | 0.5   
  19/01 | 2           | ZA   | 4    | 0.6   
  17/01 | 3           | DE   | 2    | 0.8,
  17/01 | 4           | DK   | 3    | 0.8,
  18/01 | 4           | DK   | 4    | 0.8",
  sep = "|"
) 

# modified code
n_recs <- 3L
min_rows <- 1L
windows <- DT[, lapply(shift(Date, seq_len(n_recs) - 1L, type = "lead"), 
                       head, n = pmax(.N - n_recs + 1L, min_rows)), 
              by = Participant][, StartDate := V1]
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date", 
             na.rm = TRUE)[
  , variable := stringi::stri_replace(variable, fixed = "V", "")]
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]
   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6
4:     17/01           3   DE      2     NA     NA    0.8,      NA      NA
5:     17/01           4   DK      3      4     NA    0.8,     0.8      NA

Please note the "incomplete" rows 4 and 5 due to the lack of input data for participants 3 and 4. However, it is assured that all participants appear in the final result.

This is achieved by explicitely limiting the number of rows created for each participant by using head() when computing windows. In addition, melt() must be called with parameter na.rm = TRUE now.

If min_rows is set to 0L then the incomplete rows 4 and 5 will vanish from the final result.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download