Thanh Quang - 10 months ago 85
R Question

# How to group in data.table with overlapping value?

i have a question relating to data.table in R.
I am working on an acceleration data that requires me to generate features from the raw data. I want to group data by each 2 second. It is easy by generating 1 more column to indicate groups for each 2 second and group with by.
However, i want do the overlapping windows. For example, my raw data is this

``````a=data.table(x = c(1:10), y= c(2:11), z = c(5), second=rep(c(1:5),each=2))
``````

x y z second

1: 1 2 5 1

2: 2 3 5 1

3: 3 4 5 2

4: 4 5 5 2

5: 5 6 5 3

6: 6 7 5 3

7: 7 8 5 4

8: 8 9 5 4

9: 9 10 5 5

10: 10 11 5 5

Now, i want to calculate the mean of x,y,z column by each 2 seconds. 1and2, 2 and 3, 3 and 4, 4 and 5.
I can run the for loops but since i have a huge dataset, it will take a long time. Do you know how do to it with just data table tools?
Thanks so much

As there is only 2 unique observations for 'second', we get the `lead` of the 'x', 'y', 'z' columns, grouped by 'second', `unlist` the Subset of Data.table and get the `mean`.

``````nm1 <- c("x", "y", "z")
na.omit(a[, paste0(nm1, 2)  := lapply(.SD, function(x) shift(x, 2,
type = "lead")), .SDcols = nm1])[, .(Mean = mean(unlist(.SD))),
.(second = paste0(second, "-", second + 1))]
#  second     Mean
#1:    1-2 3.666667
#2:    2-3 5.000000
#3:    3-4 6.333333
#4:    4-5 7.666667
``````

Or a slightly more compact option would be

``````library(dplyr)
cbind(a[second!= last(second)], a[second!= first(second)])[
,.(Mean = mean(unlist(.SD))), .(second = paste0(second, "-", second+1))]
#   second     Mean
#1:    1-2 3.666667
#2:    2-3 5.000000
#3:    3-4 6.333333
#4:    4-5 7.666667
``````

Or another option would be place them in a `list`, `rbind` the dataset, create a new 'id1' column, get the `mean` after `unlist`ing the `.SDcols` or we can get the individual `mean` of each column

``````dt1 <- rbindlist(list(a[second!= last(second)],
a[second!= first(second)]), idcol=TRUE)[, id1:= as.numeric(gl(.N, 2, .N)), .id][]
``````

Get the `mean` for each column by 'second'

``````dt1[, lapply(.SD, mean), .(second = paste0(id1, "-", id1 + 1)), .SDcols = x:z]
``````

Get the whole `mean` by 'second'

``````dt1[, mean(unlist(.SD)), .(second = paste0(id1, "-", id1 +1)), .SDcols = x:z]
``````