Thanh Quang Thanh Quang - 3 months ago 20
R Question

How to group in data.table with overlapping value?

i have a question relating to data.table in R.
I am working on an acceleration data that requires me to generate features from the raw data. I want to group data by each 2 second. It is easy by generating 1 more column to indicate groups for each 2 second and group with by.
However, i want do the overlapping windows. For example, my raw data is this

a=data.table(x = c(1:10), y= c(2:11), z = c(5), second=rep(c(1:5),each=2))



x y z second

1: 1 2 5 1

2: 2 3 5 1

3: 3 4 5 2

4: 4 5 5 2

5: 5 6 5 3

6: 6 7 5 3

7: 7 8 5 4

8: 8 9 5 4

9: 9 10 5 5

10: 10 11 5 5


Now, i want to calculate the mean of x,y,z column by each 2 seconds. 1and2, 2 and 3, 3 and 4, 4 and 5.
I can run the for loops but since i have a huge dataset, it will take a long time. Do you know how do to it with just data table tools?
Thanks so much

Answer

As there is only 2 unique observations for 'second', we get the lead of the 'x', 'y', 'z' columns, grouped by 'second', unlist the Subset of Data.table and get the mean.

nm1 <- c("x", "y", "z")
na.omit(a[, paste0(nm1, 2)  := lapply(.SD, function(x) shift(x, 2, 
     type = "lead")), .SDcols = nm1])[, .(Mean = mean(unlist(.SD))),
           .(second = paste0(second, "-", second + 1))]
#  second     Mean
#1:    1-2 3.666667
#2:    2-3 5.000000
#3:    3-4 6.333333
#4:    4-5 7.666667

Or a slightly more compact option would be

library(dplyr)
cbind(a[second!= last(second)], a[second!= first(second)])[
    ,.(Mean = mean(unlist(.SD))), .(second = paste0(second, "-", second+1))]
#   second     Mean
#1:    1-2 3.666667
#2:    2-3 5.000000
#3:    3-4 6.333333
#4:    4-5 7.666667

Or another option would be place them in a list, rbind the dataset, create a new 'id1' column, get the mean after unlisting the .SDcols or we can get the individual mean of each column

dt1 <- rbindlist(list(a[second!= last(second)], 
     a[second!= first(second)]), idcol=TRUE)[, id1:= as.numeric(gl(.N, 2, .N)), .id][]

Get the mean for each column by 'second'

dt1[, lapply(.SD, mean), .(second = paste0(id1, "-", id1 + 1)), .SDcols = x:z]

Get the whole mean by 'second'

dt1[, mean(unlist(.SD)), .(second = paste0(id1, "-", id1 +1)), .SDcols = x:z]
Comments