xiaodai xiaodai - 2 months ago 7
R Question

How to create a lag variable within each group?

I have a data.table:

set.seed(1)
data <- data.table(time = c(1:3, 1:4),
groups = c(rep(c("b", "a"), c(3, 4))),
value = rnorm(7))

data
# groups time value
# 1: b 1 -0.6264538
# 2: b 2 0.1836433
# 3: b 3 -0.8356286
# 4: a 1 1.5952808
# 5: a 2 0.3295078
# 6: a 3 -0.8204684
# 7: a 4 0.4874291


I want to compute a lagged version of the "value" column, within each level of "groups".

The result should look like

# groups time value lag.value
# 1 a 1 1.5952808 NA
# 2 a 2 0.3295078 1.5952808
# 3 a 3 -0.8204684 0.3295078
# 4 a 4 0.4874291 -0.8204684
# 5 b 1 -0.6264538 NA
# 6 b 2 0.1836433 -0.6264538
# 7 b 3 -0.8356286 0.1836433


I have tried to use
lag
directly:

data$lag.value <- lag(data$value)


...which clearly wouldn't work.

I have also tried:

unlist(tapply(data$value, data$groups, lag))
a1 a2 a3 a4 b1 b2 b3
NA -0.1162932 0.4420753 2.1505440 NA 0.5894583 -0.2890288


Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table?

Answer

You could do this within data.table

 library(data.table)
 data[, lag.value:=c(NA, value[-.N]), by=groups]
  data
 #   time groups       value   lag.value
 #1:    1      a  0.02779005          NA
 #2:    2      a  0.88029938  0.02779005
 #3:    3      a -1.69514201  0.88029938
 #4:    1      b -1.27560288          NA
 #5:    2      b -0.65976434 -1.27560288
 #6:    3      b -1.37804943 -0.65976434
 #7:    4      b  0.12041778 -1.37804943

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
 data
#    time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

Update

From data.table versions >= v1.9.5, we can use shift with type as lag or lead. By default, the type is lag.

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
#   time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

If you need the reverse, use type=lead

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
  #  time groups      value     value1      value2 lead.value lead.value1
  #1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814
  #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884
  #3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA
  #4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432
  #5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406
  #6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999
  #7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA
 #   lead.value2
 #1: -0.04493361
 #2: -0.01619026
 #3:          NA
 #4:  0.82122120
 #5:  0.59390132
 #6:  0.91897737
 #7:          NA

data

 set.seed(1)
 data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
             value = rnorm(7), value1=rnorm(7), value2=rnorm(7))