xiaodai - 3 months ago 10
R Question

# R: How to create a lag variable for each by group?

I have a data.table as below

``````set.seed(1)
data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))), value = rnorm(7))
``````

which looks like

``````  groups time      value
1:      b    1 -0.6264538
2:      b    2  0.1836433
3:      b    3 -0.8356286
4:      a    1  1.5952808
5:      a    2  0.3295078
6:      a    3 -0.8204684
7:      a    4  0.4874291
``````

I wanted to create a dataset where the

value

column is lagged whithin each group of

groups

the result should look like

``````  groups time      value  lag.value
1      a    1  1.5952808         NA
2      a    2  0.3295078  1.5952808
3      a    3 -0.8204684  0.3295078
4      a    4  0.4874291 -0.8204684
5      b    1 -0.6264538         NA
6      b    2  0.1836433 -0.6264538
7      b    3 -0.8356286  0.1836433
``````

Using lag directly as in

``````data\$lag.value <- lag(data\$value)
``````

clearly wouldn't work.

I tried

``````unlist(tapply(data\$value,data\$groups,lag))
``````

which gives

``````   a1         a2         a3         a4         b1         b2         b3
NA -0.1162932  0.4420753  2.1505440         NA  0.5894583 -0.2890288
``````

Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table (and whichever other package)?

You could do this within `data.table`

`````` library(data.table)
data[, lag.value:=c(NA, value[-.N]), by=groups]
data
#   time groups       value   lag.value
#1:    1      a  0.02779005          NA
#2:    2      a  0.88029938  0.02779005
#3:    3      a -1.69514201  0.88029938
#4:    1      b -1.27560288          NA
#5:    2      b -0.65976434 -1.27560288
#6:    3      b -1.37804943 -0.65976434
#7:    4      b  0.12041778 -1.37804943
``````

For multiple columns:

``````nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
data
#    time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132
``````

### Update

From `data.table` versions >= `v1.9.5`, we can use `shift` with `type` as `lag` or `lead`. By default, the type is `lag`.

``````data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
#   time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132
``````

If you need the reverse, use `type=lead`

``````nm3 <- paste("lead", nm1, sep=".")
``````

Using the original dataset

``````  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
#1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884
#3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA
#4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432
#5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999
#7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA
#1: -0.04493361
#2: -0.01619026
#3:          NA
#4:  0.82122120
#5:  0.59390132
#6:  0.91897737
#7:          NA
``````

### data

`````` set.seed(1)
data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
value = rnorm(7), value1=rnorm(7), value2=rnorm(7))
``````