xiaodai - 1 year ago 106
R Question

# How to create a lag variable within each group?

I have a data.table:

``````set.seed(1)
data <- data.table(time = c(1:3, 1:4),
groups = c(rep(c("b", "a"), c(3, 4))),
value = rnorm(7))

data
#    groups time      value
# 1:      b    1 -0.6264538
# 2:      b    2  0.1836433
# 3:      b    3 -0.8356286
# 4:      a    1  1.5952808
# 5:      a    2  0.3295078
# 6:      a    3 -0.8204684
# 7:      a    4  0.4874291
``````

I want to compute a lagged version of the "value" column, within each level of "groups".

The result should look like

``````#   groups time      value  lag.value
# 1      a    1  1.5952808         NA
# 2      a    2  0.3295078  1.5952808
# 3      a    3 -0.8204684  0.3295078
# 4      a    4  0.4874291 -0.8204684
# 5      b    1 -0.6264538         NA
# 6      b    2  0.1836433 -0.6264538
# 7      b    3 -0.8356286  0.1836433
``````

I have tried to use
`lag`
directly:

``````data\$lag.value <- lag(data\$value)
``````

...which clearly wouldn't work.

I have also tried:

``````unlist(tapply(data\$value, data\$groups, lag))
a1         a2         a3         a4         b1         b2         b3
NA -0.1162932  0.4420753  2.1505440         NA  0.5894583 -0.2890288
``````

Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table?

You could do this within `data.table`

`````` library(data.table)
data[, lag.value:=c(NA, value[-.N]), by=groups]
data
#   time groups       value   lag.value
#1:    1      a  0.02779005          NA
#2:    2      a  0.88029938  0.02779005
#3:    3      a -1.69514201  0.88029938
#4:    1      b -1.27560288          NA
#5:    2      b -0.65976434 -1.27560288
#6:    3      b -1.37804943 -0.65976434
#7:    4      b  0.12041778 -1.37804943
``````

For multiple columns:

``````nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
data
#    time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132
``````

### Update

From `data.table` versions >= `v1.9.5`, we can use `shift` with `type` as `lag` or `lead`. By default, the type is `lag`.

``````data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
#   time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132
``````

If you need the reverse, use `type=lead`

``````nm3 <- paste("lead", nm1, sep=".")
``````

Using the original dataset

``````  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
#1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884
#3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA
#4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432
#5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999
#7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA
#1: -0.04493361
#2: -0.01619026
#3:          NA
#4:  0.82122120
#5:  0.59390132
#6:  0.91897737
#7:          NA
``````

### data

`````` set.seed(1)
data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
value = rnorm(7), value1=rnorm(7), value2=rnorm(7))
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download