Ben - 18 days ago 6
R Question

# Elegant way to identify runs inside a data.table

I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a

`data.table`
, but I can't figure out an elegant way to do it.

## Example

``````set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]

ID Char
1:  9    A
2: 10    B
3: 20    C
4: 42    C
5:  2    D
6:  4    D
7:  6    D
8: 18    D
...
``````

Here, I'd like to identify and group rows where the ID is within 2 of the row above/below. Here's my currently ugly solution

``````# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
ID Char RunID
1:  9    A     1
2: 10    B     1
3: 20    C    NA
4: 42    C    NA
5:  2    D     2
6:  4    D     2
7:  6    D     2
8: 18    D    NA
...
``````

Is there a better way to do this?

EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.

``````dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]
``````dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]