Ben - 1 year ago 89
R Question

# Elegant way to identify runs inside a data.table

I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a

data.table
, but I can't figure out an elegant way to do it.

## Example

set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]

ID Char
1: 9 A
2: 10 B
3: 20 C
4: 42 C
5: 2 D
6: 4 D
7: 6 D
8: 18 D
...

Here, I'd like to identify and group rows where the ID is within 2 of the row above/below. Here's my currently ugly solution

# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
ID Char RunID
1: 9 A 1
2: 10 B 1
3: 20 C NA
4: 42 C NA
5: 2 D 2
6: 4 D 2
7: 6 D 2
8: 18 D NA
...

Is there a better way to do this?

EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.