Ben Ben - 18 days ago 6
R Question

Elegant way to identify runs inside a data.table

I've run into this problem twice in the last two weeks alone, so I figured it's worth a post. I'm trying to identify "runs" inside a

data.table
, but I can't figure out an elegant way to do it.

Example



set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]

ID Char
1: 9 A
2: 10 B
3: 20 C
4: 42 C
5: 2 D
6: 4 D
7: 6 D
8: 18 D
...


Here, I'd like to identify and group rows where the ID is within 2 of the row above/below. Here's my currently ugly solution

# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
ID Char RunID
1: 9 A 1
2: 10 B 1
3: 20 C NA
4: 42 C NA
5: 2 D 2
6: 4 D 2
7: 6 D 2
8: 18 D NA
...


Is there a better way to do this?




EDIT: It seems there's been some confusion over how I'm defining a "run". To put it more explicitly, row_i and row_i+1 should have the same RunID if and only if their IDs are within a distance of 2.

Answer

I would stop after making this run ID:

dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]

But to get the OP's run ID (which ignores length-one runs), here are a couple ways to go:

dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]
# or
dt[, run_len := .N, by=run_id0 ][ run_len > 1L, run_id2 := .GRP, by=run_id0 ]
Comments