Ben - 1 year ago 56
R Question

# Unexpected result using data.table's shift() by group (bug?)

Consider this dataset

# Data

``````dt <- data.table(ID = c(1,8,9,20,32,33), Char = c("A", "A", "B", "B", "C", "C"))
dt
ID Char
1:  1    A
2:  8    A
3:  9    B
4: 20    B
5: 32    C
6: 33    C
``````

I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group. I can do this as follows

# Correct

``````dt[, InRun := FALSE]
dt[, DistToAbove := abs(ID - shift(ID, type="lag")), by=Char]
dt[, DistToBelow := abs(ID - shift(ID, type="lead")), by=Char]
dt[DistToAbove <= 1 | DistToBelow <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1:  1    A FALSE          NA           7
2:  8    A FALSE           7          NA
3:  9    B FALSE          NA          11
4: 20    B FALSE          11          NA
5: 32    C  TRUE          NA           1
6: 33    C  TRUE           1          NA
``````

I tried simplifying the above code into the lines below, but the answer differs

# Incorrect / Unexpected

``````dt[, InRun := FALSE]
dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1:  1    A FALSE          NA           7
2:  8    A  TRUE           7          NA
3:  9    B  TRUE          NA          11
4: 20    B FALSE          11          NA
5: 32    C  TRUE          NA           1
6: 33    C  TRUE           1          NA
``````

What gives? (Note I'm using data.table v1.9.7)

I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group.

Here's how I'd approach it:

``````dt[, run_id := cumsum(
( ID != shift(ID, fill = ID[1L]) + 1L )
|
( Char != shift(Char, fill = Char[1L]) )
)]
dt[, in_run := .N > 1L, by=.(Char, run_id)]

ID Char run_id in_run
1:  1    A      1  FALSE
2:  8    A      2  FALSE
3:  9    B      3  FALSE
4: 20    B      4  FALSE
5: 32    C      5   TRUE
6: 33    C      5   TRUE
``````

This code identifies all runs (including those with length of one) and then tests for length greater than one (the OP's definition).

Regarding the OP's approach:

``````dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, # i
InRun := TRUE # j
, by=Char] # by
``````

In `DT[i,j,by]` the steps are: filter using `i`, then group with `by`, then calculate `j`. You can't do by-group calculations in `i` in the way attempted here.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download