nate - 1 year ago 64
R Question

# Does data.table implement fast range subsetting based on binary search? What is that syntax?

I have a long vector of floats. I would like to repeatedly find subsets of that vector within various ranges. My current syntax (

`DT[x > 1.8 & x < 2.9]`
) appears to vector scan (it is relatively slow).

Is there a faster syntax? I have not been able to track one down.

Example:

``````x = runif(1E6)
DT = data.table(x, key = "x")

# For foverlaps()
DT[,xtemp:=x]
range = data.table(start = 0.04, end = 0.5, key=c("start", "end"))

microbenchmark::microbenchmark(
DT[x < 0.5 & x > 0.04],
x[x < 0.5 & x > 0.04],
foverlaps(DT, range, by.x = c("x", "xtemp"))
)

Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
DT[x < 0.5 & x > 0.04]  12.65391  16.10852  18.43412  17.23268  17.76868 104.1112   100
x[x < 0.5 & x > 0.04]  16.48126  19.63882  21.65813  20.31534  20.95264 113.7965   100
foverlaps(DT, range, by.x = c("x", "xtemp")) 116.72732 131.93093 145.56821 140.09218 146.33287 226.6069   100
``````

`foverlaps()`

Edit: Here is an update to the accepted answer by Psidom. This modification handles two edge cases, where your search range extends beyond the first or last row. I found a 50x speedup with my data and this approach.

``````DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; ind[1][is.na(ind[1])] = 1; ind[2][is.na(ind[2])] = nrow(DT); (ind[1]+1):ind[2]}]
``````

Based on the answer here, this seems to be some sort of improvement. values equal to 0.5 will be included in this scenario though:

``````bs <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; (ind[1]+1):ind[2]}]
vs <- function() x[x < 0.5 & x > 0.04]

x = runif(1E6)
DT = data.table(x, key = "x")

microbenchmark::microbenchmark(
bs(),
vs()
)

#Unit: milliseconds
# expr       min        lq      mean   median        uq        max neval
# bs()  3.594993  4.150932  5.002947  4.44695  4.952283   9.750284   100
# vs() 15.054460 16.217198 18.999877 17.45298 19.554958 113.623699   100
``````

If we modify `vs()` to be:

``````vs <- function() x[x <= 0.5 & x > 0.04]
``````

The results from two methods are the same:

``````identical(bs()\$x, sort(vs()))
# [1] TRUE
``````