nate nate - 7 days ago 6
R Question

Does data.table implement fast range subsetting based on binary search? What is that syntax?

I have a long vector of floats. I would like to repeatedly find subsets of that vector within various ranges. My current syntax (

DT[x > 1.8 & x < 2.9]
) appears to vector scan (it is relatively slow).

Is there a faster syntax? I have not been able to track one down.

Example:

x = runif(1E6)
DT = data.table(x, key = "x")

# For foverlaps()
DT[,xtemp:=x]
range = data.table(start = 0.04, end = 0.5, key=c("start", "end"))


microbenchmark::microbenchmark(
DT[x < 0.5 & x > 0.04],
x[x < 0.5 & x > 0.04],
foverlaps(DT, range, by.x = c("x", "xtemp"))
)

Unit: milliseconds
expr min lq mean median uq max neval
DT[x < 0.5 & x > 0.04] 12.65391 16.10852 18.43412 17.23268 17.76868 104.1112 100
x[x < 0.5 & x > 0.04] 16.48126 19.63882 21.65813 20.31534 20.95264 113.7965 100
foverlaps(DT, range, by.x = c("x", "xtemp")) 116.72732 131.93093 145.56821 140.09218 146.33287 226.6069 100


Edit: Added
foverlaps()


Edit: Here is an update to the accepted answer by Psidom. This modification handles two edge cases, where your search range extends beyond the first or last row. I found a 50x speedup with my data and this approach.

DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; ind[1][is.na(ind[1])] = 1; ind[2][is.na(ind[2])] = nrow(DT); (ind[1]+1):ind[2]}]

Answer

Based on the answer here, this seems to be some sort of improvement. values equal to 0.5 will be included in this scenario though:

bs <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; (ind[1]+1):ind[2]}]
vs <- function() x[x < 0.5 & x > 0.04]

x = runif(1E6)
DT = data.table(x, key = "x")

microbenchmark::microbenchmark(
    bs(), 
    vs()
)

#Unit: milliseconds
# expr       min        lq      mean   median        uq        max neval
# bs()  3.594993  4.150932  5.002947  4.44695  4.952283   9.750284   100
# vs() 15.054460 16.217198 18.999877 17.45298 19.554958 113.623699   100

If we modify vs() to be:

vs <- function() x[x <= 0.5 & x > 0.04]

The results from two methods are the same:

identical(bs()$x, sort(vs()))
# [1] TRUE
Comments