nate - 11 months ago 54

R Question

I have a long vector of floats. I would like to repeatedly find subsets of that vector within various ranges. My current syntax (

`DT[x > 1.8 & x < 2.9]`

Is there a faster syntax? I have not been able to track one down.

Example:

`x = runif(1E6)`

DT = data.table(x, key = "x")

# For foverlaps()

DT[,xtemp:=x]

range = data.table(start = 0.04, end = 0.5, key=c("start", "end"))

microbenchmark::microbenchmark(

DT[x < 0.5 & x > 0.04],

x[x < 0.5 & x > 0.04],

foverlaps(DT, range, by.x = c("x", "xtemp"))

)

Unit: milliseconds

expr min lq mean median uq max neval

DT[x < 0.5 & x > 0.04] 12.65391 16.10852 18.43412 17.23268 17.76868 104.1112 100

x[x < 0.5 & x > 0.04] 16.48126 19.63882 21.65813 20.31534 20.95264 113.7965 100

foverlaps(DT, range, by.x = c("x", "xtemp")) 116.72732 131.93093 145.56821 140.09218 146.33287 226.6069 100

Edit: Added

`foverlaps()`

Edit: Here is an update to the accepted answer by Psidom. This modification handles two edge cases, where your search range extends beyond the first or last row. I found a 50x speedup with my data and this approach.

`DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; ind[1][is.na(ind[1])] = 1; ind[2][is.na(ind[2])] = nrow(DT); (ind[1]+1):ind[2]}]`

Answer Source

Based on the answer here, this seems to be some sort of improvement. values equal to 0.5 will be included in this scenario though:

```
bs <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; (ind[1]+1):ind[2]}]
vs <- function() x[x < 0.5 & x > 0.04]
x = runif(1E6)
DT = data.table(x, key = "x")
microbenchmark::microbenchmark(
bs(),
vs()
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# bs() 3.594993 4.150932 5.002947 4.44695 4.952283 9.750284 100
# vs() 15.054460 16.217198 18.999877 17.45298 19.554958 113.623699 100
```

If we modify `vs()`

to be:

```
vs <- function() x[x <= 0.5 & x > 0.04]
```

The results from two methods are the same:

```
identical(bs()$x, sort(vs()))
# [1] TRUE
```