Lennie Lennie - 3 months ago 15
R Question

How to subset multiple rows from data.table fast on big data

I'm working with R-Studio and want to get multiple rows from a data.table.

Let's say I have a data.table with the following data:

Date Column 1
"01.02.2016 10:00:00 CEST" 10
"01.02.2016 10:01:00 CEST" 12
"01.02.2016 10:02:00 CEST" 13
"01.02.2016 10:03:00 CEST" 11
"01.02.2016 10:04:00 CEST" 17


and I want to get the values from "01.02.2016 10:00:30" to "01.02.2016 10:02:30" like this:

Date Column 1
"01.02.2016 10:01:00 CEST" 12
"01.02.2016 10:02:00 CEST" 13


at the moment I achieve this by doing this:

x <- data.table[Date >= "01.02.2016 10:00:30 CEST" & Date <= "01.02.2016 10:02:30 CEST"]


But this is far too slow for me, because on a data.table with 600k rows it takes about 0.4 seconds.

Instead this is much much faster:

setkey(data.table, Date)
x <- prozessdaten.data.table[J(c("01.02.2016 10:01:00 CEST", "01.02.2016 10:02:00 CEST"))]


My Question is there a possibility to use the binary Search function J() with a specified time range and not exact values?

Answer

data.table v1.9.7+ has implemented non-equi joins and added a new function inrange which uses this new feature and can achieve what you want

## Loading data
library(data.table) #v 1.9.7+
DT <- data.table(date = c('01.02.2016 10:00:00','01.02.2016 10:01:00',
                          '01.02.2016 10:02:00','01.02.2016 10:03:00',
                          '01.02.2016 10:04:00'),
                 column1 = c(10, 12, 13, 11, 17))

## Converting to POSIXct class
DT[, date := as.POSIXct(date, format = "%d.%m.%Y %H:%M:%S")]

## Validating that forder kicks in
options(datatable.verbose = TRUE)
DT[date %inrange% as.POSIXct(c("2016-02-01 10:00:30", "2016-02-01 10:02:30"))]
# forder took 0 sec
# x is already ordered by these columns, no need to call reorder
#                   date column1
# 1: 2016-02-01 10:01:00      12
# 2: 2016-02-01 10:02:00      13

Though you should be aware that since data.table 1.9.4 secondary keys were implemented, meaning that for some variations of vector scans, after the first run a key is being added and from now on, even operations such as == and %in% are using bmerge. This doesn't seem to work on POSIXct class but you can observe this behavior on you numeric column column1

## Running for first time
options(datatable.verbose = TRUE)
DT[column1 == 10]
# Creating new index 'column1'
# forder took 0 sec
# Starting bmerge ...done in 0 secs
#                   date column1
# 1: 2016-02-01 10:00:00      10

## Running for second time and on
DT[column1 == 10]
# Using existing index 'column1'
# Starting bmerge ...done in 0 secs
#                   date column1
# 1: 2016-02-01 10:00:00      10

According to their wiki I belive this is going to be implemented for non-equi joins too starting from v1.9.8