I have code that replaces impossible values in a dataset with NA.
I'm trying to convert the code to being based on
library(data.table)
DT <- data.table(id = 1:5e6,
                 height = sample(c(0, 100:240), 5e6, replace = TRUE))
DT[height == 0, height := NA]
set(DT, which(DT$height == 0), "height", value = NA)
Since v1.9.4, data.table by default automatically creates an index on a column during subsets of the form x == val and x %in% val used within a [.data.table call. This makes subsequent subsets very fast, at only a slightly higher cost on the first subset (since data.table's radix ordering is quite fast). The first subset can be slower because its time includes the time to:
create the index,
and then subset.
To illustrate this (using @akrun's data):
require(data.table)
getOption("datatable.auto.index") # TRUE ===> enabled
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.396 0.059 0.452 ## first run
# 0.003 0.000 0.004 ## second run is very fast
Now if we disable auto indexing:
require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # FALSE
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.037 0.007 0.042 ## first run
# 0.039 0.010 0.045 ## second run (~ 10x slower than 2nd run above)
options(datatable.auto.index = TRUE) # restore auto indexing if necessary
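Besides timings, the index itself can be inspected. The following is a minimal sketch, assuming a data.table version that exports indices() and a smaller table than above so it runs quickly; the name res is only illustrative.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e5,
                 height = sample(c(0L, 100:240), 1e5, replace = TRUE))

indices(DT)              # nothing has been indexed yet
res <- DT[height == 0L]  # an optimisable `==` subset in i
indices(DT)              # with auto indexing enabled, an index on "height" is expected here
```

If datatable.auto.index is FALSE, the second indices() call should report no index, which matches the roughly constant timings in the run above.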
But your case is special because you update the same column you subset on. In essence, this is what happens:
The i expression is recognised as one that can be optimised with auto indexing. An index is created and saved for blazing fast subsets later on.
The j expression is evaluated and the column is updated.
The column on which the index was set has now been updated, so the index is removed.
The auto indexing logic should detect this and skip creating the index altogether if any of the rows evaluate to TRUE, since the created index is essentially useless.
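The sequence above can be observed on a small table. This is a sketch under the assumption that indices() is available in the installed data.table version; n_zero is a helper name introduced only for this illustration.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e5,
                 height = sample(c(0L, 100:240), 1e5, replace = TRUE))

n_zero <- sum(DT$height == 0L)  # plain vector scan, does not touch any index

DT[height == 0L, height := NA]  # subset on and update the same column
indices(DT)                     # no index on "height" is left after the update
```

Whether the index is built and then dropped, or never built at all, depends on the data.table version; either way no usable index remains, so any time spent building one for this call is wasted.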
Could you please file an issue on the project issues page? Just linking to this SO Q should be sufficient.
To answer your Q, disable auto indexing and run the subset, and it should be more or less equal to the time you get with
The base R solution simply cannot be faster here, since it copies the entire column just to update those entries. But that is because base R chose to do that.
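The suggestion to disable auto indexing for the update could look roughly like this. It is a sketch, not a benchmark: DT1, DT2 and old are illustrative names, the table is kept small, and NA_integer_ is used only to match the integer column created here.

```r
library(data.table)

set.seed(24)
DT1 <- data.table(id = 1:1e5,
                  height = sample(c(0L, 100:240), 1e5, replace = TRUE))
DT2 <- copy(DT1)

# Turn auto indexing off for this update, remembering the previous setting.
old <- options(datatable.auto.index = FALSE)
DT1[height == 0L, height := NA_integer_]
options(old)  # restore whatever was set before

# The same update through set(), with row numbers from a plain vector scan.
set(DT2, i = which(DT2$height == 0L), j = "height", value = NA_integer_)
```

Both tables end up with the same height column; wrapping each update in system.time() on a larger table is one way to compare the two approaches.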