snoram snoram - 1 year ago 71
R Question

Replacing impossible values with NA using R's data.table

I have a code that replaces impossible values in a dataset with NA.

I'm trying to convert the code to being based on

, as an example, I replace height of
with height

(Dummy) data

DT <- data.table(id = 1:5e6,
height = sample(c(0, 100:240), 5e6, replace = TRUE))

My current solution is slower and at least as verbose as my data.frame version. I assume I am doing something wrong...

DT[height == 0, height := NA]

While researching this question I found another solution which is much faster (but uglier).

set(DT, which("height"==0), "height", value = NA)

All suggestions appreciated.

Answer Source

Since v1.9.4, data.table by default automatically creates an index on columns during subsets of the form x == val and x %in% val used within [.data.table call. This makes subsequent subsetting very fast with only a slightly higher price to pay on the first subset (since data.table's radix ordering is quite fast). The first subset could be slower because it is the time to:

  1. create the index

  2. and then subset.

To illustrate this (using @akrun's data):

getOption("") # [1] TRUE ===> enabled

DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.396   0.059   0.452 ## first run
#   0.003   0.000   0.004 ## second run is very fast

Now if we disable auto indexing:

options( = FALSE)
getOption("") # [1] FALSE

DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.037   0.007   0.042 ## first run
#   0.039   0.010   0.045 ## second run (~ 10x slower than 2nd run above)

options( = TRUE) # restore auto indexing if necessary

But your case is special because, you update the same column you subset. In essence, this is what is happening:

  1. The i expression is seen to be an expression that can be optimised for auto indexing. An index is created and saved for blazing fast subsets later on.

  2. The j expression is seen and the column is updated.

  3. The column on which the index has been set has been updated. So index is removed.

Auto indexing logic should detect this and skip creating the index altogether if any of the rows evaluate to TRUE, since the created index is essentially useless.

Could you please file an issue on the project issues page? Just linking to this SO Q should be sufficient.

To answer your Q, disable auto indexing and run the subset, and it should be more or less equal to the time you get with set().

Base R solution just can not be faster here since it copies to entire column just to update those entries. But it is because base R chose to do that.