Peter - 8 months ago

R Question

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.

It has many NA values scattered across its rows and columns, and I want to replace them with zeros. I have used:

`d[is.na(d)] <- 0`

but this is rather slow. Is there a better way to do this in R?

I am open to using other R packages.

I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros". And while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), its answers do not focus on computational speed for a large data frame with many missing values.

Thanks!

As helpfully suggested, converting d with data.matrix before applying is.na sped up the computation by several orders of magnitude.
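For readers who want to try the conversion described above, here is a minimal sketch. The small data frame is made-up stand-in data (the real `d` is 10500 by 6000), and `d_filled` is just an illustrative name. It assumes every column is numeric, as stated in the question.

```r
# Made-up stand-in for the real data frame; sizes are illustrative only.
set.seed(1)
m0 <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
m0[sample(length(m0), 300)] <- NA   # 300 distinct cells set to NA
d <- data.frame(m0)

# Convert to a numeric matrix, replace the NAs there, then convert back.
m <- data.matrix(d)
m[is.na(m)] <- 0
d_filled <- as.data.frame(m)

anyNA(d_filled)   # FALSE
```

If the rest of the analysis can work on a matrix, the final `as.data.frame()` step can be skipped altogether.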

Answer

I assume that all columns are numeric; otherwise, replacing NAs with 0s wouldn't be sensible.

I get the following timings, with approximately 10,000 NAs:

```
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
   user  system elapsed 
   0.19    0.12    0.31 
> system.time(D[is.na(D)] <- 0)
   user  system elapsed 
   3.87    0.06    3.95 
```

So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
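If the result has to stay a data frame, another option (not timed above, so treat it as an untested suggestion with respect to speed) is to replace the NAs one column at a time. The sketch below uses a tiny made-up data frame; wrapping the replacement line in `system.time()` on the real data would show whether it helps.

```r
# Tiny made-up example; the real D would be 10500 by 6000.
D <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Replace NAs column by column; D[] <- keeps D a data frame
# with its original column names.
D[] <- lapply(D, function(col) {
  col[is.na(col)] <- 0
  col
})

D
#   a b
# 1 1 0
# 2 0 5
# 3 3 6
```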

I hope this helps.

Source (Stack Overflow)