ftxx - 6 months ago 28

R Question

When you have the data set, usually you want to see that is the fraction of rows that has at least one NA (or missing value) in the data set.

In R, what I did is the following:

`TR = apply(my_data,1,anyNA)`

sum(TR)/length(TR)

But I found that if my data set has 1 million rows, it takes some time. I wonder if there is a fastest way to achieve this goal in R?

Answer

Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.

I suspected some of the time was being absorbed in transforming a data frame to a matrix for `apply`

and `rowSums`

, so I've also done most of the solutions on matrices to illustrate the penalty applied by running these solutions on a data frame.

```
# Make a data frame of 10,000 rows and set random values to NA
library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]
MT <- lapply(MT,
function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
bind_cols()
MT_mat <- as.matrix(MT)
library(microbenchmark)
microbenchmark(
apply(MT,1,anyNA),
apply(MT_mat,1,anyNA), # apply on a matrix
row_sum = rowSums(is.na(MT)) > 0,
row_sum_mat = rowSums(is.na(MT_mat)), # rowSums on a matrix
reduce = Reduce('|', lapply(MT, is.na)) ,
complete_case = !complete.cases(MT),
complete_case_mat = !complete.cases(MT_mat) # complete.cases on a matrix
)
Unit: microseconds
expr min lq mean median uq max neval cld
apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791 100 d
apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346 100 d
row_sum 1541.594 1581.768 2233.1150 1617.3985 1647.8955 49114.588 100 bc
row_sum_mat 579.161 589.131 707.3710 618.7490 627.5465 3235.089 100 a c
reduce 2028.969 2051.696 2252.8679 2084.8320 2102.8670 4271.127 100 c
complete_case 321.984 330.195 346.8692 342.5115 351.3090 436.057 100 a
complete_case_mat 348.083 358.640 384.1671 379.0205 406.8790 503.503 100 ab
#* Verify that they all return the same result
MT$apply <- apply(MT, 1, anyNA)
MT$apply_mat <- apply(MT_mat, 1, anyNA)
MT$row_sum <- rowSums(is.na(MT)) > 0
MT$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT$reduce <- Reduce('|', lapply(MT, is.na))
MT$complete_case <- !complete.cases(MT)
MT$complete_case_mat <- !complete.cases(MT_mat)
all(MT$apply == MT$apply_mat)
all(MT$apply == MT$row_sum)
all(MT$apply == MT$row_sum_mat)
all(MT$apply == MT$reduce)
all(MT$apply == MT$complete_case)
all(MT$apply == MT$complete_case_mat)
```

`complete.cases`

seems to be the clear winner, and works well for both data frames and matrices. As it turns out, `complete.cases`

calls a C routine, which may account for much of its speed. looking at `rowSums`

, `apply`

, and `Reduce`

shows R code.

Why `apply`

is slower the `rowSums`

probably has to do with `rowSums`

being optimized for a specific task. `rowSums`

knows it will be returning a numeric, `apply`

has no such guarantee. I doubt that accounts for all of the difference--I'm mostly speculating.

I couldn't begin to tell you how `Reduce`

is working.