ftxx - 1 year ago 81
R Question

fastest way to count the number of rows in a data frame that has at least one NA

When you have the data set, usually you want to see that is the fraction of rows that has at least one NA (or missing value) in the data set.

In R, what I did is the following:

``````TR = apply(my_data,1,anyNA)
sum(TR)/length(TR)
``````

But I found that if my data set has 1 million rows, it takes some time. I wonder if there is a fastest way to achieve this goal in R?

Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.

I suspected some of the time was being absorbed in transforming a data frame to a matrix for `apply` and `rowSums`, so I've also done most of the solutions on matrices to illustrate the penalty applied by running these solutions on a data frame.

``````# Make a data frame of 10,000 rows and set random values to NA

library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]

MT <- lapply(MT,
function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
bind_cols()

MT_mat <- as.matrix(MT)

library(microbenchmark)
microbenchmark(
apply(MT,1,anyNA),
apply(MT_mat,1,anyNA),  # apply on a matrix
row_sum = rowSums(is.na(MT)) > 0,
row_sum_mat = rowSums(is.na(MT_mat)), # rowSums on a matrix
reduce = Reduce('|', lapply(MT, is.na)) ,
complete_case = !complete.cases(MT),
complete_case_mat = !complete.cases(MT_mat) # complete.cases on a matrix
)

Unit: microseconds
expr       min        lq       mean     median         uq       max neval  cld
apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791   100    d
apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346   100    d
row_sum  1541.594  1581.768  2233.1150  1617.3985  1647.8955 49114.588   100  bc
row_sum_mat   579.161   589.131   707.3710   618.7490   627.5465  3235.089   100 a c
reduce  2028.969  2051.696  2252.8679  2084.8320  2102.8670  4271.127   100   c
complete_case   321.984   330.195   346.8692   342.5115   351.3090   436.057   100 a
complete_case_mat   348.083   358.640   384.1671   379.0205   406.8790   503.503   100 ab

#* Verify that they all return the same result
MT\$apply <- apply(MT, 1, anyNA)
MT\$apply_mat <- apply(MT_mat, 1, anyNA)
MT\$row_sum <- rowSums(is.na(MT)) > 0
MT\$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT\$reduce <- Reduce('|', lapply(MT, is.na))
MT\$complete_case <- !complete.cases(MT)
MT\$complete_case_mat <- !complete.cases(MT_mat)

all(MT\$apply == MT\$apply_mat)
all(MT\$apply == MT\$row_sum)
all(MT\$apply == MT\$row_sum_mat)
all(MT\$apply == MT\$reduce)
all(MT\$apply == MT\$complete_case)
all(MT\$apply == MT\$complete_case_mat)
``````

`complete.cases` seems to be the clear winner, and works well for both data frames and matrices. As it turns out, `complete.cases` calls a C routine, which may account for much of its speed. looking at `rowSums`, `apply`, and `Reduce` shows R code.

Why `apply` is slower the `rowSums` probably has to do with `rowSums` being optimized for a specific task. `rowSums` knows it will be returning a numeric, `apply` has no such guarantee. I doubt that accounts for all of the difference--I'm mostly speculating.

I couldn't begin to tell you how `Reduce` is working.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download