Thomas Thomas - 15 days ago 7
R Question

How to omit rows with NA in only two columns in R?

I want to omit rows where

NA
appears in both of two columns.

I'm familiar with
na.omit
,
is.na
, and
complete.cases
, but can't figure out how to use these to get what I want. For example, I have the following dataframe:

(df <- structure(list(x = c(1L, 2L, NA, 3L, NA),
y = c(4L, 5L, NA, 6L, 7L),
z = c(8L, 9L, 10L, 11L, NA)),
.Names = c("x", "y", "z"),
class = "data.frame",
row.names = c(NA, -5L)))
x y z
1 4 8
2 5 9
NA NA 10
3 6 11
NA 7 NA


and I want to remove only those rows where
NA
appears in both the
x
and
y
columns (excluding anything in z), to give

x y z
1 4 8
2 5 9
3 6 11
NA 7 NA


Does anyone know an easy way to do this? Using
na.omit
,
is.na
, or
complete.cases
is not working.

Answer
df[!with(df,is.na(x)& is.na(y)),]
#      x y  z
#1  1 4  8
#2  2 5  9
#4  3 6 11
#5 NA 7 NA

I did benchmarked on a slightly bigger dataset. Here are the results:

set.seed(237)
df <- data.frame(x=sample(c(NA,1:20), 1e6, replace=T), y= sample(c(NA, 1:10), 1e6, replace=T), z= sample(c(NA, 5:15), 1e6,replace=T)) 

f1 <- function() df[!with(df,is.na(x)& is.na(y)),]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
f3 <- function()  df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] 

library(microbenchmark)

microbenchmark(f1(), f2(), f3(), unit="relative")
Unit: relative
#expr       min        lq    median        uq       max neval
# f1()  1.000000  1.000000  1.000000  1.000000  1.000000   100
# f2()  1.044812  1.068189  1.138323  1.129611  0.856396   100
# f3() 26.205272 25.848441 24.357665 21.799930 22.881378   100