eclairs eclairs - 2 months ago 11
R Question

deleting rows in R containing one blank column

I have a data set like:

ID EMAIL_ID.x ID.y EMAIL_ID.y
60842 k@aol.com 60842 k@aol.com
90853 a.b.c NA <NA>
90854 b.c.d NA <NA>
83907 h@gwi.net 02854 <NA>
83908 s@hotmail.com 00952 <NA>
83909 l@rediff.com 78895 <NA>


In this data, i only want to retain the rows in which both the 3rd and the 4th column are either populated or both not populated.

Hence the result should be like:

ID EMAIL_ID.x ID.y EMAIL_ID.y
60842 k@aol.com 60842 k@aol.com
90853 a.b.c NA <NA>
90854 b.c.d NA <NA>


In the above, the rows where column 3 had values but column 4 did not, have been deleted.

Please suggest how can this be done in R.

Thankyou!

Answer

One option is to check whether there are NA elements (is.na(...)) in both (&) columns or (|) both non-NA elements (by negating !) and use that logical index to subset the rows.

df1[is.na(df1[,3]) & is.na(df1[,4]) | !is.na(df1[,3]) & !is.na(df1[,4]),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is to apply rowSums on the logical matrix (is.na(df1[3:4])), check if the sum of NA elements are either 0 or 2 using %in% (if it is 0 - it implies there are no NA elements and 2 means both the columns have NA) and use the logical vector to subset the rows.

df1[rowSums(is.na(df1[3:4])) %in% c(0,2),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is Reduce with lapply (to avoid conversion to matrix - in case the dataset is really big)

df1[Reduce(`+`, lapply(df1[3:4], is.na)) != 1,]

NOTE: If the NA in the OP's dataset is not real NA, we need to convert it to real NA before doing this (assuming the 4th column is character class)

is.na(df1[,4]) <- df1[,4] == "<NA>"

data

df1 <- structure(list(ID = c(60842L, 90853L, 90854L, 83907L, 83908L, 
83909L), EMAIL_ID.x = c("k@aol.com", "a.b.c", "b.c.d", "h@gwi.net", 
"s@hotmail.com", "l@rediff.com"), JPNUMBER.y = c(60842L, NA, 
NA, 2854L, 952L, 78895L), EMAIL_ID.y = c("k@aol.com", NA, NA, 
NA, NA, NA)), .Names = c("ID", "EMAIL_ID.x", "JPNUMBER.y", "EMAIL_ID.y"
), row.names = c(NA, -6L), class = "data.frame")
Comments