user3878084 - 14 days ago
R Question

replace NA or else <NA> with something or something else in column of data frame

I have read what seem to be related posts, but am evidently much too much a Noob to understand or make anything work...

> df
  ID Area Address
1 NA    1    lane
2 11   NA    road
3 12    2    blvd
4 13    5    <NA>

> str(df)
'data.frame': 4 obs. of 3 variables:
 $ ID     : int NA 11 12 13
 $ Area   : int 1 NA 2 5
 $ Address: Factor w/ 3 levels "blvd","lane",..: 2 3 1 NA


I want to be able -- not just for the data frame above, but for larger data frames having many more rows and many more columns -- to replace in whatever columns I choose (which I reference by column names) all occurrences of

<NA>


with an element of my choosing from

<NA> , NA, "foo", "", 0


and whatever performs the replacement should not break or give error(s) when there is no

<NA>


to replace. Likewise, I want to perform an analogous replacement for

NA


in whatever columns I choose without breakage or errors.

If there are technical reasons as to why I cannot do what I propose, then what can I do to come as close as possible to the above (while sticking to data frames -- converting to and fro with something else is o.k. if the answer is very explicit as to how exactly to manage the conversions -- and preserving factors in the sense that, for example, the Address column is a factor so after the replacement it should still be a factor).

I expect there are technical reasons as to why I cannot do what I propose (I am confused to the point of asking the impossible), so I am hoping to come as close as reality permits, and that some kind soul will explain the extent to which I can come close to the above as well as how exactly to get however close is possible.

Please help (do not assume I can possibly understand without a detailed explicit answer).

Thanks

Answer

A character string cannot be inserted into a numeric or integer vector without converting the entire vector to character, so for numeric columns we replace NA with zero instead. For factors of the sort shown in the question, we replace NA with a new level, fill, which defaults to "foo".
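To make the coercion point concrete, here is a minimal sketch on stand-alone vectors that mirror the ID and Address columns (the names id, id_chr, id_int and address are only for illustration). It also shows that <NA> is just how a missing value is printed for a factor, so is.na() finds it the same way it finds NA in a numeric vector.

```r
# An integer vector with one missing value, like the ID column.
id <- c(NA, 11L, 12L, 13L)

# Replacing NA with a string coerces every element to character.
id_chr <- replace(id, is.na(id), "foo")
class(id_chr)   # "character"

# Replacing NA with 0L keeps the vector integer.
id_int <- replace(id, is.na(id), 0L)
class(id_int)   # "integer"

# A factor with a missing value, like the Address column.
# It prints as <NA>, but is.na() detects it like any other NA.
address <- factor(c("lane", "road", "blvd", NA))
is.na(address)  # FALSE FALSE FALSE TRUE
```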

1) df.orig, shown reproducibly at the end, has integer and factor columns. The following works for those as well as for double (non-integer numeric) columns. You will need to extend it if you want to handle other classes not shown in the question.

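df <- df.orig

# replace NA with 0 in every numeric (integer or double) column
isNum <- sapply(df, is.numeric)
na2zero <- function(v, ...) replace(v, is.na(v), 0L)
df[isNum] <- lapply(df[isNum], na2zero)

# in every factor column, turn NA into an extra level named fill
isFactor <- sapply(df, is.factor)
na2fill <- function(v, fill = "foo", ...) {
  v <- addNA(v)                  # append NA as the last level
  levels(v)[nlevels(v)] <- fill  # rename that last level to fill
  v
}
df[isFactor] <- lapply(df[isFactor], na2fill)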

giving:

> df
  ID Area Address
1  0    1    lane
2 11    0    road
3 12    2    blvd
4 13    5     foo
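The question also asks to pick the columns by name, to pick the replacement value, and to have nothing break when a column has no NA. One possible sketch of that follows. The helper replace_na_in and its arguments are hypothetical names, not part of the code above, and the data frame is rebuilt inline so the block runs on its own.

```r
# Hypothetical helper: replace NA with `value` in the columns named in `cols`.
replace_na_in <- function(df, cols, value) {
  for (nm in cols) {
    v <- df[[nm]]
    miss <- is.na(v)
    if (!any(miss)) next                  # no NA here: leave the column untouched
    if (is.factor(v)) {
      lev <- as.character(value)
      # Add the replacement as a level if needed so the column stays a factor.
      if (!(lev %in% levels(v))) levels(v) <- c(levels(v), lev)
      v[miss] <- lev
    } else {
      v[miss] <- value                    # a string here would coerce numbers to character
    }
    df[[nm]] <- v
  }
  df
}

df <- data.frame(
  ID = c(NA, 11L, 12L, 13L),
  Area = c(1L, NA, 2L, 5L),
  Address = factor(c("lane", "road", "blvd", NA))
)

out <- replace_na_in(df, c("ID", "Area"), 0L)
out <- replace_na_in(out, "Address", "foo")
out <- replace_na_in(out, "Address", "foo")  # no NA left: a harmless no-op
str(out)
```

Unlike na2fill, this sketch adds the new level only when the column actually contains NA. addNA() with its default arguments adds the extra level even when nothing is missing, which leaves an unused level behind; that is harmless, but it may matter if the levels are tabulated later.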

2) Alternatively, we could use S3 to do it more compactly, where na2zero and na2fill are from (1).

rmNA <- function(v, ...) UseMethod("rmNA")
rmNA.numeric <- na2zero
rmNA.factor <- na2fill
rmNA.default <- function(v, ...) v # do not process other classes

df <- df.orig
df[] <- lapply(df, rmNA)
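The S3 version can also be limited to chosen columns and extended to other classes. The sketch below repeats the generic and methods so it runs on its own, and adds a character method as an assumed extension. rmNA.character, the Note column and the "unknown" fill value are illustrative and not part of the answer above.

```r
# Generic and methods repeated from above so this block is self-contained.
rmNA <- function(v, ...) UseMethod("rmNA")
rmNA.numeric <- function(v, ...) replace(v, is.na(v), 0L)
rmNA.factor <- function(v, fill = "foo", ...) {
  v <- addNA(v)
  levels(v)[nlevels(v)] <- fill
  v
}
rmNA.default <- function(v, ...) v

# Assumed extension: character columns get fill as a plain string.
rmNA.character <- function(v, fill = "foo", ...) replace(v, is.na(v), fill)

df <- data.frame(
  ID = c(NA, 11L, 12L, 13L),
  Area = c(1L, NA, 2L, 5L),
  Address = factor(c("lane", "road", "blvd", NA)),
  Note = c("a", NA, "c", "d"),
  stringsAsFactors = FALSE
)

# Process only the columns named in cols. Extra arguments given to lapply()
# are passed on to the methods; the numeric method simply ignores fill.
cols <- c("ID", "Address", "Note")
df[cols] <- lapply(df[cols], rmNA, fill = "unknown")
df
```

Area is not listed in cols, so its NA is left as it is.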

Note: The input data frame df.orig in reproducible form is:

df.orig <- 
structure(list(ID = c(NA, 11L, 12L, 13L), Area = c(1L, NA, 2L, 
5L), Address = structure(c(2L, 3L, 1L, NA), .Label = c("blvd", 
"lane", "road"), class = "factor")), .Names = c("ID", "Area", 
"Address"), class = "data.frame", row.names = c("1", "2", "3", 
"4"))