rg6 rg6 - 6 days ago 7
R Question

Custom data-dependent recoding to logicals in R

I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.

I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.

Small toy example for meta:

name                 type     false  true
-----------------------------------------
a.char.var           char     NA     NA
a.logical.var        logical  NA     7
another.logical.var  logical  1      0
another.char.var     char     NA     NA


Small toy example for data:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          7              0                    ba
ab          NA             1                    bb
ac          7              NA                   bc
ad          4              3                    bd


Small toy example output:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          TRUE           TRUE                 ba
ab          FALSE          FALSE                bb
ac          TRUE           NA                   bc
ad          NA             NA                   bd


I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.

Answer

First we set up the data:

meta <- data.frame(name = c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                   type = c('char', 'logical', 'logical', 'char'),
                   false = c(NA, NA, 1, NA),
                   true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)

data <- data.frame(a.char.var = c('aa', 'ab', 'ac', 'ad'),
                   a.logical.var = c(7, NA, 7, 4),
                   another.logical.var = c(0, 1, NA, 3),
                   another.char.var = c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)

Then we subset meta to just the rows that describe logical columns. We iterate through these rows, using the name column to select the relevant column in data. In data_out, each such column is first initialized to NA and then set to TRUE or FALSE wherever the value in data matches the corresponding code.

Note that data[, logical_meta$name[1]] is equivalent to data[, 'a.logical.var'] or data$a.logical.var if logical_meta$name is a character vector. If it is a factor (e.g. if we did not specify stringsAsFactors = FALSE), we need to convert it to character first, at which point we might as well store it in a variable, colname below.
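To see why the conversion matters, here is a minimal sketch with a made-up data frame df and factor f (both hypothetical, not part of the question's data). R indexes by a factor's underlying integer codes rather than by its labels, so the factor can silently pick the wrong column:

```r
# Toy data frame and factor; the names here are illustrative only.
df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# "z" is the first level, so its underlying integer code is 1.
f <- factor("z", levels = c("z", "x"))

# Indexing by the factor uses the integer code (1), not the label "z".
by_factor <- names(df[, f, drop = FALSE])

# Converting to character first selects the column by name.
by_string <- names(df[, as.character(f), drop = FALSE])

print(by_factor)  # "x"
print(by_string)  # "z"
```

With real column names the codes follow the sorted order of the levels, so the mismatch may or may not show up depending on how the columns happen to be ordered.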

Because the data contain NAs, wrapping the comparisons in which is helpful: c(0, 1, NA, 3) == 0 returns TRUE, FALSE, NA, FALSE, but which drops the NA and returns just position 1. Subsetting by a logical vector that contains NA yields NA rows or columns; using which avoids this.
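The difference is easy to check on the vector from the paragraph above. This is a small sketch; the names x, cmp, df and res are just for illustration:

```r
x <- c(0, 1, NA, 3)

cmp <- x == 0          # TRUE FALSE NA FALSE: the NA propagates through ==
idx <- which(cmp)      # 1: which() keeps only the positions that are TRUE

with_na    <- x[cmp]   # 0 NA: the NA in the index produces an NA element
without_na <- x[idx]   # 0

# Assigning into a data frame with an NA in the row index is an error,
# which is another reason to go through which().
df <- data.frame(v = x)
res <- tryCatch({ df[cmp, "v"] <- 99; "ok" },
                error = function(e) "error")

df[idx, "v"] <- 99     # works: only row 1 is changed
print(res)
print(df$v)
```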

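# keep only the meta rows that describe logical columns
logical_meta <- meta[meta$type == 'logical', ]

data_out <- data  # initialize the output as a copy of the input

# seq_len() rather than 1:nrow(), so the loop is skipped if there are no logical columns
for (i in seq_len(nrow(logical_meta))) {
  colname <- as.character(logical_meta$name[i])  # as.character only needed if name is a factor
  data_out[, colname] <- NA  # values that match neither code stay NA
  # false code first; an NA code means that NA in data stands for FALSE
  if (is.na(logical_meta$false[i])) {
    data_out[is.na(data[, colname]), colname] <- FALSE
  } else {
    data_out[which(data[, colname] == logical_meta$false[i]),
             colname] <- FALSE
  }
  # true code next; an NA code means that NA in data stands for TRUE
  if (is.na(logical_meta$true[i])) {
    data_out[is.na(data[, colname]), colname] <- TRUE
  } else {
    data_out[which(data[, colname] == logical_meta$true[i]),
             colname] <- TRUE
  }
}

data_out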
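
Since the data sets are large, it may be worth trying the same logic as a per-column helper applied with Map instead of row-indexed assignments into the data frame. The following is only a sketch of that alternative, not a benchmarked replacement: recode_logical is a hypothetical helper name, and it uses %in% in place of which(... == ...) because %in% never returns NA. The toy data is repeated so that the block runs on its own.

```r
meta <- data.frame(name = c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                   type = c('char', 'logical', 'logical', 'char'),
                   false = c(NA, NA, 1, NA),
                   true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)

data <- data.frame(a.char.var = c('aa', 'ab', 'ac', 'ad'),
                   a.logical.var = c(7, NA, 7, 4),
                   another.logical.var = c(0, 1, NA, 3),
                   another.char.var = c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)

# Hypothetical helper: recode one column given its FALSE and TRUE codes.
# An NA code means that NA in the data stands for that logical value.
recode_logical <- function(x, false_code, true_code) {
  out <- rep(NA, length(x))  # anything that matches neither code stays NA
  is_false <- if (is.na(false_code)) is.na(x) else x %in% false_code
  is_true  <- if (is.na(true_code))  is.na(x) else x %in% true_code
  out[is_false] <- FALSE
  out[is_true]  <- TRUE
  out
}

logical_meta <- meta[meta$type == 'logical', ]
cols <- as.character(logical_meta$name)

data_out <- data
# Map walks the selected columns and their codes in parallel.
data_out[cols] <- Map(recode_logical, data[cols],
                      logical_meta$false, logical_meta$true)

data_out
```

On the toy data this gives the same a.logical.var and another.logical.var columns as the loop. As in the loop, the TRUE code is applied last, so it wins if both codes of a row are the same.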