Adam Liter - 26 days ago
R Question

R: Accidentally subsetting a data frame using a factor column as if it were logical

I inherited some legacy R code that recodes values in one column based on the value of another column in the same row. That other column was mistakenly assumed to hold boolean values when, in reality, it holds strings that were converted to factors, like so:

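df <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                 reversed = c("true", "false",
                              "true", "true",
                              "false", "false"),
                 # needed in R >= 4.0.0, where strings are no longer
                 # converted to factors by default
                 stringsAsFactors = TRUE)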

str(df)
#> 'data.frame': 6 obs. of 2 variables:
#> $ value : num 1 2 3 4 5 6
#> $ reversed: Factor w/ 2 levels "false","true": 2 1 2 2 1 1

df$recoded_value <- df$value
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]


If you inspect the output, it is not what was intended. `df[2, "recoded_value"]` is 5, but the intent is for it to be 2. Moreover, `df[3, "recoded_value"]` is 3, but the intent is for it to be 4.

I would like to understand what is going on here. My first hypothesis was that R was treating one factor level as `TRUE` and the other as `FALSE`. But this is obviously not the case because identical factor levels are not being treated identically:

df[c(1,3), ]
#> value reversed recoded_value
#> 1 1 true 6
#> 3 3 true 3

df[c(2,5), ]
#> value reversed recoded_value
#> 2 2 false 5
#> 5 5 false 5


What is going on here?

To clarify: I'm not interested in solutions to the problem. I know how to fix the code to produce the intended results. I would like to understand:


  1. Why does this code work at all? How can you subset on the basis of a factor column? What is `[` doing to even allow this?

  2. Why are the things that are the same value (i.e., same level of a factor) being treated differently?


Answer

As mentioned in the post, 'reversed' is a factor, not a logical vector. In R, only TRUE/FALSE values are logical, so the column has to be converted to a logical vector:

df$reversed <- df$reversed=="true"

As for why the OP's code gives unexpected output, look at the original factor column (before the conversion above):

df$reversed
#[1] true  false true  true  false false
#Levels: false true

The levels are in alphabetical order, and a factor is stored internally as integer codes that index into those levels, i.e.

as.integer(df$reversed)
#[1] 2 1 2 2 1 1
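The following is a minimal sketch of what a factor looks like underneath. The vectors `f` and `x` are hypothetical stand-ins made up for illustration, not part of the OP's code; `f` mirrors the 'reversed' column.

```r
# Hypothetical standalone factor mirroring the 'reversed' column
f <- factor(c("true", "false", "true", "true", "false", "false"))

typeof(f)    # "integer": the codes are what is actually stored
levels(f)    # "false" "true": sorted, so "false" is code 1 and "true" is code 2
unclass(f)   # the bare integer codes, carrying a 'levels' attribute

# Hypothetical unnamed numeric vector to index into
x <- c(10, 20, 30, 40, 50, 60)

by_code  <- x[f]                 # indexes by the codes, same as x[as.integer(f)]
by_label <- x[as.character(f)]   # indexes by name; x has no names, so all NA

by_code
by_label
```

This is the behaviour described in `?Extract`: indexing by a factor is allowed and uses the integer codes, not the printed labels. That is why `[` accepts the factor column without an error.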

So when we subset 'recoded_value' using 'reversed', R indexes by these integer codes rather than by a logical mask (shown here on a fresh copy, where 'recoded_value' still equals 'value'):

df$recoded_value[df$reversed]
#[1] 2 1 2 2 1 1

i.e. the first value in the output is the second observation of 'recoded_value', the second value is the first observation, and so on. If we instead use the correct logical index, we get only the rows where 'reversed' is "true":

df$recoded_value[df$reversed=="true"]
#[1] 1 3 4
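The same integer indexing also explains why rows with the same level end up with different results in the OP's assignment. The sketch below replays that assignment on plain vectors; the names `recoded`, `idx` and `rhs` are made up for illustration.

```r
# Same starting values as the question's 'recoded_value' column
recoded <- c(1, 2, 3, 4, 5, 6)

# Integer codes of the 'reversed' factor ("false" = 1, "true" = 2)
idx <- c(2L, 1L, 2L, 2L, 1L, 1L)

# Right-hand side: 7 minus elements 2, 1, 2, 2, 1, 1
rhs <- 7 - recoded[idx]

# Left-hand side: only positions 1 and 2 are ever written to.
# Repeated indices are assigned in order, so the last write wins.
recoded[idx] <- rhs

rhs
recoded
```

Only elements 1 and 2 are modified (to 6 and 5), while elements 3 to 6 keep their original values. Row 1 and row 3 are both "true", but only row 1 happens to sit at a position that one of the codes points to, so identical levels are not treated identically.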

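Let's check how the recoding behaves once 'reversed' is a logical vector, starting again from the original data frame:

df$reversed <- df$reversed == "true"
df$recoded_value <- df$value  # reset, in case the faulty recode was already run
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]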
df[c(1,3), ]
#  value reversed recoded_value
#1     1     TRUE             6
#3     3     TRUE             4
df[c(2,5),]
#  value reversed recoded_value
#2     2    FALSE             2
#5     5    FALSE             5