Adam Liter - 1 year ago 62
R Question

# R: Accidentally subsetting a data frame using a factor column as if it were logical

I inherited some legacy R code that recodes values in one column based on the value of another column in the same row. That other column was mistakenly assumed to hold boolean values when, in reality, it held strings that had been converted to a factor, like so:

``````
df <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                 reversed = c("true", "false",
                              "true", "true",
                              "false", "false"),
                 # strings were converted to factors by default before R 4.0.0
                 stringsAsFactors = TRUE)

str(df)
#> 'data.frame':    6 obs. of  2 variables:
#>  $ value   : num  1 2 3 4 5 6
#>  $ reversed: Factor w/ 2 levels "false","true": 2 1 2 2 1 1

df$recoded_value <- df$value
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]
``````

Inspecting the results shows that this does not do what was intended: `df[2, "recoded_value"]` is 5, but the intent is for it to be 2. Moreover, `df[3, "recoded_value"]` is 3, but the intent is for it to be 4.

I would like to understand what is going on here. My first hypothesis was that R was treating one factor level as `TRUE` and the other as `FALSE`. But this is obviously not the case because identical factor levels are not being treated identically:

``````
df[c(1,3), ]
#>   value reversed recoded_value
#> 1     1     true             6
#> 3     3     true             3

df[c(2,5), ]
#>   value reversed recoded_value
#> 2     2    false             5
#> 5     5    false             5
``````

What is going on here?

To clarify: I'm not interested in solutions to the problem. I know how to fix the code to produce the intended results. I would like to understand:

1. Why does this code work at all? How can you subset on the basis of a factor column? What is `[` doing to even allow this?

2. Why are the things that are the same value (i.e., same level of a factor) being treated differently?

As mentioned in the post, `reversed` is a `factor` and not a `logical` vector. In R, only `TRUE`/`FALSE` values are logical, so convert the column to a `logical` vector first:

``````
df$reversed <- df$reversed=="true"
``````

As for why the OP's code gives unexpected output, look at the column as it was before that conversion:

``````
df$reversed
#[1] true  false true  true  false false
#Levels: false true
``````

the `levels` are in alphabetical order and the storage mode of a `factor` is `integer`, i.e.

``````
as.integer(df$reversed)
#[1] 2 1 2 2 1 1
``````
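A minimal sketch of the point above, using a standalone factor `f` and a vector `x` that are made up for illustration (they are not part of the OP's data frame). It suggests that indexing with a factor gives the same result as indexing with its integer codes:

```r
# Hypothetical illustration data, mirroring the 'reversed' column
f <- factor(c("true", "false", "true", "true", "false", "false"))
x <- c(10, 20, 30, 40, 50, 60)

codes <- as.integer(f)   # level "false" is code 1, level "true" is code 2

by_factor <- x[f]        # `[` uses the factor's integer codes as positions
by_codes  <- x[codes]    # explicit positional indexing with the same codes

identical(by_factor, by_codes)
```

So the factor is never interpreted as true/false; its labels play no role in the subsetting at all.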

So when we subset 'recoded_value' using 'reversed', `[` treats the factor as its integer codes and uses them as positions:

``````
df$recoded_value[df$reversed]
#[1] 2 1 2 2 1 1
``````

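i.e. the first value in the output is the second observation of 'recoded_value', the second value is the first observation, and so on. Only positions 1 and 2 are ever selected, so the assignment only overwrites rows 1 and 2, whatever level those rows have. If we instead use the correct logical index: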

``````
df$recoded_value[df$reversed=="true"]
#[1] 1 3 4
``````
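To see why rows with the same level end up treated differently, here is a small sketch that replays the OP's assignment with the integer codes written out by hand. The names `recoded`, `idx` and `rhs` are introduced only for this illustration:

```r
recoded <- c(1, 2, 3, 4, 5, 6)           # copy of the 'value' column
idx <- c(2L, 1L, 2L, 2L, 1L, 1L)         # integer codes of the factor 'reversed'

rhs <- 7 - recoded[idx]                  # 7 - c(2, 1, 2, 2, 1, 1)
recoded[idx] <- rhs                      # writes only to positions 1 and 2

touched <- sort(unique(idx))             # positions that were overwritten
recoded
```

Position 1 receives `7 - 1` and position 2 receives `7 - 2`, while positions 3 to 6 are never written to. That matches the 6 and 5 seen in rows 1 and 2 of the question, and the untouched 3 in row 3.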

Let's check how this behaves with the converted 'reversed' column:

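``````
# starting from the original df, where 'reversed' is still a factor
df$recoded_value <- df$value              # reset to the unrecoded values
df$reversed <- df$reversed=="true"        # factor -> logical
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]
df[c(1,3), ]
#  value reversed recoded_value
#1     1     TRUE             6
#3     3     TRUE             4
df[c(2,5),]
#  value reversed recoded_value
#2     2    FALSE             2
#5     5    FALSE             5
``````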
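Since `[` accepts a factor index silently, one way to catch this kind of mistake early is to check the index type before recoding. The sketch below is only an illustration: `recode_reversed` and its `scale_sum` argument are hypothetical names, not something from the OP's code.

```r
# Hypothetical helper: refuses any index that is not a logical vector
recode_reversed <- function(value, reversed, scale_sum = 7) {
  stopifnot(is.logical(reversed),
            length(reversed) == length(value),
            !anyNA(reversed))
  out <- value
  out[reversed] <- scale_sum - value[reversed]
  out
}

value <- c(1, 2, 3, 4, 5, 6)
reversed <- factor(c("true", "false", "true", "true", "false", "false"))

ok <- recode_reversed(value, reversed == "true")               # logical index: accepted
bad <- try(recode_reversed(value, reversed), silent = TRUE)    # factor index: stops with an error
```

With the guard in place, passing the factor directly fails loudly instead of quietly overwriting rows 1 and 2.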