JD Long - 2 months ago
R Question

Factors in R: more than an annoyance?

One of the basic data types in R is factors. In my experience factors are basically a pain and I never use them. I always convert to characters. I feel oddly like I'm missing something.

Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?

Answer

You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that read.table and read.csv set stringsAsFactors = TRUE by default, and most users miss this subtlety. (That was the default before R 4.0.0; since 4.0.0 the default is FALSE.) I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. Graphing packages also use them to group by. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you can end up with warnings in your output:

> lm(Petal.Length ~ -1 + Species, data=iris)

Call:
lm(formula = Petal.Length ~ -1 + Species, data = iris)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica  
            1.462              4.260              5.552  

> iris.alt <- iris
> iris.alt$Species <- as.character(iris.alt$Species)
> lm(Petal.Length ~ -1 + Species, data=iris.alt)

Call:
lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica  
            1.462              4.260              5.552  

Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'Species' converted to a factor
> 
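To see the stringsAsFactors point in isolation, here is a minimal sketch that sets the argument explicitly, so it does not depend on the default of the installed R version. The inline CSV text and the names csv_text, as_chr and as_fct are made up for illustration.

```r
# Hypothetical inline data; read.csv() accepts it through the text argument.
csv_text <- "id,group\n1,a\n2,b\n3,a"

# Same input, read once as character and once as factor.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$group)    # "character"
class(as_fct$group)    # "factor"
levels(as_fct$group)   # "a" "b"
```

Being explicit about the argument is probably the easiest way to avoid surprises when a script moves between R versions.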

One tricky thing is the whole drop=TRUE bit. With factor vectors it works well: it removes levels that no longer appear in the data. For example:

> s <- iris$Species
> s[s == 'setosa', drop=TRUE]
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa
> s[s == 'setosa', drop=FALSE]
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
> 

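However, with data frames, the behavior of [.data.frame() is different: see this email or ?[.data.frame (in backticks, which StackOverflow won't let me escape). There, drop controls whether a single-column result is simplified to a vector, not whether unused factor levels are removed, so drop=TRUE on data frames does not work as you'd imagine: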

> x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
> x$Species
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
> 
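As a rough illustration of what drop does in data frame indexing, the sketch below selects a single column both ways. The object names one_col_df and one_col_vec are only for this example.

```r
keep <- iris$Species == "setosa"

# drop = FALSE keeps the data frame structure even for one column.
one_col_df <- iris[keep, "Species", drop = FALSE]

# drop = TRUE simplifies the one-column result to a plain vector (here a factor).
one_col_vec <- iris[keep, "Species", drop = TRUE]

class(one_col_df)      # "data.frame"
class(one_col_vec)     # "factor"
nlevels(one_col_vec)   # 3: the unused levels are still there
```

In both cases the factor keeps all three levels, which matches the transcript above.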

Luckily, droplevels() (available since R 2.12) makes this easy: it drops unused levels from an individual factor or from every factor in a data frame:

> x <- subset(iris, Species == 'setosa')
> levels(x$Species)
[1] "setosa"     "versicolor" "virginica" 
> x <- droplevels(x)
> levels(x$Species)
[1] "setosa"

This is how to keep levels you've filtered out from showing up in ggplot legends.
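Unused levels also show up outside of plots. A small sketch on a single factor, using table() to make them visible (setosa_only is just an illustrative name):

```r
s <- iris$Species
setosa_only <- s[s == "setosa"]

# Unused levels are still counted, with a count of zero.
table(setosa_only)

# droplevels() works on a single factor as well as on a data frame.
table(droplevels(setosa_only))

# Calling factor() again on the subset has the same effect.
nlevels(factor(setosa_only))   # 1
```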

Internally, a factor is an integer vector with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. Changing a level name means editing one entry in that levels vector; with character strings you would have to rewrite every matching element, which is much less efficient. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
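A short sketch of that last paragraph, on toy data invented for this example (the vectors f, ch and g are not from the question):

```r
# Hypothetical toy data.
f <- factor(c("lo", "hi", "lo", "hi", "lo"))

codes <- as.integer(f)   # integer codes: 2 1 2 1 2
lv <- levels(f)          # "hi" "lo"

# Renaming a level edits one entry of the levels vector,
# however long the factor is.
levels(f)[levels(f) == "lo"] <- "Low"
renamed <- as.character(f)   # "Low" "hi" "Low" "hi" "Low"

# With a character vector, one inconsistent edit silently
# creates a new distinct value.
ch <- c("lo", "hi", "lo", "hi", "lo")
ch[1] <- "Low"
n_distinct_chr <- length(unique(ch))   # 3

# A factor refuses a value that is not an existing level: the element
# becomes NA and R warns (the warning is suppressed here to keep output quiet).
g <- factor(c("lo", "hi", "lo"))
suppressWarnings(g[1] <- "Low")
is.na(g[1])   # TRUE
```

The NA plus warning is arguably the useful part: the mistake is flagged at assignment time instead of surfacing later as an extra group.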
