JWilliman - 29 days ago
R Question

Recode, collapse, and order factor levels using a single function with regex matching

I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:

  • Resorting levels – not just to set a reference category, but also to put all levels in a logical (non-alphabetical) order for summary tables.
    x <- factor(x, levels = new.order)

  • Recode / rename factor levels – to simplify names and/or collapse multiple categories into one group. For one-to-one recoding
    levels(x) <- new.levels(x)
    , see here or here for examples.
    can perform several one-to-many matches in a single statement, but doesn't support regex matching.

  • Drop levels – not just drop unused levels, but also set some levels to missing (e.g. those with error codes).
    x <- factor(as.character(x), exclude = drop.levels)

  • Add levels – to show categories with zero counts.
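As a point of comparison, here is a minimal base R sketch of the four operations listed above, done one at a time. The toy factor and its level names are illustrative assumptions, not data from the question.

```r
# Toy factor; the values and level names are made up for illustration.
x <- factor(c("low", "high", "error", "medium", "low"))
levels(x)  # alphabetical by default

# Resort: put the levels in a logical order
x1 <- factor(x, levels = c("low", "medium", "high", "error"))

# Recode / collapse: assigning a repeated name merges levels
x2 <- x1
levels(x2) <- c("low", "high", "high", "error")  # "medium" joins "high"

# Drop: set the "error" level to missing
# (going through as.character() re-sorts the remaining levels alphabetically)
x3 <- factor(as.character(x2), exclude = "error")

# Add: append a level with no observations so it shows up in tables
x4 <- factor(x3, levels = c(levels(x3), "very high"))
table(x4)
```

Each step needs a different idiom, which is the inconsistency the question is about.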

What would be great is to have a single function that can do all of the above at once, allows fuzzy (regex) matching for recoding and dropping factors, can be used within other functions (eg.
), and has a simple (consistent) syntax.

I’ve posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.


I've been made aware of the forcats package, which is subtitled Tools for working with Categorical Variables (Factors). The package has many options for resorting levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). However, it doesn't, as yet, support regex matching.


Here is my best attempt.

xfactor <- function(x, replace = NULL, drop = FALSE, ignore.case = FALSE, ...) {

  # Coerce to factor if not already (incorporating standard factor arguments)
  if (!is.factor(x))
    x <- factor(x, ...)

  if (!is.null(replace)) {

    # Recode factor levels
    if (!is.null(names(replace))) {
      # Unnamed elements stand for themselves
      names(replace)[names(replace) == ""] <- replace[names(replace) == ""]
      levels.tmp <- levels(x)
      for (i in seq_along(replace))
        levels.tmp[grepl(replace[i], levels.tmp, ignore.case = ignore.case)] <- names(replace)[i]
      levels(x) <- levels.tmp
      # unique() guards against repeated new names, which factor() rejects as levels
      replace <- unique(names(replace))
    }

    # Reorder factor levels
    if (isTRUE(drop)) {
      # Drop levels not included in replace statement
      levels.new <- replace
    } else {
      # Reorder levels so those in replace statement come first
      levels.new <- c(replace, setdiff(levels(x), replace))
    }

  } else {
    levels.new <- levels(x)
  }

  # Drop all levels listed in drop statement (converting vectors to regex expressions)
  if (!is.logical(drop)) {
    levels.new <- levels.new[!grepl(paste(drop, collapse = "|"), levels.new)]
  }

  # Output factor
  return(factor(x, levels = levels.new))
}
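The recoding step inside the function can be looked at in isolation. The sketch below applies a named vector of patterns to a plain character vector of level names; `old`, `patterns`, and `chained` are illustrative names, not part of the function.

```r
# Old level names and a named vector of patterns (names = new levels).
# Both are illustrative assumptions.
old <- c("catfish", "dirt", "dogfish", "mouse", "rabbit")
patterns <- c(Sea = "fish", Land = "rab|mou")

new <- old
for (i in seq_along(patterns)) {
  hit <- grepl(patterns[i], new)   # match against the *current* names
  new[hit] <- names(patterns)[i]
}
new

# Assigning the result to levels() collapses the repeated names
f <- factor(old)
levels(f) <- new
levels(f)

# Patterns are applied in sequence to already-renamed levels, so a later
# pattern can catch a name assigned by an earlier one.
chained <- old
for (p in list(c(Sea = "fish"), c(Water = "^Sea$"))) {
  chained[grepl(p, chained)] <- names(p)
}
chained
```

The sequential matching is worth keeping in mind when a new level name could itself match a later pattern.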

Create example factor

x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))
levels(x)
[1] "catfish" "dirt"    "dogfish" "mouse"   "rabbit" 

Factor levels can be reordered by passing an unnamed vector to the replace argument. Levels not included in replace are moved to the end, or dropped if drop = TRUE.

xfactor(x, replace = c("mouse", "rabbit"))
[1] dogfish rabbit  catfish mouse   dirt   
Levels: mouse rabbit catfish dirt dogfish

xfactor(x, replace = c("mouse", "rabbit"), drop = TRUE)
[1] <NA>   rabbit <NA>   mouse  <NA>  
Levels: mouse rabbit
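For reference, the two reordering behaviours shown above correspond roughly to the following base R calls. This is a sketch of the idea, with `first`, `moved`, and `kept` as illustrative names.

```r
x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))
first <- c("mouse", "rabbit")

# Listed levels first, remaining levels after them in their existing order
moved <- factor(x, levels = c(first, setdiff(levels(x), first)))
levels(moved)

# Keeping only the listed levels turns every other value into NA
kept <- factor(x, levels = first)
kept
```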

Factor levels can be recoded, collapsed, and ordered by passing a named vector to the replace argument, where the vector names are the new factor levels and the vector values are regular expressions matching the old levels. Old levels that map to the same new level are collapsed into one.

xfactor(x, replace = c("Sea" = "fish", "Land" = "rab|mou"))
[1] Sea  Land Sea  Land dirt
Levels: Sea Land dirt    

Factor levels can be dropped by passing a regular expression (or a vector of them) to the drop argument.

xfactor(x, drop = "fish")
[1] <NA>   rabbit <NA>   mouse  dirt  
Levels: dirt mouse rabbit
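When drop is a vector, its elements are joined with "|" into a single alternation pattern before matching. A small sketch of that step on its own, with the two patterns chosen purely for illustration:

```r
x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))
drop <- c("fish", "^di")                  # illustrative patterns

pattern <- paste(drop, collapse = "|")    # one alternation pattern
keep <- levels(x)[!grepl(pattern, levels(x))]

# Values whose level was removed become NA
y <- factor(x, levels = keep)
y
```

Because the elements are treated as regular expressions, literal level names containing characters such as "." or "(" would need escaping first.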

The function will work within other functions

library(dplyr)  # provides %>% and mutate()

df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, replace = c("Sea" = "fish", "Land" = "rab|mou", "Air"), drop = "di"))
  n       x    y
1 1 dogfish  Sea
2 2  rabbit Land
3 3 catfish  Sea
4 4   mouse Land
5 5    dirt <NA>
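In the call above, "Air" matches no existing level, so it ends up as an empty level of y. That is how the function covers the "add levels" case: the empty level still appears in summary tables. A self-contained sketch, rebuilding y by hand rather than calling the function:

```r
# y rebuilt by hand to mirror the column shown above (an assumption for
# illustration; it does not call xfactor)
y <- factor(c("Sea", "Land", "Sea", "Land", NA),
            levels = c("Sea", "Land", "Air"))

counts <- table(y, useNA = "ifany")
counts                     # "Air" is listed with a count of zero

# droplevels() would remove the empty level again
nlevels(droplevels(y))
```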