Ari B. Friedman Ari B. Friedman - 3 months ago 18
R Question

Idiom for ifelse-style recoding for multiple categories

I run across this often enough that I figure there has to be a good idiom for it. Suppose I have a data.frame with a bunch of attributes, including "product." I also have a key which translates products to brand + size. Product codes 1-3 are Tylenol, 4-6 are Advil, 7-9 are Bayer, 10-12 are Generic.

What's the fastest (in terms of human time) way to code this up?

I tend to use nested

ifelse
's if there are 3 or fewer categories, and type out the data table and merge it in if there are more than 3. Any better ideas? Stata has a
recode
command
that is pretty nifty for this sort of thing, although I believe it promotes data-code intermixing a little too much.

dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L,
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA,
-20L), class = "data.frame")

Answer

One could use a list as an associative array to define the brand -> product code mapping, i.e.:

brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)

Once you have this, you can then either invert this to create a product code -> brand list (could take a lot of memory), or just use a search function:

find.key <- function(x, li, default=NA) {
    ret <- rep.int(default, length(x))
    for (key in names(li)) {
        ret[x %in% li[[key]]] <- key
    }
    return(ret)
}

I'm sure there are better ways of writing this function (the for loop is annoying me!), but at least it is vectorised, so it only requires a single pass through the list.

Using it would be something like:

> dat$brand <- find.key(dat$product, brands)
> dat
   product   brand
1       11 Generic
2       11 Generic
3        9   Bayer
4        9   Bayer
5        6   Advil
6        1 Tylenol
7       11 Generic
8        5   Advil
9        7   Bayer
10      11 Generic
11       5   Advil
12      11 Generic
13       4   Advil
14       3 Tylenol
15      10 Generic
16       7   Bayer
17      10 Generic
18       5   Advil
19       9   Bayer
20       8   Bayer

The recode and levels<- solutions are very nice, but they are also significantly slower than this one (and once you have find.key this is easier-for-humans than recode and on par with the levels<-):

> microbenchmark(
     recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"), 
     find.key=find.key(dat$product, brands),
     levels=`levels<-`(factor(dat$product),brands))
Unit: microseconds
      expr      min        lq    median        uq      max
1 find.key   64.325   69.9815   76.8950   83.8445  221.748
2   levels  240.535  248.1470  274.7565  306.8490 1477.707
3   recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938

(I can't get the switch version to benchmark properly, but it appears to be faster than all of the above, although it is even worse-for-humans than the recode solution.)