Johnny Johansson - 2 months ago
R Question

"exclude" argument in `factor()` does not work

I'm a little confused about how this code should work:

foo <- factor(c("a", "b", "a", "c", "a", "a", "c", "c"))
#[1] a b a c a a c c
#Levels: a b c

factor(foo, exclude = "a")
#[1] a b a c a a c c
#Levels: a b c
#Warning message:
#In as.vector(exclude, typeof(x)) : NAs introduced by coercion

Shouldn't it return a factor with all "a" values converted to NA? If not, how can I achieve that effect?

Answer

As I said in my comment, at the moment exclude only works for

factor(as.character(foo), exclude = "a")

rather than

factor(foo, exclude = "a")

Note that the documentation in ?factor under R 3.3.1 is not at all satisfactory:

exclude: a vector of values to be excluded when forming the set of
         levels.  This should be of the same type as ‘x’, and will be
         coerced if necessary.

The following give no warning or error, but they have no effect either:

## foo is a factor with `typeof` being "integer"
factor(foo, exclude = 1L)
factor(foo, exclude = factor("a", levels = levels(foo)))
#[1] a b a c a a c c
#Levels: a b c

Actually, the documentation appears quite contradictory, as it also reads:

The encoding of the vector happens as follows.  First all the
values in ‘exclude’ are removed from ‘levels’. 

so it looks like the developer really expects exclude to be a "character".


This looks more like a bug inside factor. The cause is fairly evident: the following line inside factor(x, ...) goes wrong when the input vector x is of class "factor":

exclude <- as.vector(exclude, typeof(x))

because in that case typeof(x) is "integer". If exclude is a string, coercing it to an integer produces NA (hence the warning), so none of the levels match it and nothing is excluded.
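To make the coercion step concrete, here is a minimal sketch that reproduces it outside of factor. The variable names storage_type and coerced are only illustrative, and the warning is silenced on purpose so the snippet runs quietly.

```r
foo <- factor(c("a", "b", "a", "c", "a", "a", "c", "c"))

# A factor is stored as integer codes plus a "levels" attribute,
# so typeof() reports the storage type, not "character".
storage_type <- typeof(foo)

# Coercing the string "a" to that storage type cannot succeed: R returns NA
# and raises "NAs introduced by coercion" (suppressed here).
coerced <- suppressWarnings(as.vector("a", storage_type))

print(storage_type)
print(coerced)
```

By the time the levels are filtered, exclude therefore no longer holds "a" at all.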

I really have no idea why that line is in factor. Without it, the next two lines do the right thing:

    x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
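The level filter can be tried in isolation. The sketch below wraps it in a helper called keep_levels, which is a hypothetical name used only for illustration and is not part of base R; it compares a character exclude with the NA that the coercion produces.

```r
lev <- c("a", "b", "c")

# Hypothetical helper: the same filter expression that factor() uses,
# keeping every level that has no match in 'exclude'.
keep_levels <- function(lev, exclude) lev[is.na(match(lev, exclude))]

# With a character 'exclude', the level "a" is dropped.
kept_with_string <- keep_levels(lev, "a")

# With the NA left over from the failed coercion, nothing matches,
# so every level is kept.
kept_with_na <- keep_levels(lev, NA_integer_)

print(kept_with_string)
print(kept_with_na)
```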

So a simple remedy is to eliminate this line:

my_factor <- function (x = character(), levels, labels = levels, exclude = NA, 
                       ordered = is.ordered(x), nmax = NA) 
{
    if (is.null(x)) 
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        y <- as.character(y)
        levels <- unique(y[ind])
    }
    force(ordered)
    #exclude <- as.vector(exclude, typeof(x))
    x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx)) 
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL))) 
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d", 
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL) 
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}

Let's have a test now:

my_factor(foo, exclude = "a")
#[1] <NA> b    <NA> c    <NA> <NA> c    c   
#Levels: b c

my_factor(as.character(foo), exclude = "a")
#[1] <NA> b    <NA> c    <NA> <NA> c    c   
#Levels: b c
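If redefining factor feels too heavy, the same result can probably be had with base functions alone. The following is a sketch of two such routes, not a recommendation of one over the other; the names via_character and via_droplevels are illustrative.

```r
foo <- factor(c("a", "b", "a", "c", "a", "a", "c", "c"))

# Route 1: go through character first, as suggested at the top of this answer.
via_character <- factor(as.character(foo), exclude = "a")

# Route 2: mark the unwanted entries as NA, then drop the now-unused level.
via_droplevels <- foo
via_droplevels[via_droplevels == "a"] <- NA
via_droplevels <- droplevels(via_droplevels)

print(via_character)
print(via_droplevels)
```

Both routes leave foo itself untouched and avoid passing a factor together with a character exclude to factor().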