Leo Leo - 2 years ago 75
R Question

Convert data frame columns to factor with indexing

I have some results that I put in a data frame. I have some factor columns and many numeric columns. I can easily convert the numeric columns to numeric with indexing, as per the answer to this question.

#create example data
df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

#find columns that are factors
factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames

#convert non-factor columns to numeric
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))


But when I want to do the same for my factor columns, I can't get the same indexing to work:

#convert factor columns to factor
df[, factorfilt] = as.factor(as.character(unlist(df[, factorfilt])))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(as.character(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(unlist(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(df[, factorfilt])

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?


All of these return "character" if I call class(df$X1), while if I run df$X1 = as.factor(df$X1) it returns "factor".

Why does indexing this way not work when I call as.factor, but does if I call as.numeric?

Answer Source

It helps to look at what each step of your code actually returns. Defining your data as you did:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))

Now let's take a look at the result of one of your attempts to convert X1, X2, and X3 to factors, but let's not reassign it yet.

test <- as.factor(as.character(df[, factorfilt]))
class(test) # "factor"
length(test) # 3

The important thing to notice here is that test is not a data frame. It's a vector of length 3 that you are attempting to save over three columns of 1000 rows each. I think we should question the wisdom of converting a data frame to a vector in order to store it in a data frame.

Then consider your other assignment, the one that uses unlist:

test2 <- as.factor(as.character(unlist(df[, factorfilt])))
class(test2) # factor
length(test2) # 3000

Again, it's a factor, but it has a completely different length than test. R is being kind by letting you assign this back into df at all, and does so only because the lengths are compatible: 3000 values fill 1000 rows by 3 columns exactly. But when you try to push one factor into X1, X2, and X3, there's a big question about what to do with the factor levels. Should all three variables have the same levels? Should each variable only have the levels present within itself? Instead of attempting to declare what the "appropriate" choice is, R drops the factor structure during the assignment and hands the columns back as character for you to deal with on your own.
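A small sketch of that behaviour, using toy objects (f, m, and d are made up for illustration and are not part of the question's data). It assumes the multi-column assignment reshapes the vector much like matrix() does, which is consistent with the "character" results shown in the question; treat that as a plausible explanation rather than a statement about R's internals.

```r
# A single factor holding the values for two columns of two rows each.
f <- factor(c("a", "b", "a", "c"))

# Reshaping a factor into a matrix discards the factor class:
# the result is a plain character matrix.
m <- matrix(f, nrow = 2, ncol = 2)
typeof(m)     # "character"
is.factor(m)  # FALSE

# Assigning the same factor over two data frame columns at once
# leaves both columns as character, as seen in the question.
d <- data.frame(u = c("a", "b"), v = c("a", "c"), stringsAsFactors = FALSE)
d[, c("u", "v")] <- f
class(d$u)    # "character"

# Assigning to one column at a time keeps the factor.
d$u <- as.factor(d$u)
class(d$u)    # "factor"
```

Numeric values survive the same reshaping unchanged, which would explain why the as.numeric version appears to work while the as.factor version does not.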

The fact that manipulating columns this way has the potential to change classes unexpectedly is a good reason not to do it. This is evident in your assignment of the NAs. Let's revisit:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]

At this point, X4 through X1000 are all integer class columns. When you run

df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

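They are all now character columns. sapply simplifies its result into a single matrix, a matrix can hold only one type, and the character columns X2 and X3 force everything else to character. You then proceed to convert them to numeric, so they aren't even their original integer class anymore.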
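The coercion is easy to reproduce on a tiny data frame. This is a minimal sketch with made-up columns (id and n are illustrative names), comparing what sapply and lapply hand back for the same input:

```r
# One character column and one integer column.
d <- data.frame(id = c("a", "b", "c"), n = 1:3, stringsAsFactors = FALSE)

# sapply simplifies to one matrix, so every value is coerced to character.
m <- sapply(d, function(x) x)
is.matrix(m)  # TRUE
typeof(m)     # "character"

# lapply returns a list with one element per column, each keeping its class.
l <- lapply(d, function(x) x)
sapply(l, class)
```

Assigning the list from lapply back into the data frame therefore leaves n as an integer column, while assigning the matrix from sapply would turn it into character.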

If, instead, we use lapply

df[-1] <- lapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

the original classes are preserved and there's no need to convert them back to a numeric class. Similarly, we can readily convert X1 through X3 to factors with

df[, factorfilt] <- lapply(df[, factorfilt], as.factor)
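The same pattern on a toy data frame, as a sketch (d, d2, is_group, and shared are hypothetical names used only here). It also shows one way to answer the levels question explicitly: by default each column gets only its own levels, and passing a levels argument to factor gives every column the same set.

```r
# Two grouping columns stored as character, plus an integer column.
d <- data.frame(g1 = c("a", "b", "a"),
                g2 = c("x", "x", "y"),
                n  = 1:3,
                stringsAsFactors = FALSE)
is_group <- names(d) %in% c("g1", "g2")
d2 <- d  # untouched copy for the shared-levels variant below

# Column-wise conversion: each column keeps only the levels it contains.
d[is_group] <- lapply(d[is_group], as.factor)
levels(d$g1)  # "a" "b"
levels(d$g2)  # "x" "y"

# Variant: build one set of levels and apply it to every grouping column.
shared <- sort(unique(unlist(d2[is_group])))
d2[is_group] <- lapply(d2[is_group], factor, levels = shared)
levels(d2$g1) # "a" "b" "x" "y"
```

Either way the integer column n is left alone, because only the selected columns are replaced and each replacement is a vector of the right length for a single column.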

As a general rule, it is better to manipulate the data in columns as distinct columns. Once you begin assigning a single vector over multiple columns, you enter a dark world of mischief.
