Georg Heiler - 17 days ago
R Question

caret dummy-vars exclude target

How can I use dummy vars in caret without destroying my target variable?

set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)

dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)


This fails because the Purchase variable is missing from data2. If I instead recode it beforehand with
data$Purchase <- ifelse(data$Purchase == "CH",1,0)
caret complains that this is no longer a classification problem but a regression problem.

Answer

The example code has a few issues, which are pointed out in the comments of the code below. To answer your questions:

  • The result of ifelse here is a numeric vector, not a factor, so the train function defaults to regression.
  • To pass the dummyVars output directly to train, use the train(x = , y = , ...) interface instead of a formula.

To avoid these problems, check the class of your objects carefully.
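As a quick illustration of the first point, here is a minimal base R sketch. It uses a small made-up factor (the name purchase and its values are hypothetical stand-ins for data$Purchase, not the OJ data) to show what ifelse returns and one way to turn the result back into a factor.

```r
# Toy stand-in for data$Purchase (hypothetical values, not the OJ data)
purchase <- factor(c("CH", "MM", "CH", "MM", "CH"))

# ifelse() with 1/0 as the yes/no values returns a plain numeric vector
recoded <- ifelse(purchase == "CH", 1, 0)
class(recoded)       # "numeric"
is.factor(recoded)   # FALSE

# Converting back to a factor keeps the variable categorical
recoded_factor <- factor(recoded, levels = c(0, 1), labels = c("MM", "CH"))
class(recoded_factor)
levels(recoded_factor)
```

Running class() on the outcome before calling train is a cheap check of whether the problem will be treated as classification or regression.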

set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)

# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)

# See ?dummyVars. The dependent variable is not dummy-encoded, so leave it out of the formula and the data
dummies <- dummyVars(~., data = data[,-1]) 
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))

split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))

train <- data2[train_idx, ]
test <- data2[-train_idx, ]

# Option 1 (as asked): Specify independent and dependent variables separately
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))

# Option 2: Append the dependent variable to the training predictors (needs to be a data frame to allow factor and numeric)
# Use only the training rows so that the labels line up row by row with train
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda',preProcess=c('scale', 'center'))
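The pattern in the code above (encode the predictors only, convert the resulting matrix to a data frame, then re-attach the target) can be tried without caret. The sketch below uses base R's model.matrix as a stand-in for dummyVars, which is a swapped-in technique and not what the answer uses. The toy data frame df and its column names are hypothetical.

```r
# Toy data frame (hypothetical; stands in for the OJ data)
df <- data.frame(
  target = factor(c("CH", "MM", "CH", "MM")),
  store  = factor(c("a", "b", "c", "a")),
  price  = c(1.5, 2.0, 1.75, 2.25)
)

# Encode the predictors only: the target (column 1) is dropped before encoding.
# "- 1" removes the intercept so every level of store gets its own column.
x_mat <- model.matrix(~ . - 1, data = df[, -1])
is.matrix(x_mat)      # the encoder returns a matrix, not a data frame
colnames(x_mat)

# Convert to a data frame so a factor column can sit next to numeric ones
x_df <- data.frame(x_mat)

# Re-attach the target; both sides have the same number of rows,
# so no silent recycling of the labels can happen
stopifnot(nrow(x_df) == length(df$target))
x_df$target <- df$target
str(x_df)
```

The explicit row-count check is worth keeping in real code: assigning a shorter vector to a data frame column can be recycled silently when the lengths divide evenly, which would mislabel rows without raising an error.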