Georg Heiler - 2 months ago 17

R Question

How can I use dummy vars in caret without destroying my target variable?

```
set.seed(5)
data <- ISLR::OJ
data <- na.omit(data)
dummies <- dummyVars(Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit <- train(Purchase ~ ., method = 'lda', preProcess = c('scale', 'center'), data = train)
```

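This fails because the `Purchase` variable is missing from `data2`. If I instead recode it beforehand with

`data$Purchase <- ifelse(data$Purchase == "CH",1,0)`

the target is no longer a factor, so `train` treats the problem as regression.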

Answer

The example code has a few issues, which are pointed out in the comments of the code below. To answer your questions:

- The result of `ifelse` is a numeric vector, not a factor, so the `train` function defaults to regression.
- To pass the `dummyVars` output directly to the function, use the `train(x = , y = , ...)` interface instead of a formula.

To avoid these problems, check the `class` of your objects carefully.
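As a minimal sketch of that check, the snippet below uses a small made-up vector (`purchase` is a hypothetical stand-in for `data$Purchase`, not the real OJ data) to show what `ifelse` returns and one way to turn the recoded values back into a factor. It needs only base R.

```r
# Hypothetical toy target standing in for data$Purchase
purchase <- factor(c("CH", "MM", "CH", "MM"))

# ifelse() with numeric outcomes returns a plain numeric vector
recoded <- ifelse(purchase == "CH", 1, 0)
class(recoded)   # "numeric"

# Converting back to a factor restores a classification target
target <- factor(recoded, levels = c(0, 1), labels = c("MM", "CH"))
class(target)    # "factor"
levels(target)   # "MM" "CH"
```

If the target is already a factor, as `Purchase` is in `ISLR::OJ`, the simplest route is to leave it untouched and only encode the predictors.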

```
library(caret)

set.seed(5)
data <- ISLR::OJ
data <- na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID", "SpecialCH", "SpecialMM", "STORE")
data[, newFactorIndex] <- lapply(data[, newFactorIndex], factor)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error,
# so the target (Purchase, column 1) is left out here
dummies <- dummyVars(~ ., data = data[, -1])
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
split_factor <- 0.5
n_samples <- nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
# Option 1 (as asked): Specify independent and dependent variables separately
modelFit <- train(x = train, y = data$Purchase[train_idx], method = 'lda', preProcess = c('scale', 'center'))
# Option 2: Append dependent variable to the training predictors (needs to be a data frame to allow factor and numeric).
# Both sides must be subset with train_idx, otherwise the labels no longer line up with the rows
train$Purchase <- data$Purchase[train_idx]
modelFit <- train(Purchase ~ ., data = train, method = 'lda', preProcess = c('scale', 'center'))
```
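The core idea, encoding only the predictors and keeping the target as a separate factor that is split with the same row index, can be tried without caret or ISLR installed. The following is only a sketch under stated assumptions: `toy` and its columns are made-up illustration data, and base R's `model.matrix` is used as a stand-in for `dummyVars` (it is not the same function and encodes multiple factors differently).

```r
# Hypothetical toy data standing in for ISLR::OJ (names are illustrative only)
toy <- data.frame(
  Purchase = factor(c("CH", "MM", "CH", "MM", "CH", "MM")),
  Store    = factor(c("a", "b", "c", "a", "b", "c")),
  Price    = c(1.7, 1.9, 2.1, 1.8, 2.0, 2.2)
)

# Encode the predictors only; the target column is left out on purpose.
# model.matrix() is a base R stand-in here, not caret's dummyVars().
x <- as.data.frame(model.matrix(~ . - 1, data = toy[, -1]))
y <- toy$Purchase

# Split predictors and target with the same row index so they stay aligned
set.seed(5)
idx <- sample(seq_len(nrow(x)), size = floor(0.5 * nrow(x)))
x_train <- x[idx, ]
y_train <- y[idx]
x_test  <- x[-idx, ]
y_test  <- y[-idx]

# The target is still a factor and matches the training rows one to one
stopifnot(is.factor(y_train), nrow(x_train) == length(y_train))
```

With caret, pieces like `x_train` and `y_train` would go into `train(x = , y = , ...)`. Note that assigning `y[idx]` as a new column of the full `x` would not raise an error here: R recycles a shorter vector when the row count is an exact multiple of its length, so the labels would be silently misaligned.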