Nestorghh - 22 days ago
R Question

Is this stratified k-CV with caret?

I want to know how to program stratified k-CV by only using the caret package in R. See the example that follows:

library(mlbench)
library(caret)

data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]


folds <- createFolds(factor(training$Class), k = 10, list = TRUE)

fitControl <- trainControl(## 10-fold CV
                           method = "cv",
                           indexOut = folds,
                           savePredictions = "all")

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)

d <- gbmFit1$pred


Note that I am not specifying index but only indexOut. Does caret train the model with the complement of indexOut each time? By inspecting d, I could see that rowIndex matches the definition of each fold, but how can I confirm that the training set each time is the complement of the elements in fold i?

Answer

I found this interesting because I use caret all the time and had never thought about this direct question about index and indexOut. The help page ?trainControl says that indexOut, if NULL, will contain the unique set of samples not contained in index, but it does not say what happens the other way around, when only indexOut is given. So I dug into train.default to find out what was going on. When you assign

fitControl = trainControl(..., indexOut = ...)

you can check for yourself that is.null(fitControl$index) is TRUE. In the code for train.default there is a line (line 109 of the function definition in the version I looked at) that checks this condition and then, for "cv", calls createFolds with the argument returnTrain = TRUE. It does this without checking what you have set for indexOut.
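The check on index can be sketched as follows. This is a minimal illustration rather than the asker's exact setup: the two toy folds are hypothetical, and is.null() is used because a comparison with == NULL returns a zero-length logical instead of TRUE or FALSE.

```r
library(caret)

# Hypothetical toy folds, only to build a control object with indexOut set
toyFolds <- list(Fold1 = 1:5, Fold2 = 6:10)

fitControl <- trainControl(method = "cv", indexOut = toyFolds)

# index was never supplied, so it is still NULL at this point;
# train() is what fills it in later
index_is_null <- is.null(fitControl$index)
print(index_is_null)
```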

There appears to be no other code relevant to index and indexOut within train.default for this particular scenario, which suggests that nothing guarantees index$Fold01 and indexOut$Fold01 do not intersect.

We can examine this further:

intersect(gbmFit1$control$index$Fold01, gbmFit1$control$indexOut$Fold01)
## [1]  12  18  33  34  53  58  67  95 109 111 115 120 137 143 156

This came from running the exact code in your question. So index and indexOut are not complements of one another here: the rows listed above are used both for training and for held-out prediction in the first fold.
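To run the same check on every fold rather than just Fold01, something like the sketch below could be used. The helper name fold_overlap is hypothetical (it is not part of caret), and the toy folds are made up so the example runs without fitting a model.

```r
# Hypothetical helper: for each fold, count the rows that appear in both
# the training indices and the held-out indices
fold_overlap <- function(index, indexOut) {
  stopifnot(length(index) == length(indexOut))
  vapply(seq_along(index),
         function(i) length(intersect(index[[i]], indexOut[[i]])),
         integer(1))
}

# Toy data: 6 rows, 2 folds; the second fold deliberately shares row 3
index    <- list(Fold1 = c(1, 2, 3, 4), Fold2 = c(3, 4, 5, 6))
indexOut <- list(Fold1 = c(5, 6),       Fold2 = c(1, 3))

overlap <- fold_overlap(index, indexOut)
print(overlap)
```

With the fitted model from the question, the call would be fold_overlap(gbmFit1$control$index, gbmFit1$control$indexOut); a result of all zeros would mean the training and held-out sets never share a row.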

The safest way forward is to specify index rather than indexOut. As the help page says, indexOut then defaults to the samples not contained in index, which gives the desired effect.
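One way to do that is sketched below, assuming the same Sonar split as in the question. The names trainFolds, heldOut and overlaps are made up for illustration. Passing indexOut explicitly is optional here and is done only to make the complement visible.

```r
library(caret)
library(mlbench)

data(Sonar)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]

# Stratified folds, returned as the TRAINING rows of each fold
set.seed(825)
trainFolds <- createFolds(training$Class, k = 10, list = TRUE,
                          returnTrain = TRUE)

# Held-out rows built explicitly as the complement of each training set
heldOut <- lapply(trainFolds,
                  function(idx) setdiff(seq_len(nrow(training)), idx))

fitControl <- trainControl(method = "cv",
                           index = trainFolds,
                           indexOut = heldOut,
                           savePredictions = "all")

# Sanity check: no row is in both the training and held-out set of a fold
overlaps <- mapply(function(tr, ho) length(intersect(tr, ho)),
                   trainFolds, heldOut)
```

fitControl can then be passed to train() through trControl exactly as in the question.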
