mlal mlal - 4 months ago 50
R Question

Does predict function in caret package use future information when preprocessing?

My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation.
If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.

So when I use the predict function:
Does it mean that the mean and standard deviation of the training set are applied to the new data? Or are a new centering and scaling computed from the new data set, thus potentially using points in the future if the data are time series (which is problematic)?

Thank you


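caret's predict.train applies the preprocessing parameters stored in the model you built, which were estimated on the training set, to the new data. It does not re-estimate them from the data you predict on.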

Here is a snippet from the source code that shows the preProc data comes from the object's preProcess parameters:

out <- predictionFunction(method = object$modelInfo, 
            modelFit = object$finalModel, newdata = newdata, 
            preProc = object$preProcess)

You can see these parameters for yourself after creating your model by accessing object$preProcess. Here is a complete example:


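library(caret)

# Split mtcars into a training set (about 75%) and a test set
inTrain <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing <- mtcars[-inTrain, ]

# Centering and scaling parameters are estimated from the training data only
lmFit <- train(mpg ~ ., data = training, method = "lm", preProc = c("center", "scale"))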
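To check this behaviour directly, here is a minimal sketch that uses caret's standalone preProcess function rather than the train object. It assumes that the preprocessing stored by train behaves the same way as a preProcess object fitted on the training predictors. The seed and the variable names predictors, pp, testScaled and manual are illustrative and not part of the original answer.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for a reproducible split
inTrain  <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing  <- mtcars[-inTrain, ]

predictors <- setdiff(names(mtcars), "mpg")

# Estimate centering and scaling parameters on the training predictors only
pp <- preProcess(training[, predictors], method = c("center", "scale"))

# Apply the stored training parameters to the test set
testScaled <- predict(pp, testing[, predictors])

# Manual equivalent: subtract the training mean, divide by the training sd
manual <- scale(testing[, predictors],
                center = colMeans(training[, predictors]),
                scale  = apply(training[, predictors], 2, sd))

# Should agree if predict() reuses the training-set statistics
all.equal(as.matrix(testScaled), manual, check.attributes = FALSE)

# The scaled test columns are generally not centred at zero,
# because the means come from the training set
colMeans(testScaled)
```

If the two results agree, the test set was transformed with the training-set mean and standard deviation, so no information from the new (possibly future) data enters the preprocessing step.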