mlal mlal - 1 year ago 100
R Question

Does the predict function in the caret package use future information when preprocessing?

My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation.
If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.

So when I use the predict function:
Does this mean that the mean and standard deviation of the training set are applied to the new data? Or are a new center and scale computed from the new data set, which would potentially use points from the future if the data are time series (which is problematic)?

Thank you

Answer Source

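caret's predict.train applies the preprocessing parameters stored in the model you built (estimated from the training set) to the new data. It does not re-estimate the centering and scaling from the test set.

Here is a snippet from the source code showing that the preProc argument is taken from the fitted object's preProcess element: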

out <- predictionFunction(method = object$modelInfo,
                          modelFit = object$finalModel,
                          newdata = newdata,
                          preProc = object$preProcess)

You can see these parameters for yourself after creating your model by accessing object$preProcess. Here is a complete example:

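rm(list=ls())
library(caret)
set.seed(4444)

# Split mtcars into training (75%) and testing sets
data(mtcars)
inTrain <- createDataPartition(y=mtcars$mpg,p=0.75,list=FALSE)
training <- mtcars[inTrain,]
testing <- mtcars[-inTrain,]

# Fit a linear model; centering and scaling are estimated on the training data
lmFit <- train(mpg~.,data=training,method="lm",preProc=c("center","scale"))

# Inspect the stored preprocessing object that predict() reuses on new data
lmFit$preProcess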
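
To make the idea concrete, here is a minimal base-R sketch of what "applying the training set's center and scale to new data" means. It is an illustration only, not caret's internal code: the fixed row split and the variable names (train_x, test_x, train_mean, train_sd) are assumptions made for the example.

```r
# Base-R illustration only; this is not caret's internal code.
data(mtcars)

# Hypothetical fixed split: first 24 rows as training, last 8 as test.
predictors <- setdiff(names(mtcars), "mpg")
train_x <- mtcars[1:24, predictors]
test_x  <- mtcars[25:32, predictors]

# Centering and scaling parameters are estimated from the training rows only.
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# The same training parameters are applied to both sets.
train_scaled <- scale(train_x, center = train_mean, scale = train_sd)
test_scaled  <- scale(test_x,  center = train_mean, scale = train_sd)

# Training columns end up with mean 0 and sd 1.
# Test columns generally do not, because no test statistics were used.
round(colMeans(train_scaled), 10)
round(colMeans(test_scaled), 2)
```

Because the test rows never contribute to train_mean or train_sd, nothing from the "future" leaks into the transformation, which is the behaviour the question asks about.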