I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
When I ran this same code, I confirmed that it does perform 10 folds, but the final summary printed was the same as yours ("Sum over all 5 folds"). The "ms" reported there is the mean squared prediction error. Note also that the value of 3.75 is not the simple average of the 10 per-fold mean squares, which comes out to 3.67:
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
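The gap between 3.75 and 3.67 is consistent with DAAG pooling the squared errors: an overall ms computed as the total sum of squares across all folds divided by the total n is a weighted average of the fold mean squares, so folds with 6 observations count slightly more than folds with 5. A minimal base-R sketch with hypothetical fold results (not the actual values from this model) illustrates the difference:

```r
# Hypothetical per-fold mean squares and fold sizes (illustrative only)
fold_ms <- c(1.2, 6.0, 1.3)
fold_n  <- c(5, 6, 5)

# Simple (unweighted) average of the fold mean squares
simple_avg <- mean(fold_ms)                      # 2.833...

# Pooled mean square: total sum of squares / total n,
# i.e. a weighted average of fold_ms with weights fold_n
pooled_ms <- sum(fold_ms * fold_n) / sum(fold_n) # 3.03125

simple_avg
pooled_ms  # differs whenever fold sizes are unequal
```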
Notice that the average, as well as most individual folds, is higher than the "Residual standard error" (1.814). This is what we would expect, since the CV error estimates model performance on "test" data, i.e. data not used to train the model. For instance, in fold 10 the residuals are calculated on the 5 held-out observations that were not used to train that fold's model:
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
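As a sanity check, the fold-10 numbers above can be reproduced by hand: each CV residual is Height minus cvpred, and the fold's mean square is the sum of squared CV residuals divided by the fold size. In base R, using the values copied from the output:

```r
# Values copied from the fold-10 output above
height <- c(21.98, 22.32, 22.870, 17.12, 17.37)
cvpred <- c(20.15, 21.14, 22.964, 18.66, 17.86)

cv_resid <- height - cvpred   # matches the printed "CV residual" row
ss <- sum(cv_resid^2)         # ~7.36, printed as 7.37 due to rounding
ms <- ss / length(cv_resid)   # ~1.47, the fold's mean square
```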
It appears the warning we received may be common, too -- it also shows up in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
One thing I can suggest that may be useful to you: in the case of linear regression, there is a closed-form solution for leave-one-out cross-validation (LOOCV) that does not require actually fitting n separate models. The LOOCV residual for each observation is its ordinary residual divided by (1 - h), where h is that observation's leverage (hat value):
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat) # e_i / (1 - h_i)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this predicted R-squared value (0.574) is lower than the original R-squared of 0.6422.
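Because this shortcut can look like magic, it is worth verifying that dividing each residual by (1 - hat value) really gives the residual you would get by refitting without that observation. A self-contained check on R's built-in mtcars data (any lm fit works the same way):

```r
# Fit a linear model on a built-in dataset
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Closed-form LOOCV residuals via the hat (leverage) values
shortcut <- residuals(fit2) / (1 - lm.influence(fit2)$hat)

# Brute-force LOOCV: refit n times, predict the held-out row
brute <- sapply(seq_len(nrow(mtcars)), function(i) {
  f <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f, newdata = mtcars[i, ])
})

all.equal(unname(shortcut), unname(brute))  # TRUE: the shortcut is exact for lm
```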
To better convey the concept of RMSE, it is useful to look at the distribution of the predicted residuals:
hist(predictedresiduals)
RMSE can then be calculated as:
sqrt(mean(predictedresiduals^2))
(Equivalently sqrt(PRESS/n); sd(predictedresiduals) gives nearly the same number when the mean residual is close to zero, but it subtracts the mean and divides by n - 1.)
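Note that sd() centers the residuals and divides by n - 1, so it only approximates the RMSE sqrt(mean(residuals^2)); the two agree closely when the mean LOOCV residual is near zero. A quick base-R illustration with hypothetical residuals:

```r
# Hypothetical LOOCV residuals (illustrative only)
r <- c(1.8, -0.5, 0.9, -1.2, 0.3)

rmse    <- sqrt(mean(r^2))  # root mean squared error
sd_appx <- sd(r)            # close, but centers and uses n - 1

rmse     # ~1.08
sd_appx  # ~1.17, noticeably larger here because mean(r) = 0.26
```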