user4144725 - 3 years ago
R Question

How to perform k-fold cross-validation and understand the outputs

I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:

https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA

I used the following code:

library(DAAG)
six <- read.csv("six.csv")  # read in the data

fit <- lm(Height ~ GLCM.135 + Blue + NIR, data = six)  # fit the regression model
summary(fit)  # show results

CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR))  # 10-fold cross-validation


This produces the following output (summarized):

Sum of squares = 7.37 Mean square = 1.47 n = 5

Overall (Sum over all 5 folds)
ms
3.75

Warning message:

In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :

As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate


I do not understand what the ms value refers to, as I have seen different interpretations of it online. My understanding is that a k-fold cross-validation produces an overall RMSE value for a specified model (which is what I am trying to obtain for my research).

I also don't understand why the results report an "Overall (Sum over all 5 folds)" when I specified a 10-fold cross-validation in the code.

If anyone can help it would be much appreciated.

dmb
Answer

When I ran this same code, it did perform 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"), so that label looks like a quirk of the printout rather than evidence that only 5 folds ran. The "ms" is the mean squared prediction error. The overall value of 3.75 is also not exactly a simple average across all 10 folds (that gives 3.67):

msaverage <- (1.19 + 6.04 + 1.26 + 2.37 + 3.57 + 5.24 + 8.92 + 2.03 + 4.62 + 1.47) / 10  # the ten per-fold ms values
msaverage
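A likely explanation (an assumption on my part, since the printout above does not show every fold's size) is that the overall ms is the total sum of squares divided by the total n, i.e. a fold-size-weighted mean of the per-fold values, and folds need not all be the same size. A sketch with hypothetical fold sizes:

fold_ms <- c(1.19, 6.04, 1.26, 2.37, 3.57, 5.24, 8.92, 2.03, 4.62, 1.47)  # per-fold ms from the printout
fold_n  <- c(5, 4, 5, 5, 4, 5, 5, 5, 5, 5)  # hypothetical fold sizes
sum(fold_ms * fold_n) / sum(fold_n)  # a size-weighted mean can differ from mean(fold_ms)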

Notice that the average, as well as most of the individual folds, is higher than the "Residual standard error" (1.814). Keep in mind that ms is on a squared scale, so the like-for-like comparison is 1.814^2, roughly 3.29, which is still below the CV average of 3.67. This is what we would expect, as the CV error represents likely model performance on "test" data, not the data used to train the model. For instance, on fold 10 the residuals are calculated on the five held-out observations that were not used to train that fold's model:

fold 10 
Observations in test set: 5 
           12    14     26    54    56
Predicted   20.24 21.18 22.961 18.63 17.81
cvpred      20.15 21.14 22.964 18.66 17.86
Height      21.98 22.32 22.870 17.12 17.37
CV residual  1.83  1.18 -0.094 -1.54 -0.49

Sum of squares = 7.37    Mean square = 1.47    n = 5 
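If you would rather get the overall figures programmatically than read them off the printout, CVlm invisibly returns the input data frame with the cross-validated predictions attached. A sketch, assuming your DAAG version adds a "cvpred" column (mine did) and stores the overall mean square in an "ms" attribute:

out <- CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR))

cv_ms <- mean((out$Height - out$cvpred)^2)  # overall CV mean square across all observations
cv_rmse <- sqrt(cv_ms)  # overall CV RMSE
cv_ms
cv_rmse
attr(out, "ms")  # should match the printed overall ms, if your DAAG version sets this attribute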

It appears the warning we received is common too; it also shows up in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ

One thing I can suggest that may be useful to you: in the case of linear regression there is a closed-form solution for leave-one-out cross-validation (LOOCV), so you can obtain it without actually fitting multiple models.

predictedresiduals <- residuals(fit) / (1 - lm.influence(fit)$hat)  # closed-form LOOCV residuals via the hat values
PRESS <- sum(predictedresiduals^2)
PRESS  # predicted residual sum of squares
fitanova <- anova(fit)  # ANOVA table, used to get the total sum of squares
tss <- sum(fitanova$"Sum Sq")  # total sum of squares
predrsquared <- 1 - PRESS / tss  # predictive R-squared
predrsquared

Notice this value is 0.574, versus the original R-squared of 0.6422.
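If you want to convince yourself that the hat-value shortcut really does equal leave-one-out cross-validation, you can compare it against the brute-force version that refits the model n times. A sketch using the built-in mtcars data as a stand-in for your data set:

fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model on a built-in data set
n <- nrow(mtcars)
loo_resid <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])  # refit without row i
  loo_resid[i] <- mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])  # residual on the held-out row
}
shortcut <- residuals(fit2) / (1 - lm.influence(fit2)$hat)  # closed-form LOOCV residuals
all.equal(unname(shortcut), loo_resid)  # TRUE: the two match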

To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:

hist(predictedresiduals)

RMSE can then be calculated, approximately, as:

sd(predictedresiduals)
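Strictly speaking, sd() centres the residuals and divides by n - 1, so it only approximates RMSE; the two agree closely here because the mean predicted residual is near zero. The exact calculation is:

sqrt(mean(predictedresiduals^2))  # RMSE: root of the mean squared predicted residual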