user4144725 - 3 years ago 175

R Question

I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:

https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA

I used the following code:

`library(DAAG)`

six = read.csv("six.csv") #opening file

fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model

summary(fit) # show results

CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation

This produces the following output (Summarized version)

`Sum of squares = 7.37 Mean square = 1.47 n = 5`

Overall (Sum over all 5 folds)

ms

3.75

Warning message:

In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :

As there is >1 explanatory variable, cross-validation

predicted values for a fold are not a linear function

of corresponding overall predicted values. Lines that

are shown for the different folds are approximate

I do not understand what the

I also don't understand why the results generated produce a

If anyone can help it would be much appreciated.

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (got 3.67):

```
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
```

Notice the average as well as most folds are higher than "Residual standard error" (1.814). This is what we would expect as the CV error represents model performance likely on "test" data (not data used to trained the model). For instance on Fold 10, notice the residuals calculated are on the predicted observations (5 observations) that were not used in the training for that model:

```
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
```

It appears this warning we received may be common too -- also saw it in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ

One thing I can suggest that may be useful to you is that in the case of linear regression, there is a closed form solution for leave-one-out-cross-validation (loocv) without actually fitting multiple models.

```
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
```

Notice this value is 0.574 vs. the original Rsquared of 0.6422

To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:

```
hist(predictedresiduals)
```

RMSE can then calculated simply as:

```
sd(predictedresiduals)
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**