Rory78 - 1 year ago 78
R Question

# k-fold cross-validation to find misclassification rate

I am trying to write a function which takes a binary response variable y and a single explanatory variable x, runs 10-fold cross validation and returns the proportion of the response variables y that are incorrectly classified.
I then want to run this function to find out which of the explanatory variables are best at predicting low birth weight.

So far I have tried

``````attach(birthwt)
folds <- cut(seq(1,nrow(birthwt)),breaks=10,labels=FALSE) #Create 10 equally size folds
bwt<-glm(low~age+lwt+race+smoke+ptl+ht+ui+ftv,family=binomial)
``````

But I really don't know where to go from here..

Thank you

For each group, you need to construct a training and test data set, fit the model on the training set, and then use the `predict.glm` function to do the actual predictions in the test data. The comments below should be enough explanation.

``````library("MASS")   # for the dataset
library("dplyr")  # for the filter function
data(birthwt)
set.seed(1)

n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))),  breaks=n.folds, labels=FALSE) # Note!

all.confustion.tables <- list()
for (i in seq_len(n.folds)) {
# Create training set
train <- filter(birthwt, folds != i)  # Take all other samples than i

# Create test set
test <- filter(birthwt, folds == i)

# Fit the glm model on the training set
glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
family = binomial, data =  train)

# Use the fitted model on the test set
logit.prob <- predict(glm.train, newdata = test)

# Classifiy based on the predictions
pred.class <- ifelse(logit.prob < 0, 0, 1)

# Construct the confusion table
all.confustion.tables[[i]] <- table(pred = pred.class, true = test\$low)
}
``````

Note, that I randomly select the rows to group. This is important.

We can then see the prediction versus the true values (given in the confusion tables) for the, say, 5th cross-validation run:

``````all.confustion.tables[[5]]
#    true
# pred 0 1
#    0 7 8
#    1 4 0
``````

Hence, in this particular run, we have 7 true positives, 4 false positives, 8 false negatives, and 0 true negatives.

We can now compute whatever measure of performance we want from our confusion tables.

``````# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x)))/sum(x) }
risk <- sapply(all.confustion.tables, misclassrisk
mean(risk)
#[1] 0.3119883
``````

So, in this case, we have a misclassification risk of about 31 %. I'll leave it to you to wrap this into a function.