Rory78 Rory78 - 1 year ago 78
R Question

k-fold cross-validation to find misclassification rate

I am trying to write a function which takes a binary response variable y and a single explanatory variable x, runs 10-fold cross validation and returns the proportion of the response variables y that are incorrectly classified.
I then want to run this function to find out which of the explanatory variables are best at predicting low birth weight.

So far I have tried

folds <- cut(seq(1,nrow(birthwt)),breaks=10,labels=FALSE) #Create 10 equally size folds

But I really don't know where to go from here..

Thank you

Answer Source

For each group, you need to construct a training and test data set, fit the model on the training set, and then use the predict.glm function to do the actual predictions in the test data. The comments below should be enough explanation.

library("MASS")   # for the dataset
library("dplyr")  # for the filter function

n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))),  breaks=n.folds, labels=FALSE) # Note!

all.confustion.tables <- list()
for (i in seq_len(n.folds)) {
  # Create training set
  train <- filter(birthwt, folds != i)  # Take all other samples than i

  # Create test set
  test <- filter(birthwt, folds == i)

  # Fit the glm model on the training set
  glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                   family = binomial, data =  train)

  # Use the fitted model on the test set
  logit.prob <- predict(glm.train, newdata = test)

  # Classifiy based on the predictions
  pred.class <- ifelse(logit.prob < 0, 0, 1)

  # Construct the confusion table
  all.confustion.tables[[i]] <- table(pred = pred.class, true = test$low)

Note, that I randomly select the rows to group. This is important.

We can then see the prediction versus the true values (given in the confusion tables) for the, say, 5th cross-validation run:

#    true
# pred 0 1
#    0 7 8
#    1 4 0

Hence, in this particular run, we have 7 true positives, 4 false positives, 8 false negatives, and 0 true negatives.

We can now compute whatever measure of performance we want from our confusion tables.

# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x)))/sum(x) }
risk <- sapply(all.confustion.tables, misclassrisk
#[1] 0.3119883

So, in this case, we have a misclassification risk of about 31 %. I'll leave it to you to wrap this into a function.