Rory78 - 1 year ago 63

R Question

I am trying to write a function which takes a binary response variable y and a single explanatory variable x, runs 10-fold cross validation and returns the proportion of the response variables y that are incorrectly classified.

I then want to run this function to find out which of the explanatory variables are best at predicting low birth weight.

So far I have tried

`attach(birthwt)`

folds <- cut(seq(1,nrow(birthwt)),breaks=10,labels=FALSE) #Create 10 equally size folds

bwt<-glm(low~age+lwt+race+smoke+ptl+ht+ui+ftv,family=binomial)

But I really don't know where to go from here..

Thank you

Answer Source

For each group, you need to construct a training and test data set, fit the model on the training set, and then use the `predict.glm`

function to do the actual predictions in the test data. The comments below should be enough explanation.

```
library("MASS") # for the dataset
library("dplyr") # for the filter function
data(birthwt)
set.seed(1)
n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))), breaks=n.folds, labels=FALSE) # Note!
all.confustion.tables <- list()
for (i in seq_len(n.folds)) {
# Create training set
train <- filter(birthwt, folds != i) # Take all other samples than i
# Create test set
test <- filter(birthwt, folds == i)
# Fit the glm model on the training set
glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
family = binomial, data = train)
# Use the fitted model on the test set
logit.prob <- predict(glm.train, newdata = test)
# Classifiy based on the predictions
pred.class <- ifelse(logit.prob < 0, 0, 1)
# Construct the confusion table
all.confustion.tables[[i]] <- table(pred = pred.class, true = test$low)
}
```

Note, that I randomly select the rows to group. This is important.

We can then see the prediction versus the true values (given in the confusion tables) for the, say, 5th cross-validation run:

```
all.confustion.tables[[5]]
# true
# pred 0 1
# 0 7 8
# 1 4 0
```

Hence, in this particular run, we have 7 true positives, 4 false positives, 8 false negatives, and 0 true negatives.

We can now compute whatever measure of performance we want from our confusion tables.

```
# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x)))/sum(x) }
risk <- sapply(all.confustion.tables, misclassrisk
mean(risk)
#[1] 0.3119883
```

So, in this case, we have a misclassification risk of about 31 %. I'll leave it to you to wrap this into a function.