user7090012 user7090012 - 1 month ago 17
R Question

`predict` error while doing n-fold cross-validation for my GLM

I am running this function to do n-fold cross-validation. The misclassification rate does not vary over folds, e.g. if I run 10 or 50. I am also getting a warning:


"Warning message:

'newdata' had 19 rows but variables found have 189 rows"


If I run the code without being part of a function, it is doing want I want -> e.g. for folds==1, it is pulling out 10%, running the model on 90% of the data, and predicting the other 10%.
Does anyone have any ideas as to why it is not showing variation by variable and the number of folds?

library("MASS")
data(birthwt)
data=birthwt

n.folds=10

jim = function(x,y,n.folds,data){

for(i in 1:n.folds){
folds <- cut(seq(1,nrow(data)),breaks=n.folds,labels=FALSE)
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- data[testIndexes, ]
trainData <- data[-testIndexes, ]
glm.train <- glm(y ~ x, family = binomial, data=trainData)
predictions=predict(glm.train, newdata =testData, type='response')
pred.class=ifelse(predictions< 0, 0, 1)
}

rate=sum(pred.class!= y) / length(y)
print(head(rate))
}

jim(birthwt$smoke, birthwt$low, 10, birthwt)

Answer

I am now making my comments into an answer.

jim <- function(x, y, n.folds, data) {   

  pred.class <- numeric(0)  ## initially empty; accumulated later
  for(i in 1:n.folds){
    folds <- cut(seq(1,nrow(data)), breaks = n.folds, labels = FALSE)  
    testIndexes <- which(folds == i)  ## no need for `arr.ind = TRUE`
    testData <- data[testIndexes, ]
    trainData <- data[-testIndexes, ]
    ## `reformulate` constructs formula from strings. Read `?reformulate`
    glm.train <- glm(reformulate(x, y), family = binomial, data = trainData)
    predictions <- predict(glm.train, newdata = testData, type = 'response')
    ## accumulate the result using `c()`
    ## change `predictions < 0` to `predictions < 0.5` as `type = response`
    pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1))
    }

  ## to access a column with string, use `[[]]` not `$`
  rate <- sum(pred.class!= data[[y]]) / length(data[[y]])
  rate  ## or `return(rate)`
  }

jim("smoke", "low", 10, birthwt)
# [1] 0.3121693

Remark:

  1. No need to put arr.ind = TRUE here, although it has no side-effect.
  2. There is something wrong with your classification. You set type = "response", then you use ifelse(predictions < 0, 0, 1). Think about it, you always get 1 for pred.class.
  3. Each iteration of your for loop overwrites the pred.class. I think you want to accumulate the result. So do pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1));
  4. Wrong use of glm and predict. It is wrong to put $ in model formula. Please read Predict() - Maybe I'm not understanding it. Here, I have changed your function to accept variable names (as a string), and use proper model formula inside glm. Note, this change requires to place y with data[[y]] in rate = sum(pred.class!= y) / length(y).
  5. You probably want to return rate rather than just printing it to screen. So replace your print line by explicit return(rate), or implicit rate.
  6. You can replace ifelse(predictions < 0.5, 0, 1) with as.integer(predictions < 0.5), although I did not change it in above.
Comments