user7090012 - 9 months ago 53

R Question

I am running this function to do n-fold cross-validation. The misclassification rate does not vary over folds, e.g. if I run 10 or 50. I am also getting a warning:

"Warning message:

'newdata' had 19 rows but variables found have 189 rows"

If I run the code without being part of a function, it is doing want I want -> e.g. for folds==1, it is pulling out 10%, running the model on 90% of the data, and predicting the other 10%.

Does anyone have any ideas as to why it is not showing variation by variable and the number of folds?

`library("MASS")`

data(birthwt)

data=birthwt

n.folds=10

jim = function(x,y,n.folds,data){

for(i in 1:n.folds){

folds <- cut(seq(1,nrow(data)),breaks=n.folds,labels=FALSE)

testIndexes <- which(folds==i,arr.ind=TRUE)

testData <- data[testIndexes, ]

trainData <- data[-testIndexes, ]

glm.train <- glm(y ~ x, family = binomial, data=trainData)

predictions=predict(glm.train, newdata =testData, type='response')

pred.class=ifelse(predictions< 0, 0, 1)

}

rate=sum(pred.class!= y) / length(y)

print(head(rate))

}

jim(birthwt$smoke, birthwt$low, 10, birthwt)

Answer Source

I am now making my comments into an answer.

```
jim <- function(x, y, n.folds, data) {
pred.class <- numeric(0) ## initially empty; accumulated later
for(i in 1:n.folds){
folds <- cut(seq(1,nrow(data)), breaks = n.folds, labels = FALSE)
testIndexes <- which(folds == i) ## no need for `arr.ind = TRUE`
testData <- data[testIndexes, ]
trainData <- data[-testIndexes, ]
## `reformulate` constructs formula from strings. Read `?reformulate`
glm.train <- glm(reformulate(x, y), family = binomial, data = trainData)
predictions <- predict(glm.train, newdata = testData, type = 'response')
## accumulate the result using `c()`
## change `predictions < 0` to `predictions < 0.5` as `type = response`
pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1))
}
## to access a column with string, use `[[]]` not `$`
rate <- sum(pred.class!= data[[y]]) / length(data[[y]])
rate ## or `return(rate)`
}
jim("smoke", "low", 10, birthwt)
# [1] 0.3121693
```

Remark:

- No need to put
`arr.ind = TRUE`

here, although it has no side-effect. - There is something wrong with your classification. You set
`type = "response"`

, then you use`ifelse(predictions < 0, 0, 1)`

. Think about it, you always get 1 for`pred.class`

. - Each iteration of your
`for`

loop overwrites the`pred.class`

. I think you want to accumulate the result. So do`pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1))`

; - Wrong use of
`glm`

and`predict`

. It is wrong to put`$`

in model formula. Please read Predict() - Maybe I'm not understanding it. Here, I have changed your function to accept variable names (as a string), and use proper model formula inside`glm`

. Note, this change requires to place`y`

with`data[[y]]`

in`rate = sum(pred.class!= y) / length(y)`

. - You probably want to return
`rate`

rather than just printing it to screen. So replace your`print`

line by explicit`return(rate)`

, or implicit`rate`

. - You can replace
`ifelse(predictions < 0.5, 0, 1)`

with`as.integer(predictions < 0.5)`

, although I did not change it in above.