user3096418 - 2 months ago
R Question

Finding misclassification rate in R...loan default

I have a data set of 10,000 consumer loans and built a model to predict whether a person will default. The response variable is 1 (default) or 0 (did not default). I used step() to select a glm model on a training set (8,000 rows), but my task is to determine how well the model predicts default on the test set (2,000 rows). R is spitting out huge numbers when I try to get the error rate.

My regression has the response Y and 6 x variables. This is how I'm trying to get the error rate:

preddreg <- predict(dreg, newdata=test, type="response")
predfull <- predict(full, newdata=test, type="response")
errorreg <- (test,1) - (preddreg = 1)
errorfull <- (test,1) - (predfull = 1)

mean(abs(errorreg))
##I keep getting 37, it should be a small decimal in the .20 range
mean(abs(errorfull))
##I get the same huge number


Is there an easier way to check a model against a test set and get the misclassification rate? I'm pulling my hair out and have spent 10 hours trying to get a reasonable error rate.

Answer

The syntax preddreg = 1 doesn't make a lot of sense here: in R a single = is assignment, not comparison, and with type="response" the predictions are probabilities between 0 and 1 rather than 0/1 classes. If you're going for misclassification rate, you need to set a threshold for the predicted probabilities. Here's how to get the misclassifications for the dreg model, using a threshold of 0.5. I assume default is the name of your outcome variable (I couldn't tell the name from reading your post):

preddreg <- predict(dreg, newdata=test, type="response")

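# Rows are the true outcome, columns are the prediction at threshold 0.5.
# Fixing the factor levels keeps the table 2x2 even if every prediction
# falls on one side of the threshold, so diag() stays meaningful.
tab <- table(factor(test$default, levels = c(0, 1)),
             factor(as.integer(preddreg >= 0.5), levels = c(0, 1)))
tab   # Display the confusion matrix
accuracy.reg <- sum(diag(tab)) / sum(tab)
accuracy.reg      # Accuracy
1 - accuracy.reg  # Misclassification rate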
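
If you want the misclassification rate directly and for both models, here is a minimal sketch. I don't have your data, so the loans data frame, the predictors x1, x2, x3 and the two model formulas are simulated stand-ins, and misclass_rate() is a hypothetical helper name, not something from a package. Only the names dreg, full, test and the assumed outcome column default mirror the question.

```r
# Simulated stand-in for the loan data (hypothetical; not the real data set)
set.seed(1)
n <- 1000
loans <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
loans$default <- rbinom(n, 1, plogis(-1 + 1.5 * loans$x1 - loans$x2))

# 80/20 train/test split, mirroring the 8000/2000 split in the question
train <- loans[1:800, ]
test  <- loans[801:1000, ]

# Stand-ins for the full model and the reduced model found by step()
full <- glm(default ~ x1 + x2 + x3, data = train, family = binomial)
dreg <- glm(default ~ x1 + x2, data = train, family = binomial)

# Hypothetical helper: share of rows where the thresholded prediction
# disagrees with the observed 0/1 outcome
misclass_rate <- function(model, newdata, actual, threshold = 0.5) {
  prob <- predict(model, newdata = newdata, type = "response")
  pred_class <- as.integer(prob >= threshold)
  mean(pred_class != actual)
}

rate_dreg <- misclass_rate(dreg, test, test$default)
rate_full <- misclass_rate(full, test, test$default)
c(dreg = rate_dreg, full = rate_full)
```

Because mean() of a logical vector is the proportion of TRUE values, the result is always between 0 and 1 and equals one minus the accuracy at the same threshold. The 0.5 cutoff is only a default; with imbalanced defaults a different threshold may be more sensible.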