Lawrence303 Lawrence303 - 3 months ago 43
R Question

Why is predict.glmnet not predicting probabilities?

I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:

set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1

OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)


I then used
glmnet
to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).

library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)

test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")


Here are the predictions:

> preds
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051

> cv.preds
1
1 0.1568839
2 0.3630938
3 0.7435941
4 0.4808428
5 0.5261076
6 -0.1431655
7 0.4123054
8 0.2207381
9 -0.1446941
10 0.2962391


I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.


  1. Why are the predictions from
    predict.glmnet
    (the
    preds
    vector) not in the form of probabilities? I put the
    preds
    values through the inverse logit function and got reasonable probabilities. Was that correct?

  2. The predictions from
    predict.cv.glmnet
    (the
    cv.preds
    vector) mostly look like probabilities, but some of them are negative. Why is this?

  3. When I use the
    glmnet
    function to create the glmmod object, I include the
    family="binomial"
    argument to indicate that I'm using logistic regression. However, when I use the
    cv.glmnet
    function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?

  4. Similarly, when I use the
    predict.cv.glmnet
    function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?


Answer

I am not 100% sure on the following because the package does seem to operate counter to its documentation, as you've noticed, but it may produce some indication whether your thinking is along the right path.

Question 1

Yes, you're right. Note that,

> predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
            1
1  -3.2589440
2  -0.4435265
3   3.9646670
4   0.3772816
5   0.9952887
6  -7.3555661
7   0.2283675
8  -2.3871317
9  -8.1632749
10 -1.3563051

which is the same output as type="response". Thus, putting it through the inverse logit function would be the right way to get the probabilities. As to why is this happening, I have not clue -perhaps a bug.

Question 2...4

For the cv.preds, you're getting something along the lines of probabilities because you're fitting a Gaussian link. In order to fit a logit link, you should specify the family parameter. Namely:

cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")

> cv.preds
            1
1  -10.873290
2    1.299113
3   15.812671
4    3.622259
5    5.621857
6  -24.826551
7    1.734000
8   -5.420878
9  -26.160403
10  -4.496020

In this case, cv.preds will output along the real line and you can put those values through the inverse logit to get the probabilities.

Comments