Lawrence303 - 4 months ago 53

R Question

I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:

`set.seed(1)`

OBP <- rnorm(50, mean=1, sd=.2)

HR.PCT <- rnorm(50, mean=1, sd=.2)

AGE <- rnorm(50, mean=21, sd=1)

CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)

CONF <- factor(CONF, levels=c("A","B","C","D","E"))

df.train <- data.frame(OBP, HR.PCT, AGE, CONF)

df.train <- df.train[order(-OBP),]

df.train$MADE.MAJORS <- 0

df.train$MADE.MAJORS[1:10] <- 1

OBP <- rnorm(10, mean=1, sd=.2)

HR.PCT <- rnorm(10, mean=1, sd=.2)

AGE <- rnorm(10, mean=21, sd=1)

CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)

CONF <- factor(CONF, levels=c("A","B","C","D","E"))

MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))

df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)

I then used

`glmnet`

`library(glmnet)`

train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])

glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")

cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)

test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])

preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")

cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")

Here are the predictions:

`> preds`

1

1 -3.2589440

2 -0.4435265

3 3.9646670

4 0.3772816

5 0.9952887

6 -7.3555661

7 0.2283675

8 -2.3871317

9 -8.1632749

10 -1.3563051

> cv.preds

1

1 0.1568839

2 0.3630938

3 0.7435941

4 0.4808428

5 0.5261076

6 -0.1431655

7 0.4123054

8 0.2207381

9 -0.1446941

10 0.2962391

I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.

- Why are the predictions from (the
`predict.glmnet`

vector) not in the form of probabilities? I put the`preds`

values through the inverse logit function and got reasonable probabilities. Was that correct?`preds`

- The predictions from (the
`predict.cv.glmnet`

vector) mostly look like probabilities, but some of them are negative. Why is this?`cv.preds`

- When I use the function to create the glmmod object, I include the
`glmnet`

argument to indicate that I'm using logistic regression. However, when I use the`family="binomial"`

function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?`cv.glmnet`

- Similarly, when I use the function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?
`predict.cv.glmnet`

Answer

I am not 100% sure on the following because the package does seem to operate counter to its documentation, as you've noticed, but it may produce some indication whether your thinking is along the right path.

Yes, you're right. Note that,

```
> predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
```

which is the same output as `type="response"`

. Thus, putting it through the inverse logit function would be the right way to get the probabilities. As to why is this happening, I have not clue -perhaps a bug.

For the `cv.preds`

, you're getting something along the lines of probabilities because you're fitting a Gaussian link. In order to fit a logit link, you should specify the `family`

parameter. Namely:

```
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")
> cv.preds
1
1 -10.873290
2 1.299113
3 15.812671
4 3.622259
5 5.621857
6 -24.826551
7 1.734000
8 -5.420878
9 -26.160403
10 -4.496020
```

In this case, `cv.preds`

will output along the real line and you can put those values through the inverse logit to get the probabilities.