Dan Scott - 3 years ago 105
R Question

Determining Interpretation Direction of Logistic Model in R

I am trying to run a logistic regression to predict a variable called has_sed (binary, describes whether a sample has sediment or not, coded as 0 = does not have sediment and 1 = has sediment). See the summary output of this model below:

``````Call:
glm(formula = has_sed ~ vw + ws_avg + s, family = binomial(link = "logit"),
data = spdata_ss)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.4665  -0.8659  -0.6325   1.1374   2.3407

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.851966   0.667291   1.277 0.201689
vw          -0.118140   0.031092  -3.800 0.000145 ***
ws_avg      -0.015815   0.008276  -1.911 0.055994 .
s            0.034471   0.019216   1.794 0.072827 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 296.33  on 241  degrees of freedom
Residual deviance: 269.91  on 238  degrees of freedom
AIC: 277.91

Number of Fisher Scoring iterations: 4
``````

Now, I understand how to interpret a logistic model output like this in general, but I don't understand how R chooses the direction (may be a better word for that) of my dependent variable. How do I know if a unit increase in vw increases the log odds of a sample having sediment, or increases the log odds of that sample not having sediment (i.e., has_sed = 0 vs has_sed = 1)?

I plotted out each of these relationships with boxplots, and the sign on the estimates in the logistic model output looks reversed from what I'm seeing in the boxplots. So, does R calculate the log odds of has_sed being 0, or the log odds of it being 1?

This is best illustrated with an example, I will use iris data with two classes

``````data(iris)
iris2 = iris[iris\$Species!="setosa",]
iris2\$Species = factor(iris2\$Species)
levels(iris2\$Species)
#output[1] "versicolor" "virginica"
``````

Lets make a glm

``````   model = glm(Species ~ Petal.Length, data = iris2, family = binomial(link = "logit"))
summary(model)
#truncated output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -43.781     11.110  -3.941 8.12e-05 ***
Petal.Length    9.002      2.283   3.943 8.04e-05 ***

library(ggplot2)

ggplot(iris2)+
geom_boxplot(aes(x = Species, y = Petal.Length))
``````

Chances of being "virginica" rise with increasing Petal.Length, the reference level was "versicolor" - the first level when we did `levels(iris2\$Species)`.

Lets change it

``````iris2\$Species = relevel(iris2\$Species, ref = "virginica")
levels(iris2\$Species)
#output
[1] "virginica"  "versicolor"

model2 = glm(Species ~ Petal.Length, data = iris2, family = binomial(link = "logit"))
summary(model2)
#truncated output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    43.781     11.110   3.941 8.12e-05 ***
Petal.Length   -9.002      2.283  -3.943 8.04e-05 ***
``````

Now the reference level is "virginica" the first level in `levels(iris2\$Species)`. Chances of being "versicolor" drop with increasing Petal.Length.

In short the order of levels in your response variable determines the reference level for treatment contrasts.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download