Gerd Marvin - 7 months ago 33

R Question

Assume I have a dataset containing two categorical predictor variables (a,b) and a binary target (y) variable.

```
df <- data.frame(
  a = factor(c("cat1","cat2","cat3","cat1","cat2")),
  b = factor(c("cat1","cat1","cat3","cat2","cat2")),
  y = factor(c(T,F,T,F,T))
)
```

The following logical relations exist in the data:

```
if (a = cat3) then (b = cat3 and y = true)
else if (a = b) then (y = true) else y = false
```
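The stated relations can be checked directly against the example data. This is a minimal sketch that rebuilds the example `df` so it runs on its own; the names `rule1`, `rule2` and `expected` are introduced here for illustration only.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Work on plain vectors so factors are never compared to each other directly
a <- as.character(df$a)
b <- as.character(df$b)
y <- as.logical(as.character(df$y))

# Rule 1: a == "cat3" implies b == "cat3" and y is TRUE
rule1 <- all(b[a == "cat3"] == "cat3" & y[a == "cat3"])

# Rule 2: otherwise y is TRUE exactly when a equals b
expected <- ifelse(a == "cat3", TRUE, a == b)
rule2 <- all(expected == y)

c(rule1 = rule1, rule2 = rule2)
```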

I want to use `glm` [...] `glm` [...] `alias` [...]

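However, it can happen, as in the dataset above, that a linear dependency exists between one dummy (indicator) column generated for variable a and one generated for variable b. Here `a` is `cat3` in exactly the same rows where `b` is `cat3`, so the two corresponding columns are identical.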
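The dependency can be seen without fitting anything by inspecting the design matrix. The sketch below assumes the example `df` (rebuilt so the block is self-contained) and uses base R's `model.matrix` and `qr`; the variable name `X` is only for illustration.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Design matrix with treatment-coded dummies for both predictors
X <- model.matrix(~ a + b, data = df)
colnames(X)   # "(Intercept)" "acat2" "acat3" "bcat2" "bcat3"

# The rank is lower than the number of columns, so one column is redundant
ncol(X)
qr(X)$rank

# The redundant pair: the dummies for a == "cat3" and b == "cat3" coincide
all(X[, "acat3"] == X[, "bcat3"])
```

When the rank is below the column count, `glm` reports the surplus coefficients as `NA`.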

See the output of my model:

```
> model <- glm(y ~ ., family=binomial(link='logit'), data=df)
> summary(model)
...
Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.965e-16  1.732e+00   0.000    1.000
acat2       -2.396e-16  2.000e+00   0.000    1.000
acat3        1.857e+01  6.523e+03   0.003    0.998
bcat2        0.000e+00  2.000e+00   0.000    1.000
bcat3               NA         NA      NA       NA   # <- get rid of this?
```

How should I handle this case?

Is there a way to tell glm to omit some of the generated reference codes?

In the real problem my `"cat3"` [...] `NA` [...] `NA` [...]

The accepted answer solves the question; however, in this specific case the singularities can simply be ignored, as pointed out in the comments.
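To see why the singularity is harmless here, it may help to look at which coefficient was dropped and what it is aliased with. This is a hedged sketch, not the asker's code: it assumes the example `df` and uses `alias()` from base R's stats package, which also accepts `glm` fits; `aliased` and `al` are illustrative names.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

model <- glm(y ~ ., family = binomial(link = "logit"), data = df)

# Coefficients that glm could not estimate are reported as NA
aliased <- names(coef(model))[is.na(coef(model))]
aliased

# alias() expresses each dropped column in terms of the retained ones
al <- alias(model)
al$Complete

# Fitted values are still available; the NA coefficient is simply left out
fitted(model)
```

The row for `bcat3` in `al$Complete` shows how that column is built from the retained columns.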

Answer

You could run it twice, removing the redundant model matrix columns on the second run:

```
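model <- glm(y ~ ., family=binomial(link='logit'), data=df) # as in question
# keep only the model matrix columns whose coefficients were estimated (not NA)
mm <- model.matrix(model)[, !is.na(coef(model)) ]
# rebuild the data, dropping the intercept column (glm adds its own)
df0 <- data.frame(y = df$y, mm[, -1])
# refit with the same formula (y ~ .) on the reduced data
update(model, data = df0)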
```

giving:

```
Call:  glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)

Coefficients:
(Intercept)        acat2        acat3        bcat2
  1.965e-16   -2.396e-16    1.857e+01    0.000e+00

Degrees of Freedom: 4 Total (i.e. Null);  1 Residual
Null Deviance:     6.73
Residual Deviance: 5.545    AIC: 13.55
```
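As a sanity check, the refit can be compared with the original fit. The sketch below repeats the answer's steps on the example `df` and stores the second fit as `model0` (a name chosen here for illustration). It then checks that the fitted values and deviance agree, which is expected because the dropped column carried no extra information.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# First pass: fit with all dummy columns; one coefficient comes back NA
model <- glm(y ~ ., family = binomial(link = "logit"), data = df)

# Second pass: keep only the estimable columns and refit
mm <- model.matrix(model)[, !is.na(coef(model))]
df0 <- data.frame(y = df$y, mm[, -1])
model0 <- update(model, data = df0)

# The reduced model has no NA coefficients ...
anyNA(coef(model0))

# ... and reproduces the original fit
all.equal(unname(fitted(model)), unname(fitted(model0)))
all.equal(deviance(model), deviance(model0))
```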

Source (Stackoverflow)