Gerd Marvin - 21 days ago
R Question

How to ignore linearly correlated variables introduced by factor reference cell coding

Assume I have a dataset containing two categorical predictor variables (a,b) and a binary target (y) variable.

df <- data.frame(
  a = factor(c("cat1","cat2","cat3","cat1","cat2")),
  b = factor(c("cat1","cat1","cat3","cat2","cat2")),
  y = factor(c(TRUE,FALSE,TRUE,FALSE,TRUE))
)


The following logical relations exist in the data:

if (a = cat3) then (b = cat3 and y = true)
else if (a = b) then (y = true) else y = false


I want to use glm to build a model for my dataset. glm will automatically apply reference cell coding on my categorical variables a and b. It will also take care of finding the right number of codes for each factor variable, so that no alias variables are introduced (explained here).

However it can happen, as for the dataset above, that a linear relationship exists between one reference code generated for variable a and one reference code of variable b.
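The relationship can be checked directly on the design matrix. The following is only a sketch on the toy data from this question; the names X, same_column, rank_X and n_cols are made up for illustration. It builds the matrix that glm would use and compares the two cat3 dummy columns:

```r
# Same toy data as in the question.
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Design matrix for the predictors, using the default treatment
# (reference cell) contrasts.
X <- model.matrix(~ a + b, data = df)

# Do the dummy columns for a == "cat3" and b == "cat3" agree in every row?
same_column <- all(X[, "acat3"] == X[, "bcat3"])

# Compare the numerical rank with the number of columns; a smaller
# rank means at least one column is a linear combination of the others.
rank_X <- qr(X)$rank
n_cols <- ncol(X)

print(same_column)
print(c(rank = rank_X, columns = n_cols))
```

If the rank is smaller than the number of columns, glm cannot estimate all coefficients and reports the surplus ones as NA.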

See the output of my model:

> model <- glm(y ~ ., family=binomial(link='logit'), data=df)
> summary(model)
...
Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.965e-16  1.732e+00   0.000    1.000
acat2       -2.396e-16  2.000e+00   0.000    1.000
acat3        1.857e+01  6.523e+03   0.003    0.998
bcat2        0.000e+00  2.000e+00   0.000    1.000
bcat3               NA         NA      NA       NA  # <- get rid of this?


How should I handle this case?
Is there a way to tell glm to omit some of the generated reference codes?
In the real problem my "cat3" value corresponds to NA. I have two meaningful factor variables which are NA in exactly the same instances of my dataset.

EDIT:

The accepted answer solves the question. However, in this specific case the singularities can simply be ignored, as pointed out in the comments.
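A rough way to see why ignoring the NA coefficient is harmless here, sketched on the toy data (the names X, keep and eta are illustrative only): the fitted linear predictor can be reproduced from the estimable coefficients alone, so the aliased column contributes nothing to the fit.

```r
# Same toy data and model as in the question.
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)
model <- glm(y ~ ., family = binomial(link = "logit"), data = df)

X <- model.matrix(model)
keep <- !is.na(coef(model))   # TRUE for coefficients that were estimated

# Linear predictor rebuilt from the estimable columns only.
eta <- drop(X[, keep, drop = FALSE] %*% coef(model)[keep])

# Compare with what glm stored; also check the fitted values have no NA.
print(all.equal(unname(eta), unname(model$linear.predictors)))
print(anyNA(fitted(model)))
```

This only looks at the fitted values on the training data; it makes no claim about how downstream functions treat the NA entry.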

Answer

You could run it twice, removing the redundant model matrix columns on the second run:

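model <- glm(y ~ ., family=binomial(link='logit'), data=df) # as in question

# keep only the model matrix columns whose coefficients could be estimated
mm <- model.matrix(model)[, !is.na(coef(model)) ]
# drop the intercept column (glm adds its own) and rebuild the data frame
df0 <- data.frame(y = df$y, mm[, -1])
# refit the same formula (y ~ .) on the reduced data
update(model, data = df0)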

giving:

Call:  glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)

Coefficients:
(Intercept)        acat2        acat3        bcat2  
  1.965e-16   -2.396e-16    1.857e+01    0.000e+00  

Degrees of Freedom: 4 Total (i.e. Null);  1 Residual
Null Deviance:      6.73 
Residual Deviance: 5.545        AIC: 13.55
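The two-pass approach can be wrapped in a small function. This is a sketch only: refit_without_aliased is a hypothetical helper, not part of base R, and it assumes a simple vector or factor response (not a two-column cbind response) and no offset or weights. It uses drop = FALSE so the matrix keeps its column names even if only one predictor column survives.

```r
# Hypothetical helper: refit a glm using only the estimable columns.
refit_without_aliased <- function(model) {
  keep <- !is.na(coef(model))
  mm <- model.matrix(model)[, keep, drop = FALSE]
  # Remove the intercept column; glm adds its own intercept on refit.
  mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
  response <- model.response(model.frame(model))
  df0 <- data.frame(y = response, mm)
  glm(y ~ ., family = family(model), data = df0)
}

# Usage on the toy data from the question.
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)
model <- glm(y ~ ., family = binomial(link = "logit"), data = df)
fit <- refit_without_aliased(model)
print(coef(fit))
```

The refitted model works on dummy columns rather than the original factors, so predictions on new data need the same model matrix columns to be built first.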