Gerd Marvin - 1 year ago 65
R Question

# How to ignore linearly correlated variables introduced by factor reference cell coding

Assume I have a dataset containing two categorical predictor variables (a,b) and a binary target (y) variable.

``````
df <- data.frame(
  a = factor(c("cat1","cat2","cat3","cat1","cat2")),
  b = factor(c("cat1","cat1","cat3","cat2","cat2")),
  y = factor(c(TRUE,FALSE,TRUE,FALSE,TRUE))
)
``````

The following logical relations exist in the data:

``````
if (a = cat3) then (b = cat3 and y = true)
else if (a = b) then (y = true) else (y = false)
``````
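As a quick sanity check, the relations above can be tested against the example data. This is only a sketch: the helper names `b_rule`, `expected` and `y_rule` are made up for illustration.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Rule 1: whenever a is "cat3", b is "cat3" as well.
b_rule <- all(df$b[df$a == "cat3"] == "cat3")

# Rule 2: y is true if a is "cat3", otherwise y is true exactly when a equals b.
expected <- ifelse(df$a == "cat3",
                   TRUE,
                   as.character(df$a) == as.character(df$b))
y_rule <- all(expected == (df$y == "TRUE"))
```

Both `b_rule` and `y_rule` should be `TRUE` for the five rows shown.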

I want to use `glm` to build a model for my dataset. `glm` will automatically apply reference cell coding on my categorical variables `a` and `b`. It will also take care of finding the right number of codes for each factor variable, so that no `alias` variables are introduced (explained here).

However, as in the dataset above, a reference code generated for variable `a` can be linearly dependent on a reference code generated for variable `b`. Here the dummy column for `a = cat3` is identical to the one for `b = cat3`.
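The dependency can be seen directly in the design matrix, without fitting anything. The following is a minimal sketch using `model.matrix` and `qr` from base R; the variable names `X`, `same_cat3`, `n_cols` and `x_rank` are only illustrative.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Design matrix for y ~ a + b with the default treatment (reference cell) coding.
X <- model.matrix(~ a + b, data = df)

# The dummy columns for a == "cat3" and b == "cat3" coincide row by row.
same_cat3 <- all(X[, "acat3"] == X[, "bcat3"])

# The matrix has five columns, but its rank is lower, so one
# coefficient cannot be estimated.
n_cols <- ncol(X)
x_rank <- qr(X)$rank
```

With this data the matrix has five columns and rank four, which is why `glm` reports one coefficient as not defined.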

See the output of my model:

``````
> model <- glm(y ~ ., family=binomial(link='logit'), data=df)
> summary(model)
...
Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.965e-16  1.732e+00   0.000    1.000
acat2       -2.396e-16  2.000e+00   0.000    1.000
acat3        1.857e+01  6.523e+03   0.003    0.998
bcat2        0.000e+00  2.000e+00   0.000    1.000
bcat3               NA         NA      NA       NA # <- get rid of this?
``````

How should I handle this case?
Is there a way to tell `glm` to omit some of the generated reference codes?
In the real problem, my `"cat3"` value corresponds to `NA`. I have two meaningful factor variables that are `NA` in exactly the same instances of my dataset.

EDIT:

The checked answer solves the question. In this specific case, however, the singularities can simply be ignored, as pointed out in the comments.
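To illustrate why ignoring the singularity is harmless here, the sketch below compares the original fit with one that never generates the redundant code. The formula using `I(b == "cat2")` is just one assumed way to write such a model, not something taken from the comments, and the names `aliased` and `reduced` are illustrative.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Original fit: one coefficient comes back as NA.
model <- glm(y ~ ., family = binomial(link = "logit"), data = df)
aliased <- names(coef(model))[is.na(coef(model))]

# Fit without the redundant code: only the "cat2" indicator of b is kept.
reduced <- glm(y ~ a + I(b == "cat2"),
               family = binomial(link = "logit"), data = df)

# The NA coefficient does not affect the fitted probabilities.
same_fit <- isTRUE(all.equal(unname(fitted(model)), unname(fitted(reduced))))
```

The aliased column contributes nothing to the linear predictor, so both fits describe the same model and only the printed coefficient table differs.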

You could fit the model twice, removing the redundant model matrix columns on the second run:

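``````
model <- glm(y ~ ., family=binomial(link='logit'), data=df) # as in question

# keep only the model matrix columns whose coefficients could be estimated
mm <- model.matrix(model)[, !is.na(coef(model)) ]
# drop the intercept column; glm adds its own
df0 <- data.frame(y = df$y, mm[, -1])
update(model, data = df0)
``````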

giving:

``````
Call:  glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)

Coefficients:
(Intercept)        acat2        acat3        bcat2
  1.965e-16   -2.396e-16    1.857e+01    0.000e+00

Degrees of Freedom: 4 Total (i.e. Null);  1 Residual
Null Deviance:      6.73
Residual Deviance: 5.545        AIC: 13.55
``````
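If fitting twice is undesirable, a possible variant is to detect the dependent columns from the design matrix before fitting. This is a sketch rather than the approach from the answer above: it relies on the pivoting done by base R's `qr`, assumes the intercept is the first column and is kept, and the names `X_full`, `keep` and `model0` are illustrative.

```r
df <- data.frame(
  a = factor(c("cat1", "cat2", "cat3", "cat1", "cat2")),
  b = factor(c("cat1", "cat1", "cat3", "cat2", "cat2")),
  y = factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Build the design matrix once and find a set of independent columns.
X <- model.matrix(~ a + b, data = df)
qrX <- qr(X)
keep <- sort(qrX$pivot[seq_len(qrX$rank)])  # indices of the retained columns
X_full <- X[, keep, drop = FALSE]

# Drop the intercept column (assumed to be column 1); glm adds its own.
df0 <- data.frame(y = df$y, X_full[, -1, drop = FALSE])
model0 <- glm(y ~ ., family = binomial(link = "logit"), data = df0)
```

On the example data this keeps `acat2`, `acat3` and `bcat2`, so `coef(model0)` has no `NA` entries.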