christoph - 11 days ago 5
R Question

# Why do I get NA coefficients and how does `lm` drop reference level for interaction

I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:

``````df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
y = c(1.4068142116718, 2.67041187927052, 2.69166439169131,
3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572,
5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286,
3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795,
5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801,
5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279,
5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312,
8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id",
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")

lm(y ~ -1 + id + year + year:treatment, df)

#Coefficients:
#             id1               id2               id3               id4
#          2.6585            3.9933            4.1161            5.3544
#             id5             year2  year1:treatment1  year2:treatment1
#          6.1991            0.7149           -0.6317                NA
``````

R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting
`NA`
's in the results.

Also, R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (
`year1`
) in the main effect and interactions. How to force R to omit
`year1`
and
`year1:treatment1`
from the preceding model?

I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.

R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting `NA`s in the results.

There is no causality between those two. You get `NA` purely because your variables have nesting.

R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (`year1`) in the main effect and interactions. How to force R to omit `year1` and `year1:treatment1` from the preceding model?

There is no inconsistency but `model.matrix` has its own rule. You get seemingly "inconsistent" contrasts because you don't have main effect `treatment` but only interaction `treatment:year`.

In the following, I will first explain `NA` coefficients, then the contrasts for interaction, and finally give some suggestions.

# `NA` coefficients

By default, contrast treatment is used for contrasting factor, and by default of `contr.treatement`, the first factor level is taken as reference level. Have a look at your data:

``````levels(df\$id)
# [1] "1" "2" "3" "4" "5"

levels(df\$year)
# [1] "1" "2"

levels(df\$treatment)
# [1] "0" "1"
``````

Now take a look at a simple linear model:

``````lm(y ~ id + year + treatment, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5        year2
#      2.153        1.651        1.773        2.696        3.541        1.094
# treatment1
#         NA
``````

You can see that `id1`, `year1` and `treatment0` are not there, as they are taken as reference.

Even without interaction, you already have an `NA` coefficient. This implies that `treatment` is nested in `span{id, year}`. When you further include `treatment:year`, such nesting still exists.

In fact, a further test shows that `treatment` is nested in `id`:

``````lm(y ~ id + treatment, df)

#    Coefficients:
#(Intercept)          id2          id3          id4          id5   treatment1
#      2.700        1.651        1.773        2.696        3.541           NA
``````

In summary, variable `treatment` is completely redundant for your modelling purpose. If you include `id`, then there is no need to include `treatment` or `treatment:*` where `*` can be any variable.

It is very easy to get nesting when you have many factor variables in a regression model. Note, contrasts will not necessarily remove nesting, as contrast only recognises variable name, but not potential numerical feature. Take the following example for how to cheat `contr.treatment`:

``````df\$ID <- df\$id
lm(y ~ id + ID, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5          ID2
#      2.700        1.651        1.773        2.696        3.541           NA
#        ID3          ID4          ID5
#         NA           NA           NA
``````

Look, contrasts works as expected, but `ID` is nested in `id` so we have rank-deficiency.

# Contrasts for interaction

We first remove the noise imposed by `NA`, by dropping `id` variable. Then, a regression model with `treatment` and `year` will be full-rank, so no `NA` should be seen if contrasts is successful.

Contrasts for interaction, or high-order effects, depends on the presence of low-order effects. Compare the following models:

``````lm(y ~ year:treatment, df)  ## no low-order effects at all

#Coefficients:
#     (Intercept)  treatment0:year1  treatment1:year1  treatment0:year2
#          5.4523           -1.3976           -1.3466           -0.6826
#treatment1:year2
#              NA

lm(y ~ year + treatment + year:treatment, df)  ## with all low-order effects

#Coefficients:
#     (Intercept)        treatment1             year2  treatment1:year2
#         4.05471           0.05094           0.71493           0.63170
``````

In the first model, no contrasts is done, because there is no presence of main effects, but only the 2nd order effect. The `NA` here is due to the absence of contrasts.

Now consider another set of examples, by including some but not all main effects:

``````lm(y ~ year + year:treatment, df)  ## main effect `treatment` is missing

#Coefficients:
#     (Intercept)             year2  year1:treatment1  year2:treatment1
#         4.05471           0.71493           0.05094           0.68264

lm(y ~ treatment + year:treatment, df)  ## main effect `year` is missing

#Coefficients:
#     (Intercept)        treatment1  treatment0:year2  treatment1:year2
#         4.05471           0.05094           0.71493           1.34663
``````

Here, contrasts for interaction will happen to the variable whose main effect is missing. For example, in the first model, main effect `treatment` is missing so interaction drops `treatement0`; while in the second model, main effect `year` is missing so interaction drops `year1`.

# A general guideline

Always include all low-order effects when specifying high-order effects. This not only gives easy-to-understand contrasts behaviour, but also has some other appealing statistical reason. You can read Including the interaction but not the main effects in a model on Cross Validated.

Another suggestion, is to always include intercept. In linear regression, a model with intercept yields unbiased estimate, and residuals will have 0 mean.