Lisa Lisa - 1 month ago 12
R Question

lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to?

I am trying to run a mixed-effects model that predicts

F2_difference
with the rest of the columns as predictors, but I get an error message that says


fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.


From this link, Fixed-effects model is rank deficient, I think I should use
findLinearCombos
in the R package
caret
. However, when I try
findLinearCombos(data.df)
, it gives me the error message


Error in qr.default(object) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In qr.default(object) : NAs introduced by coercion


My data does not have any NAs - What could be causing this? (Sorry if the answer is various obvious - I am new to R).

All of my data are factors except the numerical value that I am trying to predict. Here is a small sample of my data.

sex <- c("f", "m", "f", "m")
nasal <- c("TRUE", "TRUE", "FALSE", "FALSE")
vowelLabel <- c("a", "e", "i", "o")
speaker <- c("Jim", "John", "Ben", "Sally")
word_1 <- c("going", "back", "bag", "back")
type <- c("coronal", "coronal", "labial", "velar")
F2_difference <- c(345.6, -765.8, 800, 900.5)
data.df <- data.frame(sex, nasal, vowelLabel, speaker,
word_1, type, F2_difference
stringsAsFactors = TRUE)


Edit:
Here is some more code, if it helps.

formula <- F2_difference ~ sex + nasal + type + vowelLabel +
type * vowelLabel + nasal * type +
(1|speaker) + (1|word_1)

lmer(formula, REML = FALSE, data = data.df)


Editor edit:

The OP did not provide sufficient number of test data to allow an actual run of the model in
lmer
for the reader. But this is not too big a issue. This is still a very good post!

Answer

I think, you are slightly over-concerned with the warning message:

fixed-effect model matrix is rank deficient so dropping 7 columns / coefficients.

It is just a warning message, not an error message. Since there is neither misuse of lmer function, nor ill-specification of your model formula, you should be able to obtain an estimated model.

But anyway, since you asked, I shall try my best to explain the story behind this.

During execution of lmer, your original model formula is broken into to parts: fixed effect formula and random effect formula. For each part, a model matrix is constructed. The construction for the fixed part is passed to the standard model matrix constructor model.matrix (formula, data.frame), while the construction for the random part is slightly more complicated, but is nothing related to your question, so we skip it.

For your model, you can check what the fixed effect model matrix looks like by:

fix.formula <- F2_difference ~ sex + nasal + type + vowelLabel + 
               type * vowelLabel + nasal * type
X <- model.matrix (fix.formula, data.df)

Since all your variables like sex, etc are factors, X will be a 0-1 binary matrix. Though model.matrix will apply contrast for each factor and the interaction between factor, like type * vowelLable, it is still possible that X does not have full column rank. This is because those variables may be linearly correlated. Therefore, any well-written model fitting routine, like lm, glm, etc., will apply QR decomposition for X to use only the non-deficient subspace of X for model fitting, shrinking other coefficients to 0 or NA. The warning message you get is a signal that such shrinkage happens. There are originally ncol(X) number of coefficients to estimate, however due to rank-deficiency, only ncol(X) - 7 will be estimated, with the remaining 7 being 0 or NA.

Such shrinkage is actually a good sign, rather than a bad one, because it ensures numerical stability in solving linear systems of equations during model fitting. So you should not bother with it. Correlation between variables are too often in real-world statistical modelling; modeller can not have control on this. Therefore, rank detection and stability should always be passed to fitting routine to deal with. Unless you are writing a model fitting routine, like lm, etc, in all cases, you should not deal with rank problem manually. That is just an ugly piece of work for non-expert.

To help you know better, let's exploit the lm function to check what will happen, if we fit a linear model with fix.formula:

fix.fit <- lm (fix.formula, data.df, method = "qr", singular.ok = TRUE)

Note that, the method = "qr" and singular.ok = TRUE are default, so actually we don't need to set it here. But writing it out helps to see that lm uses QR decomposition for X to allow rank-deficient model to fit. On the other hand, if set singular.ok = FALSE:

oo <- lm (fix.formula, data.df, method = "qr", singular.ok = FALSE)

lm will stop and complain about rank-deficiency.

Now let's check about the returned values in fix.fit.

coef <- fix.fit$coef
p <- length(coef)

stores all coefficients for model matrix X. It is guaranteed that the number of coefficients p = ncol(X). However, if you check coef by

no.NA <- sum(is.na(coef))

you should see 7 NA. Surprising? Now let's check rank

rank <- fix.fit$rank

This is the numerical rank for X. You should verify that rank + no.NA = p.

Though I am talking about lm, but the same thing happens inside lmer as well. The only difference, is that lm will not report rank-deficiency, while lmer will report this as a warning. This is actually a good thing. Too often, I see people asking why lm returning NA for some coefficients. Since lm does not report, this can be confusing for non-expert.

Hope my explanation is clear and helpful.


Update:

Don't worry about the use of anova and BIC either, the package function will take care of them properly.

I have also edited your original post, by adding tags lme4, linear regression, regression, and renaming the title to be more informative. Hopefully, people with the same doubt can easily find your post via Google, etc.