Alex - 1 month ago
R Question

Why does lm run out of memory while matrix multiplication works fine for coefficients?

I am trying to do fixed effects with R. My data looks like this

dte, yr, id, v1, v2
and has daily date values. I would like to include dummy variables for yr and for id; dte is the date. If I try to use plm and specify the index as index=c("id", "yr") and do a model with any sort of effect, I get the error that (yr, id) is not unique, which is true since my data is daily.

I then decided to simply do this by making yr a factor and using lm:

lm(v1 ~ factor(yr) + v2 - 1, data=df)

However, this seems to run out of memory. I have 20 levels in my factor and df is 14 million rows, which takes about 2 GB to store. I am running this on a machine with 22 GB dedicated to this process. I then decided to try things the old-fashioned way: create dummy variables for each of my years by doing:

df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)


and simply compute:

solve(t(x) %*% x) %*% t(x) %*% y

This runs without a problem and produces the answer almost right away. What is it in the lm function that makes this regression impossible to run and require so much memory?

I am specifically curious what it is about lm that makes it run out of memory when I can compute the coefficients just fine.
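For reference, the dummy-variable approach described above can be sketched on simulated data. This is only an illustration: the data are made up, there are three years instead of 20, and x and y are built here because the question does not show how they were constructed. crossprod(x) is used in place of t(x) %*% x; it computes the same matrix.

```r
# Simulated stand-in for the data in the question (all values are made up).
set.seed(1)
n  <- 1000
yr <- sample(1:3, n, replace = TRUE)
v2 <- rnorm(n)
v1 <- c(2, 4, 6)[yr] + 0.5 * v2 + rnorm(n)

# Design matrix: one 0/1 column per year plus v2, no intercept.
x <- cbind(t1 = 1 * (yr == 1),
           t2 = 1 * (yr == 2),
           t3 = 1 * (yr == 3),
           v2 = v2)
y <- v1

# Normal equations. crossprod(x) equals t(x) %*% x and is only 4 x 4 here,
# however many rows x has.
beta <- solve(crossprod(x), crossprod(x, y))

# lm() on the same model should give the same coefficients.
fit <- lm(v1 ~ factor(yr) + v2 - 1)
```

Comparing drop(beta) with coef(fit) shows the two routes agree on the coefficients, so the difference between them lies in what else lm computes and stores.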


Idr

lm does much more than just find the coefficients for your input features. For example, it provides diagnostic statistics that tell you more about the coefficients on your independent variables, including the standard error and t value of each coefficient.

I think that understanding these diagnostic statistics is important when running regressions to understand how valid your regression is.

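These additional calculations make lm slower than simply solving the matrix equations for the regression. To support them, lm also keeps per-row results (the residuals, the fitted values, the QR decomposition of the design matrix and, by default, a copy of the model frame), so it needs several objects as large as the data on top of the data itself.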
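A small sketch of what the fitted object carries around, using simulated data with 20 year levels as in the question. The data frame and its values are made up for illustration, and peak memory use inside lm is not measured here; the point is only which components scale with the number of rows.

```r
# Simulated data: 20 year levels, one numeric regressor (values are made up).
set.seed(1)
n  <- 10000
df <- data.frame(yr = sample(1:20, n, replace = TRUE), v2 = rnorm(n))
df$v1 <- df$yr / 10 + 0.5 * df$v2 + rnorm(n)

fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)

# Components of the fit with one entry (or one row) per observation.
n_resid  <- length(fit$residuals)      # n
n_fitted <- length(fit$fitted.values)  # n
qr_dim   <- dim(fit$qr$qr)             # n x 21, QR of the design matrix
n_frame  <- nrow(fit$model)            # n, copy of the model frame

# The normal equations only need the design matrix once,
# plus a 21 x 21 cross-product.
x    <- model.matrix(~ factor(yr) + v2 - 1, data = df)
beta <- solve(crossprod(x), crossprod(x, df$v1))
```

If only the coefficients are needed, lm(..., model = FALSE) drops the stored model frame, and lm.fit(x, y) skips the formula handling, though both still return the residuals and the QR decomposition.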

For example, using the mtcars dataset:

> lm_cars <- lm(mpg ~ ., data = mtcars)
> summary(lm_cars)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869,      Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
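To make the link to the matrix approach concrete, the columns of that coefficient table can be rebuilt by hand from the normal equations. This is a sketch using the standard OLS formulas, not a description of how lm computes them internally (lm works from a QR decomposition); the variable names are chosen here for illustration.

```r
# Design matrix and response for mpg ~ . on mtcars (32 rows, 11 columns).
x <- model.matrix(mpg ~ ., data = mtcars)
y <- mtcars$mpg

# Coefficients from the normal equations.
xtx_inv <- solve(crossprod(x))
beta    <- xtx_inv %*% crossprod(x, y)

# Residual variance on n - p degrees of freedom.
res    <- y - x %*% beta
df_res <- nrow(x) - ncol(x)
sigma2 <- sum(res^2) / df_res

# Standard errors, t values and two-sided p-values.
se   <- sqrt(sigma2 * diag(xtx_inv))
tval <- drop(beta) / se
pval <- 2 * pt(abs(tval), df_res, lower.tail = FALSE)

manual <- cbind(Estimate = drop(beta), `Std. Error` = se,
                `t value` = tval, `Pr(>|t|)` = pval)
```

The rows of manual should match coef(summary(lm_cars)) up to numerical tolerance. The extra work is the residual vector and the inverse of the cross-product, which is cheap for a handful of coefficients but still requires one more pass over all rows.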