Alex - 3 months ago 9
R Question

# Why does lm run out of memory while matrix multiplication works fine for coefficients?

I am trying to do fixed effects with R. My data looks like this: `dte, yr, id, v1, v2`, and has daily date values. I would like to include dummy variables for `id` and for `yr`, where `dte` is the date. If I try to use `plm` and specify the index as `index=c("id", "yr")` and do a `within` model with any sort of `effects`, I get the error that `(yr, id)` is not unique, which is true since my data is daily.

I then decided to simply do this by making `yr` a factor and using `lm`:

``````
lm(v1 ~ factor(yr) + v2 - 1, data=df)
``````

However, this seems to run out of memory. I have 20 levels in my factor and `df` is 14 million rows, which takes about 2 GB to store. I am running this on a machine with 22 GB dedicated to this process. I then decided to try things the old-fashioned way: create dummy variables for each of my years, `t1` to `t20`, by doing:

``````
df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)
``````

etc.

and simply compute:

``````
solve(t(x) %*% x) %*% t(x) %*% y
``````

This runs without a problem and produces the answer almost right away. What is it in the `lm` function that makes this regression impossible to run and requires so much memory?
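For reference, the normal-equations approach described above can be sketched on simulated data. The sample size, seed, and simulated values are assumptions for illustration and are not the asker's data. The sketch builds the year dummies with `model.matrix` rather than by hand, and uses `crossprod` so that `t(x)` is never formed explicitly.

```r
set.seed(1)
n  <- 10000                       # far smaller than the 14 million rows in the question
df <- data.frame(
  yr = sample(1:20, n, replace = TRUE),
  v2 = rnorm(n)
)
df$v1 <- 0.5 * df$v2 + df$yr / 10 + rnorm(n)   # simulated response (assumption)

# One dummy column per year plus v2, no intercept:
# the same design as v1 ~ factor(yr) + v2 - 1
x <- model.matrix(~ factor(yr) + v2 - 1, data = df)
y <- df$v1

# Solve (X'X) b = X'y directly instead of inverting X'X first
beta <- solve(crossprod(x), crossprod(x, y))

# On this small sample lm fits without trouble, so the two can be compared
fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)
max(abs(beta - coef(fit)))        # largest absolute difference between the two fits
```

`crossprod(x)` is only 21 x 21 here, which is why this route needs little memory beyond `x` itself.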

EDIT:
I am specifically curious what it is about `lm` that makes it run out of memory when I can compute the coefficients just fine.

Thanks

`lm` does much more than just find the coefficients for your input features. For example, it provides diagnostic statistics that tell you more about the coefficients on your independent variables, including the standard error and t value of each coefficient.

I think that understanding these diagnostic statistics is important when running regressions to understand how valid your regression is.

These additional calculations cause `lm` to be slower than simply solving the matrix equations for the regression.
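One way to see where the memory goes is to look at what a fitted `lm` object holds. By default it keeps the model frame, the QR decomposition of the design matrix, the residuals, and the fitted values, all of which grow with the number of rows. The sketch below is a rough illustration on simulated data (the sizes and values are assumptions), not a profile of the asker's actual run.

```r
set.seed(1)
n  <- 100000
df <- data.frame(yr = sample(1:20, n, replace = TRUE), v2 = rnorm(n))
df$v1 <- 0.5 * df$v2 + rnorm(n)   # simulated response (assumption)

fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)

# Size in bytes of each component stored in the fitted object
sizes <- sapply(fit, object.size)
sort(sizes, decreasing = TRUE)

# The coefficient vector is tiny by comparison
object.size(coef(fit))

# Rough size of one dense design matrix of doubles for the question's data:
# 14 million rows x 21 columns x 8 bytes, about 2.2 GiB
rows <- 14e6
cols <- 21
rows * cols * 8 / 1024^3
```

Since `lm` holds the model frame, the design matrix, and the QR decomposition at various points during the fit, several objects of that order can exist at once. If only the coefficients are needed, `lm.fit(x, y)` on a prebuilt matrix, or `lm(..., model = FALSE)`, may reduce what is kept, though how much that helps at 14 million rows is not something this sketch establishes.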

For example, using the `mtcars` dataset:

``````
> data(mtcars)
> lm_cars <- lm(mpg ~ ., data = mtcars)
> summary(lm_cars)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869,      Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
``````
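The coefficient table above can also be rebuilt by hand from the same matrix quantities used to get the coefficients. The sketch below does this for `mtcars` with the textbook formulas. It is not how `lm` computes them internally (`lm` works from a QR decomposition), and the variable names are chosen here for illustration.

```r
data(mtcars)
x <- model.matrix(mpg ~ ., data = mtcars)   # design matrix with intercept
y <- mtcars$mpg

xtx_inv <- solve(crossprod(x))              # (X'X)^-1
beta    <- xtx_inv %*% crossprod(x, y)      # coefficient estimates

res    <- y - x %*% beta                    # residuals
df_res <- nrow(x) - ncol(x)                 # residual degrees of freedom: 32 - 11 = 21
sigma2 <- sum(res^2) / df_res               # estimated residual variance

se   <- sqrt(diag(xtx_inv) * sigma2)        # standard errors
tval <- beta[, 1] / se                      # t values
pval <- 2 * pt(abs(tval), df_res, lower.tail = FALSE)   # two-sided p-values

manual <- cbind(Estimate = beta[, 1], `Std. Error` = se,
                `t value` = tval, `Pr(>|t|)` = pval)

# Compare against the table that summary() reports
ref <- coef(summary(lm(mpg ~ ., data = mtcars)))
all.equal(manual, ref)
```

Everything after `beta` needs only the small matrix `xtx_inv` and one pass over the residuals, so the diagnostics themselves are cheap to compute this way.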