Alex Alex - 3 months ago 9
R Question

Why does lm run out of memory while matrix multiplication works fine for coefficients?

I am trying to do fixed effects with R. My data looks like this

dte, yr, id, v1, v2
and has daily date values. I would like to include dummy variables for id and for yr, where dte is the date. If I try to use plm and specify the index as index=c("id", "yr") and fit a within model with any sort of effects, I get the error that (yr, id) is not unique, which is true since my data is daily.

I then decided to simply do this by making yr a factor and using lm:

lm(v1 ~ factor(yr) + v2 - 1, data=df)


However, this seems to run out of memory. I have 20 levels in my factor, and df has 14 million rows and takes about 2 GB to store. I am running this on a machine with 22 GB dedicated to this process. I then decided to try things the old-fashioned way: create dummy variables t1 to t20, one for each of my years, by doing:

df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)


etc.

and simply compute:

solve(t(x) %*% x) %*% t(x) %*% y


This runs without a problem and produces the answer almost right away. What is it in the lm function that makes this regression impossible to run and require so much memory?
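For readers who want to reproduce the comparison on a small scale, here is a minimal sketch. The data are simulated (the values and the row count n are hypothetical; only the column names follow the question), model.matrix() stands in for the manual t1 to t20 assignments, and solve() is given the system directly instead of an explicit inverse.

```r
# Simulated stand-in for the asker's data (hypothetical values;
# column names follow the question).
set.seed(1)
n  <- 1000
df <- data.frame(yr = sample(1:20, n, replace = TRUE), v2 = rnorm(n))
df$v1 <- 0.5 * df$v2 + df$yr / 10 + rnorm(n)

# model.matrix() expands factor(yr) into one 0/1 column per year,
# replacing the manual df$t1, df$t2, ... assignments.
x <- model.matrix(~ factor(yr) + v2 - 1, data = df)
y <- df$v1

# Normal equations: solve (X'X) b = X'y without forming an explicit inverse.
beta <- solve(crossprod(x), crossprod(x, y))

# Compare with the coefficients lm() reports for the same formula.
fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)
all.equal(as.vector(beta), unname(coef(fit)))
```

crossprod(x) only ever holds a 21 x 21 matrix here, which is why this route stays cheap no matter how many rows the data has.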

EDIT:
I am specifically curious what it is about lm that makes it run out of memory when I can compute the coefficients just fine.

Thanks

Idr Idr
Answer

lm does much more than just find the coefficients for your input features. For example, it provides diagnostic statistics that tell you more about those coefficients, including the standard error and t value for each of your independent variables.

I think that understanding these diagnostic statistics is important when running regressions to understand how valid your regression is.

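These additional calculations make lm slower, and more memory-hungry, than simply solving the matrix equations for the regression. Besides the coefficients, the fitted object keeps the residuals, the fitted values, the QR decomposition, and a copy of the model frame, each of which grows with the number of rows.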
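To get a feel for how much lm carries around, the sketch below fits the question's formula on simulated data (hypothetical values and row count; only the column names come from the question) and inspects the parts of the returned object that scale with the number of rows.

```r
# Simulated data with the question's column names (values are made up).
set.seed(1)
n  <- 20000
df <- data.frame(yr = sample(1:20, n, replace = TRUE), v2 = rnorm(n))
df$v1 <- 0.5 * df$v2 + df$yr / 10 + rnorm(n)

fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)

# Components whose size grows with the number of rows n.
length(fit$residuals)      # one residual per row
length(fit$fitted.values)  # one fitted value per row
dim(fit$qr$qr)             # n x p matrix from the QR decomposition
dim(fit$model)             # copy of the model frame

# Whole fitted object versus the coefficient vector alone.
format(object.size(fit), units = "MB")
format(object.size(coef(fit)), units = "KB")
```

lm accepts model = FALSE and qr = FALSE to drop some of these components from the result, but that only shrinks the returned object. The model frame, design matrix, and QR decomposition are still built while fitting, so it may not help with peak memory.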

For example, using the mtcars dataset:

> data(mtcars)
> lm_cars <- lm(mpg ~ ., data = mtcars)
> summary(lm_cars)

Call:                                                         
lm(formula = mpg ~ ., data = mtcars)                          

Residuals:                                                    
    Min      1Q  Median      3Q     Max                       
-3.4506 -1.6044 -0.1196  1.2193  4.6271                       

Coefficients:                                                 
            Estimate Std. Error t value Pr(>|t|)              
(Intercept) 12.30337   18.71788   0.657   0.5181              
cyl         -0.11144    1.04502  -0.107   0.9161              
disp         0.01334    0.01786   0.747   0.4635              
hp          -0.02148    0.02177  -0.987   0.3350              
drat         0.78711    1.63537   0.481   0.6353              
wt          -3.71530    1.89441  -1.961   0.0633 .            
qsec         0.82104    0.73084   1.123   0.2739              
vs           0.31776    2.10451   0.151   0.8814              
am           2.52023    2.05665   1.225   0.2340              
gear         0.65541    1.49326   0.439   0.6652              
carb        -0.19942    0.82875  -0.241   0.8122              
---                                                           
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom        
Multiple R-squared: 0.869,      Adjusted R-squared: 0.8066    
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07       
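
The Estimate, Std. Error, and t value columns above can also be rebuilt by hand from the matrix equations, which shows what the extra work consists of. This is only a sketch: it uses the normal equations rather than the QR decomposition that lm relies on, so it is less numerically stable, and the variable names are chosen here for illustration.

```r
data(mtcars)
X <- model.matrix(mpg ~ ., data = mtcars)    # design matrix with intercept
y <- mtcars$mpg

XtX_inv <- solve(crossprod(X))               # (X'X)^{-1}
beta    <- XtX_inv %*% crossprod(X, y)       # coefficient estimates
res     <- y - X %*% beta                    # residuals
sigma2  <- sum(res^2) / (nrow(X) - ncol(X))  # residual variance on 21 df
se      <- sqrt(diag(XtX_inv) * sigma2)      # standard errors
tval    <- as.vector(beta) / se              # t values

manual    <- cbind(Estimate = as.vector(beta), `Std. Error` = se, `t value` = tval)
reference <- coef(summary(lm(mpg ~ ., data = mtcars)))[, 1:3]

# Compare the hand-built table with lm's, ignoring dimnames.
all.equal(manual, reference, tolerance = 1e-6, check.attributes = FALSE)
```

Only the residual vector here has one entry per row. Everything else is p x p or length p, so the diagnostics themselves are not what makes a large fit expensive.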