Alex - 9 months ago 29

R Question

I am trying to do fixed effects with R. My data looks like this

`dte, yr, id, v1, v2`

`id`

`yr`

`dte`

`plm`

`index=c("id", "yr")`

`within`

`effects`

`(yr, id)`

I then decided to simply do this by making `yr` a factor and using `lm`:

`lm(v1 ~ factor(yr) + v2 - 1, data=df)`

However, this seems to run out of memory. I have 20 levels in my factor and

`df`

`t1`

`t20`

`df$t1 <- 1*(df$yr==1)`

`df$t2 <- 1*(df$yr==2)`

`df$t3 <- 1*(df$yr==3)`

etc.

and simply compute:

`solve(t(x) %*% x) %*% t(x) %*% y`

This runs without a problem and produces the answer almost right away. What is it in the

`lm`

EDIT:

I am specifically curious what it is about `lm` that makes it run out of memory when I can compute the coefficients just fine.

Thanks

Answer

`lm` does much more than just find the coefficients for your input features. For example, it provides diagnostic statistics that tell you more about the coefficients on your independent variables, including the standard error and t value of each coefficient.

I think that understanding these diagnostic statistics is important when running regressions, because they indicate how valid your regression is.

These additional calculations cause `lm` to be slower than simply solving the matrix equations for the regression.
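To make the "much more" concrete, here is a minimal sketch that fits the same model both ways and then looks at what the `lm` object carries around. It uses the built-in `mtcars` data rather than the asker's `df`, and the object names (`fit`, `X`, `y`, `beta`) are illustrative assumptions, not anything from the question.

```r
# Sketch: compare what lm() returns with a bare normal-equations solve.
fit <- lm(mpg ~ ., data = mtcars)

# Build the design matrix and response explicitly
X <- model.matrix(mpg ~ ., data = mtcars)
y <- mtcars$mpg

# Normal equations: only p x p and p x 1 cross-products are formed
beta <- solve(crossprod(X), crossprod(X, y))

# The two sets of coefficients agree up to floating-point error
all.equal(unname(coef(fit)), as.vector(beta), tolerance = 1e-6)

# The lm object holds far more than the coefficients
names(fit)

# Approximate size of each stored component, in bytes
sapply(fit, function(comp) as.numeric(object.size(comp)))
```

Components such as `residuals`, `fitted.values`, `qr`, and `model` all grow with the number of rows, whereas the cross-products in the manual solve only grow with the number of columns. On a very large data frame that difference is a plausible source of the extra memory use, although this sketch does not measure peak memory during fitting.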

For example, using the `mtcars` dataset:

```
> data(mtcars)
> lm_cars <- lm(mpg ~ ., data = mtcars)
> summary(lm_cars)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
```
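If you want those diagnostic columns without going through `lm`, they can be recovered from the same matrix algebra used in the question. The following is only a sketch under the usual ordinary least squares assumptions, again on `mtcars`; the variable names (`XtX_inv`, `sigma2`, `manual`, and so on) are made up for illustration.

```r
# Sketch: recompute the coefficient table of summary(lm) by hand.
X <- model.matrix(mpg ~ ., data = mtcars)   # n x p design matrix
y <- mtcars$mpg
n <- nrow(X)
p <- ncol(X)

XtX_inv <- solve(crossprod(X))              # (X'X)^{-1}, a p x p matrix
beta    <- XtX_inv %*% crossprod(X, y)      # coefficient estimates

res    <- y - X %*% beta                    # residuals
sigma2 <- sum(res^2) / (n - p)              # residual variance on n - p df
se     <- sqrt(sigma2 * diag(XtX_inv))      # standard errors
tval   <- as.vector(beta) / se              # t statistics
pval   <- 2 * pt(abs(tval), df = n - p, lower.tail = FALSE)

manual <- cbind(Estimate = as.vector(beta), `Std. Error` = se,
                `t value` = tval, `Pr(>|t|)` = pval)

# Compare against the table that summary() builds from the lm fit
reference <- coef(summary(lm(mpg ~ ., data = mtcars)))
all.equal(manual, reference, tolerance = 1e-6, check.attributes = FALSE)
```

Apart from the design matrix itself, the only objects here that grow with the number of rows are the response and the residual vector. `lm` also accepts `model = FALSE`, `x = FALSE`, `y = FALSE`, and `qr = FALSE` to shrink the returned object, though it still builds the model frame and a QR decomposition while fitting, so those arguments may not be enough on their own for very large data.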

Source (Stackoverflow)