JohnnyDeer - 1 year ago 169
R Question

# Rolling regression and prediction with lm() and predict()

I need to apply `lm()` to an enlarging subset of my dataframe `dat`, while making a prediction for the next observation. For example, I am doing:

``````
fit model          predict
----------         -------
dat[1:3, ]         dat[4, ]
dat[1:4, ]         dat[5, ]
.                  .
.                  .
dat[-nrow(dat), ]  dat[nrow(dat), ]
``````

I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do

``````
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]

fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
``````

How can I do this automatically for all subsets, and potentially extract what I want into a table?

• From `fit`, I'd need the `summary(fit)$adj.r.squared`;

• From `predict.fit`, I'd need the `predict.fit$fit` value.

Thanks.

(Efficient) solution

This is what you can do:

``````
p <- 3  ## number of parameters in lm()
n <- nrow(dat) - 1

## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
  ## adjusted R-squared, point prediction and its standard error
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}

## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
``````

Note I have done several things inside the `bundle` function:

• I have used the `subset` argument to select the rows to fit, instead of copying a subset of `dat` first;
• I have used `model = FALSE` so that the model frame is not stored in the fitted object, which saves memory.

Overall, there is no obvious loop, but `sapply` is used.

• Fitting starts from `p`, the minimum number of observations required to fit a model with `p` coefficients;
• Fitting terminates at `nrow(dat) - 1`, as we need at least the final row for prediction.
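If a data frame is preferred over a matrix, the same idea can be sketched as below. This is only an illustration under a few assumptions: the helper name `roll_one`, the extra bookkeeping columns `fit_rows` and `pred_row`, and the seed are not part of the answer above, and the loop deliberately starts at `p + 1` rather than `p` so that every fit has at least one residual degree of freedom.

```r
set.seed(1)  # assumption: seed chosen only to make this sketch reproducible
dat <- data.frame(clicks = runif(30, 1, 100),
                  v1     = runif(30, 1, 100),
                  v12    = runif(30, 1, 100))

p <- 3               # number of coefficients in the model
n <- nrow(dat) - 1   # last row that can be used for fitting

# Hypothetical helper: fit on rows 1:x, then predict row x + 1
roll_one <- function(x) {
  fit  <- lm(log(clicks) ~ log(v1) + log(v12), data = dat,
             subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  c(fit_rows   = x,                          # rows 1:x were used for fitting
    pred_row   = x + 1,                      # row that was predicted
    adj.r2     = summary(fit)$adj.r.squared,
    prediction = unname(pred$fit),           # on the log(clicks) scale
    se         = unname(pred$se.fit))
}

# Start at p + 1 so that no fit is saturated
roll <- as.data.frame(t(sapply((p + 1):n, roll_one)))
head(roll)
```

Because the elements returned by `roll_one` are named, `sapply` carries those names through and no separate `colnames<-` step is needed. Note that `prediction` is on the log scale, since the response in the formula is `log(clicks)`.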

Test

Example data (with 30 "observations")

``````
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))
``````

Applying the code above gives `result` (27 rows in total; output truncated to the first 5 rows)

``````
           adj.r2 prediction        se
[1,]          NaN   3.881068       NaN
[2,]  0.106592619   3.676821 0.7517040
[3,]  0.545993989   3.892931 0.2758347
[4,]  0.622612495   3.766101 0.1508270
[5,]  0.180462206   3.996344 0.2059014
``````

The first column is the adjusted R-squared of the fitted model, the second column is the prediction, and the third is the standard error of that prediction. The first value of `adj.r2` is `NaN` because the first model we fit has 3 coefficients for 3 data points, so no sensible statistic is available. The same happens to `se`: the fit passes through all three points exactly, so every residual is 0 and there are no residual degrees of freedom left to estimate the uncertainty of the prediction.
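The `NaN` in the first row can be checked directly by fitting the 3-coefficient model to the first 3 rows only. The snippet below is a small sketch; the seed and the object names `fit3` and `pred3` are assumptions made for illustration.

```r
set.seed(1)  # assumption: seed chosen only to make this sketch reproducible
dat <- data.frame(clicks = runif(30, 1, 100),
                  v1     = runif(30, 1, 100),
                  v12    = runif(30, 1, 100))

# 3 coefficients estimated from 3 observations: a saturated fit
fit3 <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:3)

df.residual(fit3)             # residual degrees of freedom: 3 - 3
max(abs(residuals(fit3)))     # the fit reproduces the 3 points
summary(fit3)$adj.r.squared   # not a usable statistic for this fit

# The standard error of the next prediction is undefined for the same reason
pred3 <- predict(fit3, newdata = dat[4, ], se.fit = TRUE)
pred3$se.fit
```

If these degenerate values are not wanted in the table, one option is to start the rolling window at `p + 1` instead of `p`.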
