Toni Toni - 3 months ago 7
R Question

Predicting values in OLS linear model with indexed explanatory variable (years)

This is one of those questions for which there are probably a million ways to go about it that would make the actual answer irrelevant, but stubbornness gets in the way...

In trying to understand the application of time series, it is clear that de-trending the data makes predicting future values implausible. For instance, using the

gtemp
dataset from the
astsa
package, the trend upward in the past decades needs to be factored in:

enter image description here

So I end up with an ARIMA model (right or wrong) for the de-trended data, which allows me to "forecast" 10 years ahead:

fit = arima(gtemp, order = c(4, 1, 1))
pred = predict(fit, n.ahead = 10)


and an OLS trend estimator based on the values since 1950:

gtemp1 = window(gtemp, start = 1950, end = 2009)
fit2 = lm(gtemp1 ~ I(1950:2009))


The issue is how to use
predict()
to get the estimated value for the linear model part in the next 10 years.

If I run
predict(fit2, data.frame(I(2010:2019)))
I get the 60 values that I would get just running
predict(fit2)
, plus an error message:
'newdata' had 10 rows but variables found have 60 rows
.

Answer

You need:

dat <- data.frame(year = 1950:2009, gtemp1 = as.numeric(gtemp1))
fit2 <- lm(gtemp1 ~ year, data = dat)
unname( predict(fit2, newdata = data.frame(year = 2010:2019)) )
# [1] 0.4928475 0.5037277 0.5146079 0.5254882 0.5363684 0.5472487 0.5581289
# [8] 0.5690092 0.5798894 0.5907697

Alternatively if you don't want to use data argument in lm, you need:

year <- 1950:2009
fit2 <- lm(gtemp1 ~ year)
unname( predict(fit2, newdata = data.frame(year = 2010:2019)) )
# [1] 0.4928475 0.5037277 0.5146079 0.5254882 0.5363684 0.5472487 0.5581289
# [8] 0.5690092 0.5798894 0.5907697

Why your original code fails

When you do fit2 <- lm(gtemp1 ~ I(1950:2009)), lm is assuming there is a covariate called I(1950:2009):

attr(fit2$terms, "term.labels")  ## names of covariates
# [1] "I(1950:2009)"

When you make prediction later, predict will aim to find a variable in your new data frame, called I(1950:2009). However, have a look at the column names of your newdata:

colnames( data.frame(I(2010:2019)) )
# [1] "X2010.2019"

As a result, predict.lm can not find variable I(1950:2009) in newdata, then it would use the internal model frame fit2$model as newdata, and return fitted values by default (which explains why you get 60 values).