Toni - 6 months ago 17

R Question

This is one of those questions for which there are probably a million ways to go about it that would make the actual answer irrelevant, but stubbornness gets in the way...

In trying to understand the application of time series, it is clear that de-trending the data makes predicting future values implausible. For instance, using the

`gtemp`

`astsa`

So I end up with an ARIMA model (right or wrong) for the de-trended data, which allows me to "forecast" 10 years ahead:

`fit = arima(gtemp, order = c(4, 1, 1))`

pred = predict(fit, n.ahead = 10)

and an OLS trend estimator based on the values since 1950:

`gtemp1 = window(gtemp, start = 1950, end = 2009)`

fit2 = lm(gtemp1 ~ I(1950:2009))

The issue is how to use

`predict()`

If I run

`predict(fit2, data.frame(I(2010:2019)))`

`predict(fit2)`

`'newdata' had 10 rows but variables found have 60 rows`

Answer

You need:

```
dat <- data.frame(year = 1950:2009, gtemp1 = as.numeric(gtemp1))
fit2 <- lm(gtemp1 ~ year, data = dat)
unname( predict(fit2, newdata = data.frame(year = 2010:2019)) )
# [1] 0.4928475 0.5037277 0.5146079 0.5254882 0.5363684 0.5472487 0.5581289
# [8] 0.5690092 0.5798894 0.5907697
```

Alternatively if you don't want to use `data`

argument in `lm`

, you need:

```
year <- 1950:2009
fit2 <- lm(gtemp1 ~ year)
unname( predict(fit2, newdata = data.frame(year = 2010:2019)) )
# [1] 0.4928475 0.5037277 0.5146079 0.5254882 0.5363684 0.5472487 0.5581289
# [8] 0.5690092 0.5798894 0.5907697
```

**Why your original code fails**

When you do `fit2 <- lm(gtemp1 ~ I(1950:2009))`

, `lm`

is assuming there is a covariate called `I(1950:2009)`

:

```
attr(fit2$terms, "term.labels") ## names of covariates
# [1] "I(1950:2009)"
```

When you make prediction later, `predict`

will aim to find a variable in your new data frame, called `I(1950:2009)`

. However, have a look at the column names of your `newdata`

:

```
colnames( data.frame(I(2010:2019)) )
# [1] "X2010.2019"
```

As a result, `predict.lm`

can not find variable `I(1950:2009)`

in `newdata`

, then it would use the internal model frame `fit2$model`

as `newdata`

, and return fitted values by default (which explains why you get 60 values).