CCID CCID - 3 months ago 14
R Question

Variance and confidence intervals for summed predictions from linear model in R

I'm summing the predicted values from a linear model with multiple predictors, as in the example below, and want to calculate the combined variance, standard error and confidence intervals for this sum.

Here is an example with built-in data set

head(trees)

lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)
summary(lm.tree)


Suppose I have a set of girths (10,12,14,16) for which I want to predict the total volume:

pr <- predict(lm.tree, newdata = list(Girth = c(10,12,14,16)), type = "response", se.fit = T)

sum(pr$fit)
# [1] 111.512


Similar questions are here (for GAMs), but I'm not sure how to proceed with the
vcov(lm.trees)
.

I'd be grateful for a reference for the method.

Answer

Basically you need to obtain full variance-covariance matrix, then sum all its elements. Here is small proof:

enter image description here

The proof here is using another theorem, which you can find from Covariance-wikipedia:

enter image description here

Specifically, the linear transform we take is a column matrix of all 1's. The resulting quadratic form is computed as following, with all "a_ij" being 1.

enter image description here


Obtain full variance-covariance matrix for prediction

The most trivial way to obtain variance-covariance matrix, is based on vcov. For your fitted model lm.tree, prediction data newdat and prediction matrix Xp:

## your model
lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)

## newdata
newdat <- list(Girth = c(10,12,14,16))

## prediction matrix
poly_Xp <- poly(unlist(newdat), degree = 2,
           coefs = attr(lm.tree$model$"poly(Girth, 2)", "coefs"),
           simple = TRUE)
Xp <- cbind(1, poly_Xp)

#                 1          2
#[1,] 1 -0.18898835  0.1098920
#[2,] 1 -0.07263008 -0.1036759
#[3,] 1  0.04372819 -0.1776927
#[4,] 1  0.16008645 -0.1121585

We can do:

VCOV1 <- Xp %*% vcov(lm.tree) %*% t(Xp)
#            [,1]      [,2]       [,3]       [,4]
#[1,]  0.89022938 0.3846809 0.04967582 -0.1147858
#[2,]  0.38468089 0.5369327 0.52828797  0.3587467
#[3,]  0.04967582 0.5282880 0.73113553  0.6582185
#[4,] -0.11478583 0.3587467 0.65821848  0.7836294

But a much more efficient way is via my function:

fast_vcov <- function (lmObject, Xp, diag = TRUE) {
  ## compute Qt
  QR <- lmObject$qr   ## qr object of fitted model
  piv <- QR$pivot    ## pivoting index
  r <- QR$rank    ## model rank / numeric rank
  if (is.unsorted(piv)) {
    ## pivoting has been done
    Qt <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
    } else {
    ## no pivoting is done
    Qt <- forwardsolve(t(QR$qr), t(Xp), r)
    }
  ## residual variance
  sigma2 <- sum(residuals(lmObject)^2) / df.residual(lmObject)
  ## return
  if (diag) {
    ## return point-wise prediction variance
    ## no need to compute full hat matrix
    return(colSums(Qt ^ 2) * sigma2)
    } else {
    ## return full variance-covariance matrix of predicted values
    return(crossprod(Qt) * sigma2)
    }
  }

With diag = TRUE, only diagonals of the variance-covariance are returned. These are point-wise variance, sufficient to produce point-wise confidence or prediction interval. Note, computing these diagonals do not need explicit formation of full hat matrix.

With diag = FALSE, full hat matrix will be returned by a more expensive matrix multiplication. But it is still faster than the trivial way based on vcov.

Here, we want:

## full variance-covariance matrix
VCOV <- fast_vcov(lm.tree, Xp, diag = FALSE)
#            [,1]      [,2]       [,3]       [,4]
#[1,]  0.89022938 0.3846809 0.04967582 -0.1147858
#[2,]  0.38468089 0.5369327 0.52828797  0.3587467
#[3,]  0.04967582 0.5282880 0.73113553  0.6582185
#[4,] -0.11478583 0.3587467 0.65821848  0.7836294

As you can see, it is exactly as same as VCOV1 obtained from the trivial method.

Obtaining predicted values is easy:

pred <- drop(Xp %*% coef(lm.tree))

In case you feel interested, you can compare those values with what is returned by predict.lm:

pr <- predict(lm.tree, newdat, se.fit = TRUE)

pr$fit
#       1        2        3        4 
#15.31863 22.33400 31.38568 42.47365 

pred
#       1        2        3        4 
#15.31863 22.33400 31.38568 42.47365 

pr$se.fit
#        1         2         3         4 
#0.9435197 0.7327569 0.8550646 0.8852284 

sqrt(diag(VCOV))
# 0.9435197 0.7327569 0.8550646 0.8852284

So, we are definitely doing correct.


Finally, the sum of all predicted values is

sum(pred)
# 111.512

with variance

sum(VCOV)
# 6.671575

Standard error is just its square root of variance.