Algina Gri - 7 months ago 85
R Question

# Multiple regression analysis in R using QR decomposition

I am trying to write a function for solving multiple regression using QR decomposition. Input: y vector and X matrix; output: b, e, R^2. So far I`ve got this and am terribly stuck; I think I have made everything way too complicated:

``````QR.regression <- function(y, X) {
X <- as.matrix(X)
y <- as.vector(y)
p <- as.integer(ncol(X))
if (is.na(p)) stop("ncol(X) is invalid")
n <- as.integer(nrow(X))
if (is.na(n)) stop("nrow(X) is invalid")
nr <- length(y)
nc <- NCOL(X)

# Householder
for (j in seq_len(nc)) {
id <- seq.int(j, nr)
sigma <- sum(X[id, j]^2)
s <- sqrt(sigma)
diag_ej <- X[j, j]
gamma <- 1.0 / (sigma + abs(s * diag_ej))
kappa <- if (diag_ej < 0) s else -s
X[j,j] <- X[j, j] - kappa
if (j < nc)
for (k in seq.int(j+1, nc)) {
yPrime <- sum(X[id,j] * X[id,k]) * gamma
X[id,k] <- X[id,k] - X[id,j] * yPrime
}
yPrime <- sum(X[id,j] * y[id]) * gamma
y[id] <- y[id] - X[id,j] * yPrime
X[j,j] <- kappa
} # end of Householder transformation

rss <- sum(y[seq.int(nc+1, nr)]^2)  # residuals sum of squares
e <- mean(residuals(QR.regression)^2)
beta <- solve(t(X) %*% X, t(X) %*% y)
for (i in seq_len(ncol(X))) # set zeros in the lower triangular side of X
X[seq.int(i+1, nr),i] <- 0
Rsq <- (X[1:nc,1:nc])^2
return(list(Rsq=Rsq, y = y, beta = beta, e = e))
}

UPDATE:
my.QR <- function(y, X) {
X <- as.matrix(X)
y <- as.vector(y)
p <- as.integer(ncol(X))
if (is.na(p)) stop("ncol(X) is invalid")
n <- as.integer(nrow(X))
if (is.na(n)) stop("nrow(X) is invalid")
qr.X <- qr(X)
b <- solve(t(X) %*% X, t(X) %*% y)
e <- as.vector(y - X %*% beta) #e
R2 <- (X[1:p, 1:p])^2
return(list(b = b, e= e, R2 = R2 ))
}

X <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
y <- c(1,2,3,4)
my.QR(X, y)
``````

It all depends on how much of R's built-in facility you are allowed to use to solve this problem. I already know that `lm` is not allowed, so here is the rest of the story.

If you are allowed to use any other routines than `lm`

Then you can simply use `lm.fit`, `.lm.fit` or `lsfit` for QR-based ordinary least squares solving.

``````lm.fit(X, y)
.lm.fit(X, y)
lsfit(X, y, intercept = FALSE)
``````

Among those, `.lm.fit` is the most light-weighed, while `lm.fit` and `lsfit` are pretty similar. The following is what we can do via `.lm.fit`:

``````f1 <- function (X, y) {
z <- .lm.fit(X, y)
TSS <- crossprod(y - mean(y))[1]
R2 <- 1 - RSS / TSS
list(coefficients = z\$coefficients, residuals = z\$residuals, R2 = R2)
}
``````

In the question by your fellow classmate: Toy R function for solving ordinary least squares by singular value decomposition, I have already used this to check the correctness of SVD approach.

If you are not allowed to use R's built-in QR factorization routine `qr.default`

If `.lm.fit` is not allowed, but `qr.default` is, then it is also not that complicated.

``````f2 <- function (X, y) {
## QR factorization `X = QR`
QR <- qr.default(X)
## After rotation of `X` and `y`, solve upper triangular system `Rb = Q'y`
b <- backsolve(QR\$qr, qr.qty(QR, y))
## residuals
e <- as.numeric(y - X %*% b)
## R-squared
TSS <- crossprod(y - mean(y))[1]
R2 <- 1 - RSS / TSS
## multiple return
list(coefficients = b, residuals = e, R2 = R2)
}
``````

Note, if you further want variance-covariance of estimated coefficients, follow How to calculate variance of least squares estimator using QR decomposition in R?.

If you are not even allowed to use `qr.default`

Then we have to write QR decomposition ourselves. Writing a Householder QR factorization function in R code is giving this.

Using the function `myqr` there, we can write

``````f3 <- function (X, y) {
## our own QR factorization
## complete Q factor is not required
QR <- myqr(X, complete = FALSE)
Q <- QR\$Q
R <- QR\$R
## rotation of `y`
Qty <- crossprod(Q, y)
## solving upper triangular system
b <- backsolve(R, Qty)
## residuals
e <- y - X %*% b
## R-squared
`f3` is not extremely efficient, as we have formed `Q` explicitly, even though it is the thin-`Q` factor. In principle, we should rotate `y` along with the QR factorization of `X`, thus `Q` needs not be formed.