Algina Gri - 6 months ago 61

R Question

I am trying to write a function for solving multiple regression using QR decomposition. Input: y vector and X matrix; output: b, e, R^2. So far I`ve got this and am terribly stuck; I think I have made everything way too complicated:

`QR.regression <- function(y, X) {`

X <- as.matrix(X)

y <- as.vector(y)

p <- as.integer(ncol(X))

if (is.na(p)) stop("ncol(X) is invalid")

n <- as.integer(nrow(X))

if (is.na(n)) stop("nrow(X) is invalid")

nr <- length(y)

nc <- NCOL(X)

# Householder

for (j in seq_len(nc)) {

id <- seq.int(j, nr)

sigma <- sum(X[id, j]^2)

s <- sqrt(sigma)

diag_ej <- X[j, j]

gamma <- 1.0 / (sigma + abs(s * diag_ej))

kappa <- if (diag_ej < 0) s else -s

X[j,j] <- X[j, j] - kappa

if (j < nc)

for (k in seq.int(j+1, nc)) {

yPrime <- sum(X[id,j] * X[id,k]) * gamma

X[id,k] <- X[id,k] - X[id,j] * yPrime

}

yPrime <- sum(X[id,j] * y[id]) * gamma

y[id] <- y[id] - X[id,j] * yPrime

X[j,j] <- kappa

} # end of Householder transformation

rss <- sum(y[seq.int(nc+1, nr)]^2) # residuals sum of squares

e <- rss/nr

e <- mean(residuals(QR.regression)^2)

beta <- solve(t(X) %*% X, t(X) %*% y)

for (i in seq_len(ncol(X))) # set zeros in the lower triangular side of X

X[seq.int(i+1, nr),i] <- 0

Rsq <- (X[1:nc,1:nc])^2

return(list(Rsq=Rsq, y = y, beta = beta, e = e))

}

UPDATE:

my.QR <- function(y, X) {

X <- as.matrix(X)

y <- as.vector(y)

p <- as.integer(ncol(X))

if (is.na(p)) stop("ncol(X) is invalid")

n <- as.integer(nrow(X))

if (is.na(n)) stop("nrow(X) is invalid")

qr.X <- qr(X)

b <- solve(t(X) %*% X, t(X) %*% y)

e <- as.vector(y - X %*% beta) #e

R2 <- (X[1:p, 1:p])^2

return(list(b = b, e= e, R2 = R2 ))

}

X <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)

y <- c(1,2,3,4)

my.QR(X, y)

Answer

It all depends on how much of R's built-in facility you are allowed to use to solve this problem. I already know that `lm`

is not allowed, so here is the rest of the story.

**If you are allowed to use any other routines than lm**

Then you can simply use `lm.fit`

, `.lm.fit`

or `lsfit`

for QR-based ordinary least squares solving.

```
lm.fit(X, y)
.lm.fit(X, y)
lsfit(X, y, intercept = FALSE)
```

Among those, `.lm.fit`

is the most light-weighed, while `lm.fit`

and `lsfit`

are pretty similar. The following is what we can do via `.lm.fit`

:

```
f1 <- function (X, y) {
z <- .lm.fit(X, y)
RSS <- crossprod(z$residuals)[1]
TSS <- crossprod(y - mean(y))[1]
R2 <- 1 - RSS / TSS
list(coefficients = z$coefficients, residuals = z$residuals, R2 = R2)
}
```

In the question by your fellow classmate: Toy R function for solving ordinary least squares by singular value decomposition, I have already used this to check the correctness of SVD approach.

**If you are not allowed to use R's built-in QR factorization routine qr.default**

If `.lm.fit`

is not allowed, but `qr.default`

is, then it is also not that complicated.

```
f2 <- function (X, y) {
## QR factorization `X = QR`
QR <- qr.default(X)
## After rotation of `X` and `y`, solve upper triangular system `Rb = Q'y`
b <- backsolve(QR$qr, qr.qty(QR, y))
## residuals
e <- as.numeric(y - X %*% b)
## R-squared
RSS <- crossprod(e)[1]
TSS <- crossprod(y - mean(y))[1]
R2 <- 1 - RSS / TSS
## multiple return
list(coefficients = b, residuals = e, R2 = R2)
}
```

Note, if you further want variance-covariance of estimated coefficients, follow How to calculate variance of least squares estimator using QR decomposition in R?.

**If you are not even allowed to use qr.default**

Then we have to write QR decomposition ourselves. Writing a Householder QR factorization function in R code is giving this.

Using the function `myqr`

there, we can write

```
f3 <- function (X, y) {
## our own QR factorization
## complete Q factor is not required
QR <- myqr(X, complete = FALSE)
Q <- QR$Q
R <- QR$R
## rotation of `y`
Qty <- crossprod(Q, y)
## solving upper triangular system
b <- backsolve(R, Qty)
## residuals
e <- y - X %*% b
## R-squared
RSS <- crossprod(e)[1]
TSS <- crossprod(y - mean(y))[1]
R2 <- 1 - RSS / TSS
## multiple return
list(coefficients = b, residuals = e, R2 = R2)
}
```

`f3`

is not extremely efficient, as we have formed `Q`

explicitly, even though it is the thin-`Q`

factor. In principle, we should rotate `y`

along with the QR factorization of `X`

, thus `Q`

needs not be formed.

**If you want to fix your existing code**

This requires some debugging effort so would take some time. I will make another answer regarding this later.