
Regression Subset Algorithm in R

I want to build a regression subset algorithm in R for a 'beta regression' model.
There is a package in R, betareg, which fits beta regressions, and what I am interested in is the model that maximises the log-likelihood.

Basically, this works by picking the best k-factor regression model, for k = 1, 2, ..., p, where p is the number of variables you have.

For example, suppose I have x_1, x_2, x_3 as my variables and y as my response. I want something that does:

Step 1: Find the best 1-factor model

mod1 <- betareg(y~x_1, data = test)
mod1.sum <- summary(mod1)

mod2 <- betareg(y~x_2, data = test)
mod2.sum <- summary(mod2)

mod3 <- betareg(y~x_3, data = test)
mod3.sum <- summary(mod3)

Now that I have fit all the models, I want to compare the log-likelihood of each:

likelihoods <- c( mod1.sum$loglik, mod2.sum$loglik, mod3.sum$loglik)
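
For instance, the index of the winning model can then be picked out with which.max():

best1 <- which.max(likelihoods)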

Step 2: Find the best factor to add to the best 1-factor model. Let's assume x_1 was the best in the previous step; then in this step we compare the model with x_1 and x_2 to the model with x_1 and x_3, choosing the one with the biggest log-likelihood, as sketched below.
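
A minimal sketch of this comparison, assuming x_1 won Step 1 and reusing the test data from above:

mod12 <- betareg(y ~ x_1 + x_2, data = test)
mod13 <- betareg(y ~ x_1 + x_3, data = test)
which.max(c(logLik(mod12), logLik(mod13)))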

Step 3: Taking the best two variables as given, find the third variable contributing the largest increase in log-likelihood.

Step 4: Return the best 1-factor model, best 2-factor model, ..., best p-factor model, the factors included, and their corresponding log-likelihoods.

I am struggling to do this efficiently when p is large, say around 40.


In addition to my other answer that shows how to do best-subset selection for beta regression with kofnGA, I'm including an example of how to do forward selection by hand.

We again start by loading the package and data:

library("betareg")
data("FoodExpenditure", package = "betareg")

I'm also setting up a list with the response plus all regressors, including 40 random noise regressors. (Note that unlike in the other answer I'm keeping the intercept in x, which is more convenient here.)

fe_data <- list(
  y = with(FoodExpenditure, food/income),
  x = model.matrix(~ income + persons, data = FoodExpenditure)
)
set.seed(123)  # fix the seed for the noise regressors (the exact seed value is an assumption)
fe_data$x <- cbind(fe_data$x,
  matrix(rnorm(40 * nrow(fe_data$x)), ncol = 40))
colnames(fe_data$x)[4:43] <- paste0("x", 1:40)

Then we again set up a function for the negative log-likelihood (but this time we do not need to include the intercept manually because it is still in x):

nll <- function(v, data) -betareg.fit(x = data$x[, v, drop = FALSE],
  y = data$y)$loglik
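
For example, a quick sanity check evaluating the model with just the intercept and income (columns 1 and 2 of x):

nll(c(1, 2), data = fe_data)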

Then we store the indexes of all possible regressors in vall and initialize our search with the intercept only (v <- 1) and the corresponding negative log-likelihood (n):

vall <- 1:ncol(fe_data$x)
v <- 1
n <- nll(v, data = fe_data)

And then we run our forward search for 15 iterations (to avoid numerical instabilities on this small data set for higher numbers of variables). In each iteration we select the additional variable that decreases the negative log-likelihood the most:

for(i in 1:15) {
  vi <- vall[-v]
  ni <- sapply(vi, function(vii) nll(v = c(v, vii), data = fe_data))
  v <- c(v, vi[which.min(ni)])
  n <- c(n, ni[which.min(ni)])
}

The sequence in which the variables are chosen, colnames(fe_data$x)[v], is the following. Note that the real regressors are picked first, followed by the random noise regressors. (But try set.seed(1) above, which will include random regressors before the real ones.)

##  [1] "(Intercept)" "income"      "persons"     "x28"         "x18"        
##  [6] "x29"         "x22"         "x11"         "x5"          "x8"         
## [11] "x38"         "x24"         "x13"         "x23"         "x36"        
## [16] "x16"        
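
The path can also be returned in the shape asked for in Step 4 of the question, i.e., each step's added variable together with the cumulative log-likelihood (a small sketch using the v and n vectors from the search):

data.frame(variable = colnames(fe_data$x)[v], loglik = -n)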

The corresponding decrease in the negative log-likelihood and the associated BIC can be visualized as follows. Since each model has m mean coefficients (including the intercept) plus one precision parameter, the BIC is 2 * n + log(nobs) * (m + 1):

m <- seq_along(v)
plot(m, n, type = "b",
  xlab = "Number of regressors", ylab = "Negative log-likelihood")
plot(m, 2 * n + log(nrow(fe_data$x)) * (m + 1), type = "b",
  xlab = "Number of regressors", ylab = "BIC")

So this would indeed pick the model with the three real regressors (intercept, income, and persons) as the best BIC model.
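
To extract that model programmatically, a short sketch reusing the v, n, and m objects from the search above:

bic <- 2 * n + log(nrow(fe_data$x)) * (m + 1)
vbest <- v[1:which.min(bic)]
colnames(fe_data$x)[vbest]  # regressors of the BIC-best model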