FakeBrain FakeBrain - 1 year ago 56
R Question

concatenate multiple steps in r function for computing beta value

I'm new to r (but I mainly code in python), and I'm trying to write down the code for simple linear regression for my own understanding, and I'm at the point of estimating beta1

given a predictor column

and a response
I want to do in pseudo code:

sum((x[i] - mean(x)) * (y[i] - mean(y)) / sum(x[i] - mean(x))^2

so in r:

m <- rbind(c(2,3),c(1,2),c(0,3)))

Since i've read that for loops are the devil.... I thought maybe I could do something like:

beta1 <- function(x, y){
c <- cbind(x,y)
b1 <- apply(c, 2, function(v) v - mean(v))
b1 <- b1[,1] * b1[,2]
b1top <- sum(b1)
b1bottom <- sum((x - mean(x))^2)
b1 <- b1top / b1bottom

[1] 0

Now, putting aside that the implementation might be wrong to start....what are some ways of shortening the amount of work, in terms of lines of code inside the function?

Answer Source

You're right in that for loops are bad. Your approach is already pretty fast given you're adopting a vectorized approach whenever possible (i.e. you're treating x as a vector and just doing scalar subtraction by subtracting the mean of x from each element, instead of looping through it manually and subtracting within the loop). Thats what youre doing in the second half of your code anyway

You can shorten the first half of your code in the same way

So instead of applying the function to subtract the mean, you can do it directly (e.g. x - mean(x)). Which means your numerator can be calculated like this:

b1Top <- sum((x - mean(x)) * (y - mean(y)))
b1bottom <- sum((x - mean(x))^2)
b1 <- b1top / b1bottom

That method is going to get a bit heavy if you have more than 1 predictor though. Theres another way to calculate regression weights using a vectorized approach (using matrices).

Regression weights can be calculated entirely using matrix operations from raw data. Weights are given by:


Where X is your predictor variable matrix, and Y is your response variable

First we need to create the predictor/design matrix by taking your predictors and adding a 1's column for the intercept:

xData <- data.frame(1, x)
designMatrix <- data.matrix(xData)

Next we calculate the sum of squares cross products matrix (X'X):

SSCP <- t(designMatrix) %*% designMatrix

Then invert it:

inverseSSCP <- solve(SSCP)

Multiply that by the transpose of the design matrix:

inverseMult <- inverseSSCP %*% t(designMatrix)

Finally multiply that by your Y vector:

betas <- inverseMult %*% y

The original approach doesnt extend well into multiple predictors as you'll start to lose the vectorized power of R, so when you get to that stage you're going to end up writing many more lines of code. The matrix approach allows you to calculate regression weights for all of your predictors in one go, regardless of how many there are.