Hercules Apergis - 22 days ago
R Question

How to apply a regression in a for loop for all the variables of a dataset while adding rows in R

This is a long question, I know, but bear with me.

I have a dataset in this form:

head(TRAINSET)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 Y
1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354 0.0078 0.0047 0.0100 -0.0022 0.0038 -0.005200012
2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351 0.0075 0.0028 0.0095 -0.0019 0.0000 0.042085781
3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347 0.0088 0.0018 0.0092 -0.0019 -0.0076 0.004577122
4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331 0.0253 0.0011 0.0092 -0.0170 -0.0076 0.010515970
5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090 0.0060 -0.0058 0.058487141
6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327 0.0109 -0.0006 0.0093 -0.0120 0.0000 -0.022896759


This is my training set, which has 300 rows. The remaining 700 rows form the test set. What I am trying to accomplish is:


  1. For each column, fit a linear model of the form Y ~ X1.

  2. Use the fitted model to predict Y from the first X1 value of the test set.

  3. After that, take the first row of the test set and rbind it to the training set (the training set now has 301 rows).

  4. Refit and predict Y using the second X1 value of the test set.

  5. Repeat for the remaining 698 rows of the test set.

  6. Apply the same procedure to all the remaining variables of the dataset (X2,...,X14).



I have managed to produce accurate results with code that I wrote for each variable separately:

fittedvaluess <- NULL                                         # empty object to fill
for (i in 1:nrow(TESTSET)) {                                  # begin iteration over the rows of the test set
  TRAINSET <- rbind(TRAINSET, TESTSET[i, ])                   # add the row to the training set
  LM <- lm(Y ~ X1, TRAINSET)                                  # fit the ever-growing OLS
  predictd <- predict(LM, TESTSET[i + 1, ], type = "response") # get the predicted value
  fittedvaluess <- cbind(fittedvaluess, predictd)             # collect the predicted values
  print(cbind(i, length(TRAINSET$LHS), length(TRAINSET$DP), nrow(TRAINSET))) # to make sure it works
}


However, I want to automate this so that it repeats over the columns. I have written this:

data <- TRAINSET                                  # because every time I had to remake the training set
fittedvaluesss <- NULL
for (i in 1:nrow(TESTSET)) {                      # begin iteration over the rows of the test set
  data <- rbind(data, TESTSET[i, ])               # rbind the row to the training set called data
  for (j in 1:ncol(TESTSET)) {                    # iterate over the columns
    LM <- lm(data$LHS ~ data[, j], data)          # fit OLS
    predictd <- predict(LM, TESTSET[i + 1, j], type = "response") # get the predicted value
    fittedvaluesss <- cbind(fittedvaluesss, predictd)             # collect the predicted value
    print(c(i, j))                                # make sure it works
  }
}


Unfortunately, the results are wrong: fittedvaluesss is a huge matrix:

dim(fittedvaluesss)
[1] 2306 3167 #Stopped around the middle of its run


That doesn't make any sense. I have even run it for

i in 1:3
and
j in 1:3


and still the matrix was far too large. I have also tried iterating over the columns first and then over the rows, with exactly the same wrong results. For some reason, in each run I was getting at least 362 values from the predict function. I am really stuck on this problem.

Any help is highly welcome.

EDIT 1: This is also known as a RECURSIVE FORECASTING methodology in Finance. It is a method to forecast future values from a model fitted to your current dataset.

Answer
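
A likely reason predict returned hundreds of values per call: the formula data$LHS ~ data[,j] stores the term literally as data[, j], so predict cannot match it to anything in newdata and falls back to the full data object, giving one value per training row. Each of those vectors was then cbind-ed as a new column. Below is a minimal sketch of that behavior on made-up data; train, new_row, bad and good are illustrative names, not objects from the question.

```r
set.seed(1)
# Hypothetical stand-in data; the column names X1 and Y mirror the question
train <- data.frame(X1 = rnorm(20), Y = rnorm(20))
new_row <- data.frame(X1 = 0.5)

# Formula built from train$... and train[, j]: the term is literally "train[, j]",
# so predict() looks it up in the environment instead of in newdata
j <- 1
bad <- lm(train$Y ~ train[, j])
bad_pred <- suppressWarnings(predict(bad, newdata = new_row))
length(bad_pred)    # 20: one value per row of train

# Formula built from column names: predict() finds X1 in newdata
good <- lm(Y ~ X1, data = train)
good_pred <- predict(good, newdata = new_row)
length(good_pred)   # 1: one value per row of new_row
```

Without suppressWarnings, the first predict call warns that newdata had 1 row but the variables found have 20 rows, which is a useful symptom to look for.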

Consider reversing your looping logic, with columns in the outer loop and rows in the inner loop. Additionally, try nested apply functions, which return structures better aligned to your needs than the for loop does. Specifically, the inner vapply() returns a numeric vector of predicted values, one per test-set row, for each iterated column. The outer sapply() then binds each returned vector as a column of a matrix.

Ultimately, fittedvaluess is a matrix with dimensions nrow(TESTSET) x (ncol(TESTSET) - 1), because the outer loop leaves out the last column: you do not regress Y on Y.

fittedvaluess <- sapply(seq_len(ncol(TESTSET) - 1), function(j) {

  col <- names(TESTSET)[[j]]                     # RETRIEVE COLUMN NAME FOR LM FORMULA

  vapply(seq_len(nrow(TESTSET)), function(r) {
    TRAINSET <- rbind(TRAINSET, TESTSET[1:r, ])  # BINDING ROWS ON AND PRIOR TO CURRENT ROW (LOCAL COPY ONLY)
    LM <- lm(as.formula(paste0("Y ~ ", col)), TRAINSET)   # FORMULA BUILT FROM COLUMN NAME
    # PREDICT THE NEXT ROW; FOR THE LAST r, ROW r + 1 DOES NOT EXIST, SO THE RESULT IS NA
    unname(predict(LM, TESTSET[r + 1, ], type = "response"))
  }, numeric(1))

})
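
If you would rather follow the numbered steps in the question literally (predict test row r from a model fitted on the training set plus only the test rows before r), a variant could look like the sketch below. The helper recursive_forecast, the generator make_data and the small synthetic data frames are assumptions made for illustration; swap in TRAINSET and TESTSET. It also assumes the response column is named Y.

```r
# Hypothetical helper: expanding-window one-step predictions for every predictor
recursive_forecast <- function(train, test, response = "Y") {
  predictors <- setdiff(names(test), response)
  sapply(predictors, function(p) {
    f <- reformulate(p, response = response)   # e.g. Y ~ X1
    vapply(seq_len(nrow(test)), function(r) {
      # Fit on the training set plus the test rows seen before row r
      window <- rbind(train, test[seq_len(r - 1), , drop = FALSE])
      fit <- lm(f, data = window)
      unname(predict(fit, newdata = test[r, , drop = FALSE]))
    }, numeric(1))
  })
}

set.seed(42)
# Small synthetic stand-ins for TRAINSET and TESTSET
make_data <- function(n) {
  d <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
  d$Y <- 0.5 * d$X1 + rnorm(n, sd = 0.1)
  d
}
train <- make_data(30)
test  <- make_data(10)

preds <- recursive_forecast(train, test)
dim(preds)        # 10 rows (test rows) by 2 columns (X1, X2)
colnames(preds)
```

With this indexing every test row gets a prediction, so there is no trailing NA, and the first row of the result comes from the model fitted on the original training set alone.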