Hercules Apergis - 7 months ago 41

R Question

This is a long question, I know, but bear with me.

I have a dataset in this form:

`head(TRAINSET)`

             X1        X2        X3         X4    X5   X6     X7     X8     X9     X10     X11    X12     X13     X14            Y
    1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354  0.0078  0.0047 0.0100 -0.0022  0.0038 -0.005200012
    2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351  0.0075  0.0028 0.0095 -0.0019  0.0000  0.042085781
    3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347  0.0088  0.0018 0.0092 -0.0019 -0.0076  0.004577122
    4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331  0.0253  0.0011 0.0092 -0.0170 -0.0076  0.010515970
    5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090  0.0060 -0.0058  0.058487141
    6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327  0.0109 -0.0006 0.0093 -0.0120  0.0000 -0.022896759

This is my Train set, which has 300 rows. The remaining 700 rows are the Test set. What I am trying to accomplish is:

- For each column, fit a linear model of the form `Y ~ X1`.
- Use the fitted model to get the predicted value of Y from the X1 of the first row of the Test set.
- After that, take the first row of the Test set and `rbind` it to the Train set (the Train set now has 301 rows).
- Predict the value of Y using the X1 of the 2nd row of the Test set.
- Repeat for the remaining rows of the Test set.
- Apply the same procedure to all the remaining variables of the dataset (X2, ..., X14).
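
The steps above can be sketched for a single predictor roughly as follows. This is only a minimal sketch on simulated data, not the real dataset: the helper name `recursive_forecast` and the toy `train`/`test` frames are made up for illustration. It follows the list literally, predicting a test row first and only then appending that row to the training data.

```r
# Hypothetical helper: expanding-window (recursive) one-step forecasts of Y
# from a single predictor column `xcol`.
recursive_forecast <- function(train, test, xcol) {
  fml <- reformulate(xcol, response = "Y")            # builds Y ~ <xcol>
  preds <- numeric(nrow(test))
  for (i in seq_len(nrow(test))) {
    fit <- lm(fml, data = train)                      # fit on all rows seen so far
    preds[i] <- predict(fit, newdata = test[i, , drop = FALSE])
    train <- rbind(train, test[i, , drop = FALSE])    # then grow the training set
  }
  preds
}

# Simulated data (assumption: stands in for TRAINSET / TESTSET)
set.seed(1)
n <- 30
toy <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
toy$Y <- 0.5 * toy$X1 + rnorm(n, sd = 0.1)
train <- toy[1:20, ]
test  <- toy[21:30, ]

p1 <- recursive_forecast(train, test, "X1")
length(p1)  # one forecast per test row
```

Calling the helper once per predictor name, for example with `sapply(c("X1", "X2"), function(x) recursive_forecast(train, test, x))`, would give one column of forecasts per variable.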

I have managed to produce accurate results with code that I wrote for each variable separately:

    fittedvaluess <- NULL  # empty set to fill
    for (i in 1:nrow(TESTSET)) {  # begin iteration over the rows of the Test set
      TRAINSET <- rbind(TRAINSET, TESTSET[i, ])  # add the row to the train set
      LM <- lm(Y ~ X1, TRAINSET)  # fit the ever-growing OLS
      predictd <- predict(LM, TESTSET[i + 1, ], type = "response")  # get the predicted value
      fittedvaluess <- cbind(fittedvaluess, predictd)  # collect the vector of predicted values
      print(cbind(i, length(TRAINSET$LHS), length(TRAINSET$DP), nrow(TRAINSET)))  # to make sure it works
    }

However, I want to automate this so that it repeats over the columns. I have written this:

    data <- TRAINSET  # because every time I had to remake the train set
    fittedvaluesss <- NULL
    for (i in 1:nrow(TESTSET)) {  # begin iteration over the rows of the Test set
      data <- rbind(data, TESTSET[i, ])  # rbind the row to the train set called data
      for (j in 1:ncol(TESTSET)) {  # iterate over the columns
        LM <- lm(data$LHS ~ data[, j], data)  # fit OLS
        predictd <- predict(LM, TESTSET[i + 1, j], type = "response")  # get the predicted value
        fittedvaluesss <- cbind(fittedvaluesss, predictd)  # collect the predicted value
        print(c(i, j))  # make sure it works
      }
    }

The results are unfortunately wrong: `fittedvaluesss` is a huge matrix:

    dim(fittedvaluesss)
    [1] 2306 3167  # stopped around the middle of its run

This doesn't make any sense. I have even run it for `i in 1:3` and `j in 1:3`, and the matrix was still insanely huge. I have also tried starting the iteration from the columns and then going over the rows, with exactly the same wrong results. For some reason, in each run I was getting at least 362 values from the `predict` function. I am really stuck on this problem.

Any help is highly welcome.

EDIT 1: This is also known as a RECURSIVE FORECASTING methodology in Finance. It is a method to forecast future values from a model fitted to your current dataset.

Answer

Consider reversing your looping logic, with columns in the outer loop and rows in the inner loop. Additionally, try nested apply functions, which return structures better aligned to your needs than the `for` loop. Specifically, the inner `vapply()` returns a numeric vector of all the test set's predicted values for each iterated column. Then the outer `sapply()` binds each returned vector to a column of a matrix.

Ultimately, `fittedvaluess` is a matrix with dimensions `nrow(TESTSET)` x `ncol(TESTSET) - 1`. Notice too that the outer loop leaves out the last column, since you do not regress Y on Y.
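
To see the shape this nesting produces, here is a toy sketch that leaves the regression out entirely. The sizes (4 rows, 3 columns) and the dummy value `r * 10 + k` are arbitrary assumptions chosen only to make the layout visible.

```r
n_rows <- 4
n_cols <- 3

# Inner vapply(): one numeric value per row.
# Outer sapply(): one such vector per column, bound into a matrix.
m <- sapply(seq_len(n_cols), function(k) {
  vapply(seq_len(n_rows), function(r) r * 10 + k, numeric(1))
})

dim(m)  # 4 3: rows come from the inner loop, columns from the outer loop
```

Using `vapply()` with `numeric(1)` for the inner loop also makes R raise an error if any iteration returns something other than a single number, which would have flagged the oversized `predict()` output immediately.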

```
fittedvaluess <- sapply(1:(ncol(TESTSET) - 1), function(k) {
  col <- names(TESTSET)[[k]]                          # RETRIEVE COLUMN NAME FOR LM FORMULA
  vapply(1:nrow(TESTSET), function(r) {
    train <- rbind(TRAINSET, TESTSET[1:r, ])          # BINDING ROWS ON AND PRIOR TO CURRENT ROW
    LM <- lm(as.formula(paste0("Y ~ ", col)), train)  # FORMULA BUILT FROM CONCATENATED STRING
    # PREDICT THE NEXT TEST ROW; FOR THE LAST r THERE IS NO NEXT ROW, SO THE RESULT IS NA
    predict(LM, TESTSET[r + 1, ], type = "response")
  }, numeric(1))
})
```
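
As for why the original nested loop returned hundreds of values per call, the likely cause is the formula. Written as `data$LHS ~ data[, j]`, it refers to the vectors themselves rather than to column names, so `predict()` finds nothing to match in `newdata` and falls back to the training data, returning one value per training row. A minimal sketch on simulated data (the frame `d`, its columns, and `new_row` are made up for illustration):

```r
set.seed(42)
d <- data.frame(X1 = rnorm(30))
d$Y <- 2 * d$X1 + rnorm(30, sd = 0.1)
new_row <- data.frame(X1 = 0.5)

# Formula built from vectors: the model variables are literally `d$Y` and
# `d[, 1]`, so nothing in `new_row` matches and the training vector is reused.
bad <- lm(d$Y ~ d[, 1], data = d)
n_bad <- length(suppressWarnings(predict(bad, newdata = new_row)))
n_bad   # 30, one value per training row

# Formula built from column names: predict() looks up X1 in `new_row`.
good <- lm(Y ~ X1, data = d)
n_good <- length(predict(good, newdata = new_row))
n_good  # 1
```

Building the formula from the column name, as in the answer's code, is what keeps each prediction to a single value.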

Source (Stackoverflow)