Q007 - 10 months ago 80

R Question

I have a data frame with two columns:

var_1<-seq(1:252)

var_2<-runif(1:252)*1000

my_new_df<-data.frame(var_1,var_2)

names(my_new_df)<-c("Time_values","Count")

train_poly_data<-my_new_df[1:150,c("Time_values","Count")] # training data set

valid_poly_data<-my_new_df[151:200,c("Time_values","Count")] # validation data set

test_poly_data<-my_new_df[201:252,c("Time_values","Count")] # test data set

#obtain a polynomial regression model of degree 20

poly_tr<-lm(train_poly_data$Count ~ poly(train_poly_data$Time_values,degree=20,raw = TRUE))

summary(poly_tr)

predict(poly_tr, valid_poly_data)

#getting the following warnings

Warning messages:

1: 'newdata' had 50 rows but variables found have 150 rows

2: In predict.lm(poly_tr, valid_poly_data) :

prediction from a rank-deficient fit may be misleading

Here is what I need to do:

I need to split the data frame into training, validation, and test data sets.

Next, I want to fit a polynomial regression on the training data and validate it on the validation data.

But I keep getting these warnings. How do I resolve them? I am also interested in finding the optimal degree of the polynomial, because I want to see whether my arbitrarily picked degree of 20 is reasonable.

Any suggestions or help pointing out my mistake are welcome.

I understand that the first warning is thrown because there are 150 rows in the training data set and 50 in the validation data set, but how do I fix it?

Answer

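The first warning arises because the formula refers to `train_poly_data$Count` and `train_poly_data$Time_values` directly, so `predict` ignores the columns of `newdata` and reuses the 150 training rows. It will go away if you convert the validation data to the same format as the training data before you run `predict`, so that both the training and validation data have exactly the same set of regressors / predictor variables.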
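An alternative way to avoid the first warning, different from the approach in the code further down, is to write the formula with bare column names and pass the data frame through `data=`. The following is only a minimal sketch: the seed and the low degree of 3 are assumptions chosen so that the example is reproducible and isolates the first warning.

```r
# Assumptions: fixed seed and degree 3, both chosen only for illustration
set.seed(1)
my_new_df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)
train_poly_data <- my_new_df[1:150, ]
valid_poly_data <- my_new_df[151:200, ]

# Bare column names plus data= let predict() look up Time_values in newdata
fit <- lm(Count ~ poly(Time_values, degree = 3), data = train_poly_data)
pred <- predict(fit, newdata = valid_poly_data)

length(pred)  # one prediction per validation row
```

Because the formula no longer hard-codes `train_poly_data$...`, `predict` evaluates `poly(Time_values, ...)` on the 50 validation rows rather than falling back to the training rows.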

The second warning will still be there with degree 20: the raw powers of Time_values up to the 20th are numerically almost collinear, so the fit is rank-deficient. Such a high-degree polynomial is also very likely to overfit your training data, so the model may not generalize or be useful.

To reduce the overfitting and eliminate the rank deficiency, fit a lower-degree polynomial instead, in which case both warnings should go away.
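To check whether a particular fit is rank-deficient before calling `predict`, one option is to compare the rank reported by `lm` with the number of coefficients. This is a rough sketch that assumes the same training layout as in the question (Time_values 1 to 150); the seed is an assumption added for reproducibility.

```r
# Assumption: training rows as in the question, with a fixed seed
set.seed(1)
train_poly_data <- data.frame(Time_values = 1:150, Count = runif(150) * 1000)

fit_raw <- lm(Count ~ poly(Time_values, degree = 20, raw = TRUE),
              data = train_poly_data)

n_coef <- length(coef(fit_raw))          # intercept + 20 powers
n_dropped <- sum(is.na(coef(fit_raw)))   # coefficients lm() could not estimate
is_rank_deficient <- fit_raw$rank < n_coef
```

Coefficients reported as `NA` in `summary(fit_raw)` correspond to the columns that `lm` dropped, which is what triggers the rank-deficiency warning in `predict`.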

Try this to get rid of both the warnings:

```
var_1 <- 1:252
var_2 <- runif(252) * 1000
my_new_df <- data.frame(var_1, var_2)
names(my_new_df) <- c("Time_values", "Count")
n <- 10 # lower-degree polynomial
# first generate all the polynomial regressors on the entire data;
# my_new_df[-1] drops Time_values and keeps Count as the first column
my_new_df <- cbind.data.frame(my_new_df[-1], poly(my_new_df$Time_values, degree = n, raw = TRUE))
# the polynomial columns are named "1", ..., "n"; rename them to X1, ..., Xn
names(my_new_df)[-1] <- paste0('X', names(my_new_df)[-1])
train_poly_data <- my_new_df[1:150, ] # training data set
valid_poly_data <- my_new_df[151:200, ] # validation data set
test_poly_data <- my_new_df[201:252, ] # test data set
# obtain a polynomial regression model of degree n
poly_tr <- lm(Count ~ ., train_poly_data)
summary(poly_tr)
pred <- predict(poly_tr, newdata = valid_poly_data)
pred
# first six values from one run (yours will differ because Count is random):
# 151 152 153 154 155 156
# 796.5672 982.6862 1219.7434 1517.9844 1889.2235 2347.0258
```
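As for finding a suitable degree, one common approach is to fit several candidate degrees on the training data and keep the one with the lowest error on the validation data. The following is a minimal sketch under stated assumptions: the candidate range 1 to 8, the fixed seed, the use of RMSE as the error measure, and the orthogonal (non-raw) `poly` basis are illustrative choices, not something the question prescribes.

```r
# Assumptions: fixed seed, candidate degrees 1..8, RMSE as the validation metric
set.seed(1)
my_new_df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)
train_poly_data <- my_new_df[1:150, ]
valid_poly_data <- my_new_df[151:200, ]

degrees <- 1:8
valid_rmse <- sapply(degrees, function(d) {
  # fit on the training rows only
  fit <- lm(Count ~ poly(Time_values, degree = d), data = train_poly_data)
  # score on the validation rows
  pred <- predict(fit, newdata = valid_poly_data)
  sqrt(mean((valid_poly_data$Count - pred)^2))
})

# degree with the smallest validation error
best_degree <- degrees[which.min(valid_rmse)]
```

Note that the validation rows here come after the training rows in time, so every prediction is an extrapolation, and high-degree polynomials tend to extrapolate poorly. The test set should only be used once, to evaluate the model refitted with `best_degree`.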