Q007 - 7 days ago 5
R Question

Error using Predict with Polynomial regressions in R

I have a data frame with two columns:

var_1<-1:252
var_2<-runif(252)*1000

my_new_df<-data.frame(var_1,var_2)
names(my_new_df)<-c("Time_values","Count")

train_poly_data<-my_new_df[1:150,c("Time_values","Count")] # training data set
valid_poly_data<-my_new_df[151:200,c("Time_values","Count")] # validation data set

test_poly_data<-my_new_df[201:252,c("Time_values","Count")] # test data set

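#obtain a polynomial regression model with 20 degrees
poly_tr<-lm(train_poly_data$Count ~ poly(train_poly_data$Time_values,degree=20,raw = TRUE))
summary(poly_tr)

#predict on the validation data (this call produces the warnings below)
predict(poly_tr, valid_poly_data)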

#getting the following warnings
Warning messages:
1: 'newdata' had 50 rows but variables found have 150 rows
2: In predict.lm(poly_tr, valid_poly_data) :
prediction from a rank-deficient fit may be misleading


Here is what I need to do:

Split the data frame into training, validation, and test data sets.
Fit a polynomial regression on the training data and validate it on the validation data.

But I keep getting the warnings above when I run predict on the validation data. How can I resolve this? I would also like to find the optimal degree of the polynomial, to check whether my arbitrarily chosen degree of 20 is reasonable.

Any suggestions or help to point out my mistake will be always welcome.

How do I fix this warning? I understand that it is raised because the training data set has 150 rows and the validation data set has 50.

Answer

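The first warning appears because the model formula refers to train_poly_data$Count and train_poly_data$Time_values directly, so predict cannot find matching columns in newdata and falls back to the 150 training rows. It will go away if you convert the validation data to the same format as the training data before you run predict, so that the training and validation data have exactly the same set of regressors (predictor variables).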
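As a minimal sketch of another way to keep the regressors consistent (it is not the approach used in the rest of this answer), you can refer to the columns by bare name in the formula and pass the data frame through the data argument, so that predict looks the same columns up in newdata. The names df, train, valid and fit, the seed, and the degree of 3 are assumptions made for illustration only:

```r
# Assumption: simulated data shaped like the data in the question
set.seed(1)
df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)

train <- df[1:150, ]    # training rows
valid <- df[151:200, ]  # validation rows

# Bare column names plus data = train, instead of train$Count ~ poly(train$Time_values, ...)
fit <- lm(Count ~ poly(Time_values, degree = 3, raw = TRUE), data = train)

# predict can now evaluate poly(Time_values, ...) on the validation rows
pred <- predict(fit, newdata = valid)
length(pred)
```

Because the formula no longer hard-codes the training data frame, predict returns one value per row of newdata.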

The second warning will remain: fitting such a high-degree polynomial gives a rank-deficient fit. It is also very likely to overfit the training data, so the model may not generalize or be useful.
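If you want to confirm the rank deficiency yourself, a small sketch like the following compares the rank that lm reports with the number of coefficients it was asked to estimate. The names x, y and fit20 and the seed are assumptions for illustration; the design (1:150 with a raw polynomial of degree 20) mirrors the question:

```r
# Assumption: same predictor values as the training set in the question
set.seed(1)
x <- 1:150
y <- runif(150) * 1000

fit20 <- lm(y ~ poly(x, degree = 20, raw = TRUE))

n_coef <- length(coef(fit20))     # intercept + 20 polynomial terms
fit20$rank                        # numerical rank of the design matrix found by lm
n_dropped <- sum(is.na(coef(fit20)))  # terms lm could not estimate are reported as NA
n_dropped
```

A rank smaller than the number of coefficients means that some of the raw powers are numerically indistinguishable from combinations of the others, which is what the warning from predict is pointing at.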

To reduce the overfitting and eliminate the rank deficiency, fit a lower-degree polynomial instead; in that case both warnings go away.

Try this to get rid of both the warnings:

my_new_df<-data.frame(var_1,var_2)
names(my_new_df)<-c("Time_values","Count") 

n <- 10 # lower degree polynomial
# first generate all the polynomial regressors on the entire data
my_new_df <- cbind.data.frame(my_new_df[-1], poly(my_new_df$Time_values, degree=n, raw=TRUE))
names(my_new_df)[-1] <- paste0('X', names(my_new_df)[-1])

train_poly_data<-my_new_df[1:150,] # training data set
valid_poly_data<-my_new_df[151:200,] # validation data set

test_poly_data<-my_new_df[201:252,] # test data set

#obtain a polynomial regression model of degree n
poly_tr<-lm(Count ~ ., train_poly_data)
summary(poly_tr)
pred <- predict(poly_tr, newdata=valid_poly_data)
pred


 # 151          152          153          154          155          156           
 # 796.5672     982.6862    1219.7434    1517.9844    1889.2235    2347.0258 
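On the question of choosing the degree: one common approach, sketched here under stated assumptions, is to fit several candidate degrees on the training data and keep the one with the lowest error on the validation data. The candidate range 1:8, the use of RMSE as the error measure, and the names df, train, valid, degrees, valid_rmse and best_degree are all assumptions for illustration. The sketch also uses the default orthogonal polynomials (raw = FALSE) rather than raw = TRUE as in the code above, to limit numerical ill-conditioning:

```r
# Assumption: simulated data shaped like the data in the question
set.seed(42)
df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)

train <- df[1:150, ]    # training rows
valid <- df[151:200, ]  # validation rows

degrees <- 1:8  # hypothetical range of candidate degrees

# Fit each degree on the training data and score it on the validation data
valid_rmse <- sapply(degrees, function(d) {
  fit <- lm(Count ~ poly(Time_values, degree = d), data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$Count - pred)^2))  # root mean squared error
})

best_degree <- degrees[which.min(valid_rmse)]
best_degree
```

Since Count in this simulated data is uniform noise with no relationship to Time_values, the degree selected here is not meaningful in itself; the sketch only shows the mechanics. Note also that the validation rows lie beyond the training range, so every prediction is an extrapolation, which high-degree polynomials handle poorly.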