I would like to fit a (very) high order regression to a set of data in R, however the
model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
Polynomials and orthogonal polynomials
poly(x) has no hard-coded limit for
degree. However, there are two numerical constraints in practice.
Basis functions are constructed on unique location of
x values. A polynomial of degree
k + 1 basis and coefficients.
poly generates basis without the intercept term, so
degree = k implies
k basis and
k coefficients. If there are
x values, it must be satisfied that
k <= n, otherwise there is simply insufficient information to construct a polynomial. Inside
poly(), the following line checks this condition:
if (degree >= length(unique(x))) stop("'degree' must be less than number of unique points")
x ^ k and
x ^ (k+1) is getting closer and closer to 1 as
k increases. Such approaching speed is of course dependent on
poly first generates ordinary polynomial basis, then performs QR factorization to find orthogonal span. If numerical rank-deficiency occurs between
x ^ k and
x ^ (k+1),
poly will also stop and complain:
if (QR$rank < degree) stop("'degree' must be less than number of unique points")
But the error message is not informative in this case. Furthermore, this does not have to be an error; it can be a warning then
poly can reset
rank to proceed. Maybe R core can improve on this bit??
Your trial-and-error shows that you can't construct a polynomial of more than 25 degree, but the reason can be both. If you really want to know which one it is, you need a debugger to step through
poly. But what I want to express, is that a polynomial of more than 3-5 degree is never useful! The critical reason is the Runge's phenomenon. In terms of statistical terminology: a high-order polynomial always badly overfits data!. Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No, a polynomial of degree
k forms a vector space, so whatever basis you use for representation, they have the same span!
Splines: piecewise cubic polynomials and its use in regression
Polynomial regression is indeed helpful, but we often want piecewise polynomials. The most popular choice is cubic spline. Like that there are different representation for polynomials, there are plenty of representation for splines:
B-spline basis is the most numerically stable, as it has compact support. As a result, the covariance matrix
X'X is banded, thus solving normal equations
(X'X) b = (X'y) are very stable.
In R, we can use
bs function from
splines package (one of R base packages) to get B-spline basis. For
bs(x), The only numerical constraint on degree of freedom
df is that we can't have more basis than
I am not sure of what your data look like, but perhaps you can try
library(splines) model <- lm(noisy.y ~ bs(q, df = 10))
Penalized smoothing / regression splines
Regression spline is still likely to overfit your data, if you keep increasing the degree of freedom. The best model seems to be about choosing the best degree of freedom.
A great approach is using penalized smoothing spline or penalized regression spline, so that model estimation and selection of degree of freedom (i.e., "smoothness") are integrated.
smooth.spline function in
stats package can do both. Unlike what its name seems to suggest, for most of time it is just fitting a penalized regression spline rather than smoothing spline. Read
?smooth.spline for more. For your data, you may try
fit <- smooth.spline(q, noisy.y)
smooth.spline has no formula interface.)
Additive penalized splines and Generalized Additive Models (GAM)
Once we have more than one covariates, we need additive models to overcome the "curse of dimensionality" while being sensible. Depending on representation of smooth functions, GAM can come in various forms. The most popular, in my opinion, is the
mgcv package, recommended by R.
You can still fit a univariate penalized regression spline with
library(mgcv) fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))