CenturionSC2 - 4 months ago 23

R Question

I would like to fit a (very) high order regression to a set of data in R, however the

`poly()`

For this application I need an order on the range of 100 to 120.

`model <- lm(noisy.y ~ poly(q,50))`

# Error in poly(q, 50) : 'degree' must be less than number of unique points

model <- lm(noisy.y ~ poly(q,30))

# Error in poly(q, 30) : 'degree' must be less than number of unique points

model <- lm(noisy.y ~ poly(q,25))

# OK

Answer

**Polynomials and orthogonal polynomials**

`poly(x)`

has no hard-coded limit for `degree`

. However, there are two numerical constraints in practice.

Basis functions are constructed on unique location of

`x`

values. A polynomial of degree`k`

has`k + 1`

basis and coefficients.`poly`

generates basis without the intercept term, so`degree = k`

implies`k`

basis and`k`

coefficients. If there are`n`

unique`x`

values, it must be satisfied that`k <= n`

, otherwise there is simply insufficient information to construct a polynomial. Inside`poly()`

, the following line checks this condition:`if (degree >= length(unique(x))) stop("'degree' must be less than number of unique points")`

Correlation between

`x ^ k`

and`x ^ (k+1)`

is getting closer and closer to 1 as`k`

increases. Such approaching speed is of course dependent on`x`

values.`poly`

first generates ordinary polynomial basis, then performs QR factorization to find orthogonal span. If numerical rank-deficiency occurs between`x ^ k`

and`x ^ (k+1)`

,`poly`

will also stop and complain:`if (QR$rank < degree) stop("'degree' must be less than number of unique points")`

But the error message is not informative in this case. Furthermore, this does not have to be an error; it can be a warning then

`poly`

can reset`degree`

to`rank`

to proceed. Maybe R core can improve on this bit??

Your trial-and-error shows that you can't construct a polynomial of more than 25 degree, but the reason can be both. If you really want to know which one it is, you need a debugger to step through `poly`

. But what I want to express, is that **a polynomial of more than 3-5 degree is never useful!** The critical reason is the Runge's phenomenon. In terms of statistical terminology: **a high-order polynomial always badly overfits data!**. Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No, **a polynomial of degree k forms a vector space, so whatever basis you use for representation, they have the same span!**

**Splines: piecewise cubic polynomials and its use in regression**

Polynomial regression is indeed helpful, but we often want piecewise polynomials. The most popular choice is cubic spline. Like that there are different representation for polynomials, there are plenty of representation for splines:

- truncated power basis
- natural cubic spline basis
- B-spline basis

B-spline basis is the most numerically stable, as it has compact support. As a result, the covariance matrix `X'X`

is banded, thus solving normal equations `(X'X) b = (X'y)`

are very stable.

In R, we can use `bs`

function from `splines`

package (one of R base packages) to get B-spline basis. For `bs(x)`

, The only numerical constraint on degree of freedom `df`

is that we can't have more basis than `length(unique(x))`

.

I am not sure of what your data look like, but perhaps you can try

```
library(splines)
model <- lm(noisy.y ~ bs(q, df = 10))
```

**Penalized smoothing / regression splines**

Regression spline is still likely to overfit your data, if you keep increasing the degree of freedom. The best model seems to be about choosing the best degree of freedom.

**A great approach is using penalized smoothing spline or penalized regression spline, so that model estimation and selection of degree of freedom (i.e., "smoothness") are integrated.**

The `smooth.spline`

function in `stats`

package can do both. Unlike what its name seems to suggest, for most of time it is just fitting a penalized regression spline rather than smoothing spline. Read `?smooth.spline`

for more. For your data, you may try

```
fit <- smooth.spline(q, noisy.y)
```

(Note, `smooth.spline`

has no formula interface.)

**Additive penalized splines and Generalized Additive Models (GAM)**

Once we have more than one covariates, we need additive models to overcome the "curse of dimensionality" while being sensible. Depending on representation of smooth functions, GAM can come in various forms. The most popular, in my opinion, is the `mgcv`

package, recommended by R.

You can still fit a univariate penalized regression spline with `mgcv`

:

```
library(mgcv)
fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))
```