noname - 4 months ago 37

R Question

I would like to smooth very long, noisy data, in R. But I have found that for highly periodic data, out-of-the-box

`smooth.spline()`

Consider a cosine time series (with or without noise)

`t <- seq(0,100*2*pi,length.out=3000)`

y <- cos(t)# + rnorm(length(t), 0,0.05)

y100_s <- smooth.spline(y)$y

plot( y~t, type="l" )

lines( y100_s~t, col="blue" )

We can examine the effect of adding more values to

`smooth.spline()`

`# rms increases as points are added to smooth.spline`

rms <- sapply( seq(250,3000,by=250), function(i)

sqrt( mean( (y[1:i] - smooth.spline(y[1:i])$y)^2 )) )

plot(rms)

Even at lower frequencies the fit is ringing (optional).

`t <- seq(0,50*2*pi,length.out=3000)`

y <- cos(t)# + rnorm(length(t), 0,0.05)

y50_s <- smooth.spline(y)$y

require(pracma)

peaks <- list(findpeaks(y50_s),findpeaks(-y50_s))

plot( y~t, type="l" )

lines( y50_s~t, col="red" )

lines( peaks[[1]][,1]~t[peaks[[1]][,2]], type="l" )

lines( -peaks[[2]][,1]~t[peaks[[2]][,2]], type="l" )

After exploring for a bit, this behaviour appears to be a function of the spar argument, but I can't set this to a small enough value to eliminate the effect. This might be an obvious consequence of spline fitting, and a fault of relying on out-of-the-box methods, but I would appreciate some insight. Are there controls I can specify in

`smooth.spline()`

Answer

I don't know whether you are always fitting a periodic signal. If that is the case, using periodic spline from `mgcv::gam`

is much better. However, let's forget about this issue for the moment.

**If your data have high, frequent oscillation, you have to choose sufficient number of knots, i.e., a decent density of knots, otherwise you just result in over-smoothing (i.e., under-fitting).**

Have a look at your example:

```
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) # + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y)
```

You have `n = 3000`

data points. By default, `smooth.spline`

uses much smaller number of knots than data when `n > 49`

. Precisely it is selected by a service routine `.nknots.smspl`

. But there is no optimality justification for this. So it is up to you to justify whether this is reasonable. Let's check:

```
length(fit$fit$nk) - 2L ## or `.nknots.smspl(3000)`
# [1] 194
fit$df
# [1] 194
```

It uses only 194 knots and model ends up with 194 degree of freedom without penalization effect. As I said earlier, you just end up with under-fitting:

```
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
```

Ideally, penalized regression ends up with a degree of freedom substantially smaller than number of knots. It is often forgotten that penalization is used to fix over-fitting problem resulting from the original non-penalized regression. If we don't even see the penalization effect, then the original non-penalized model is under-fitting data, so increase number of knots until we reach an over-fitting status. If you are lazy to think about this, set `all.knots = TRUE`

. Univariate smoothing spline is very computationally cheap at `O(n)`

costs. Even if you use all data as knots, you won't get into efficiency problem.

```
fit <- smooth.spline(t, y, all.knots = TRUE)
length(fit$fit$nk) - 2L
# [1] 3000
fit$df
# [1] 3000
```

Oh, we still did not see the effect of penalization, why? Because we don't have noisy data. You did not add noise to your `y`

, so by using all knots we are doing interpolation. Add some noise to `y`

to truly understand what I explain about penalization.

```
set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y, all.knots = TRUE)
length(fit$fit$nk)
# [1] 3002
fit$df
# [1] 705.0414
```

Note how much smaller 705 is compared with 3000. Have look at fitted spline?

```
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
```

There is neither under-fitting nor over-fitting; penalization results in optimal trade-off between bias and variance.