noname - 1 year ago 87
R Question

# `smooth.spline` severely underfits long (periodic) time series

I would like to smooth very long, noisy data, in R. But I have found that for highly periodic data, out-of-the-box

`smooth.spline()`
quickly breaks down and the smoothed data begins to exhibit ringing.

Consider a cosine time series (with or without noise)

``````t <- seq(0,100*2*pi,length.out=3000)
y <- cos(t)# + rnorm(length(t), 0,0.05)

y100_s <- smooth.spline(y)\$y

plot( y~t, type="l" )
lines( y100_s~t, col="blue" )
``````

We can examine the effect of adding more values to
`smooth.spline()`
,

``````# rms increases as points are added to smooth.spline
rms <- sapply( seq(250,3000,by=250), function(i)
sqrt( mean( (y[1:i] - smooth.spline(y[1:i])\$y)^2 )) )

plot(rms)
``````

Even at lower frequencies the fit is ringing (optional).

``````t <- seq(0,50*2*pi,length.out=3000)
y <- cos(t)# + rnorm(length(t), 0,0.05)

y50_s <- smooth.spline(y)\$y

require(pracma)

peaks <- list(findpeaks(y50_s),findpeaks(-y50_s))

plot( y~t, type="l" )
lines( y50_s~t, col="red" )

lines( peaks[[1]][,1]~t[peaks[[1]][,2]], type="l" )
lines( -peaks[[2]][,1]~t[peaks[[2]][,2]], type="l" )
``````

After exploring for a bit, this behaviour appears to be a function of the spar argument, but I can't set this to a small enough value to eliminate the effect. This might be an obvious consequence of spline fitting, and a fault of relying on out-of-the-box methods, but I would appreciate some insight. Are there controls I can specify in
`smooth.spline()`
, or alternative recommendations/strategies for smoothing?

I don't know whether you are always fitting a periodic signal. If that is the case, using periodic spline from `mgcv::gam` is much better. However, let's forget about this issue for the moment.

If your data have high, frequent oscillation, you have to choose sufficient number of knots, i.e., a decent density of knots, otherwise you just result in over-smoothing (i.e., under-fitting).

Have a look at your example:

``````t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) # + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y)
``````

You have `n = 3000` data points. By default, `smooth.spline` uses much smaller number of knots than data when `n > 49`. Precisely it is selected by a service routine `.nknots.smspl`. But there is no optimality justification for this. So it is up to you to justify whether this is reasonable. Let's check:

``````length(fit\$fit\$nk) - 2L  ## or `.nknots.smspl(3000)`
# [1] 194

fit\$df
# [1] 194
``````

It uses only 194 knots and model ends up with 194 degree of freedom without penalization effect. As I said earlier, you just end up with under-fitting:

``````plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
``````

Ideally, penalized regression ends up with a degree of freedom substantially smaller than number of knots. It is often forgotten that penalization is used to fix over-fitting problem resulting from the original non-penalized regression. If we don't even see the penalization effect, then the original non-penalized model is under-fitting data, so increase number of knots until we reach an over-fitting status. If you are lazy to think about this, set `all.knots = TRUE`. Univariate smoothing spline is very computationally cheap at `O(n)` costs. Even if you use all data as knots, you won't get into efficiency problem.

``````fit <- smooth.spline(t, y, all.knots = TRUE)

length(fit\$fit\$nk) - 2L
# [1] 3000

fit\$df
# [1] 3000
``````

Oh, we still did not see the effect of penalization, why? Because we don't have noisy data. You did not add noise to your `y`, so by using all knots we are doing interpolation. Add some noise to `y` to truly understand what I explain about penalization.

``````set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) + rnorm(length(t), 0, 0.05)

fit <- smooth.spline(t, y, all.knots = TRUE)

length(fit\$fit\$nk)
# [1] 3002

fit\$df
# [1] 705.0414
``````

Note how much smaller 705 is compared with 3000. Have look at fitted spline?

``````plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
``````

There is neither under-fitting nor over-fitting; penalization results in optimal trade-off between bias and variance.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download