Corey - 3 months ago
R Question

LOESS warnings/errors related to span in R

I am running a LOESS regression in R and have come across warnings with some of my smaller data sets.

Warning messages:

1: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  pseudoinverse used at -2703.9
2: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  neighborhood radius 796.09
3: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  reciprocal condition number  0
4: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric,  :
  There are other near singularities as well. 6.1623e+005


These warnings are discussed in another post here: Understanding loess errors in R.

It seems that these warnings are related to the span set for the LOESS regression. I am trying to apply the same methodology that was used with other data sets, where an acceptable smoothing span was between 0.3 and 0.6. In some cases I can adjust the span to avoid these issues, but for other data sets the span had to be increased beyond the acceptable range to avoid the errors/warnings.

I would like to know what specifically these warnings mean, and whether the regression is still usable (with a note that the warnings occurred) or completely invalid.

Here is an example of a data set that is having issues:

Period Value Total1 Total2
-2950 0.104938272 32.4 3.4
-2715 0.054347826 46 2.5
-2715 0.128378378 37 4.75
-2715 0.188679245 39.75 7.5
-3500 0.245014245 39 9.555555556
-3500 0.163120567 105.75 17.25
-3500 0.086956522 28.75 2.5
-4350 0.171038825 31.76666667 5.433333333
-3650 0.143798024 30.36666667 4.366666667
-4350 0.235588972 26.6 6.266666667
-3500 0.228840125 79.75 18.25
-4933 0.154931973 70 10.8452381
-4350 0.021428571 35 0.75
-3500 0.0625 28 1.75
-2715 0.160714286 28 4.5
-2715 0.110047847 52.25 5.75
-3500 0.176923077 32.5 5.75
-3500 0.226277372 34.25 7.75
-2715 0.132625995 188.5 25



Here is the code I am using:

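# Read the data (CSV with a header row) and plot the raw points
Analysis <- read.csv(file.choose(), header = TRUE)
plot(Value ~ Period, Analysis)
a <- order(Analysis$Period)  # plotting order along the Period axis
# Weighted LOESS fit (default span = 0.75) with standard errors
Analysis.lo <- loess(Value ~ Period, Analysis, weights = Total1)
pred <- predict(Analysis.lo, se = TRUE)
crit <- qt(0.975, pred$df)  # t quantile for a pointwise 95% band
lines(Analysis$Period[a], pred$fit[a], col = "red", lwd = 3)
lines(Analysis$Period[a], pred$fit[a] - crit * pred$se.fit[a], lty = 2)
lines(Analysis$Period[a], pred$fit[a] + crit * pred$se.fit[a], lty = 2)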


Thanks for your help, and please let me know if any additional information is necessary.

Additional information on jittering results:

[Image 1: without jittering]

[Image 2: with jittering]

Answer

The warnings are issued because the loess algorithm runs into numerical difficulties: Period has a few values that are each repeated a relatively large number of times, as you can see from your plot and also with:

table(Analysis$Period)
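
As a rough sketch of what that check shows, the snippet below counts the ties directly. The Period values are copied from the question's data; the variable names (tab, n_distinct, max_share) are only illustrative.

```r
# Period values copied from the question's data set (19 rows)
Period <- c(-2950, -2715, -2715, -2715, -3500, -3500, -3500, -4350, -3650,
            -4350, -3500, -4933, -4350, -3500, -2715, -2715, -3500, -3500,
            -2715)

tab <- table(Period)                    # rows sharing each Period value
n_distinct <- length(unique(Period))    # number of distinct x values
max_share <- max(tab) / length(Period)  # share of rows in the largest pile

print(tab)
cat("distinct values:", n_distinct, "of", length(Period), "rows\n")
cat("largest pile holds", round(100 * max_share), "% of the rows\n")
```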

With so few distinct values, Period behaves like a discrete variable (a factor) rather than the continuous one that proper smoothing requires. Adding some jitter removes the warnings:

Analysis <- read.table(header = TRUE, text="Period  Value   Total1  Total2
-2950   0.104938272 32.4    3.4
-2715   0.054347826 46  2.5
-2715   0.128378378 37  4.75
-2715   0.188679245 39.75   7.5
-3500   0.245014245 39  9.555555556
-3500   0.163120567 105.75  17.25
-3500   0.086956522 28.75   2.5
-4350   0.171038825 31.76666667 5.433333333
-3650   0.143798024 30.36666667 4.366666667
-4350   0.235588972 26.6    6.266666667
-3500   0.228840125 79.75   18.25
-4933   0.154931973 70  10.8452381
-4350   0.021428571 35  0.75
-3500   0.0625  28  1.75
-2715   0.160714286 28  4.5
-2715   0.110047847 52.25   5.75
-3500   0.176923077 32.5    5.75
-3500   0.226277372 34.25   7.75
-2715   0.132625995 188.5   25")

table(Analysis$Period)  # shows the repeated Period values
# Break the ties with a small random jitter (the result varies between runs)
Analysis$Period <- jitter(Analysis$Period, factor = 0.2)

plot(Value ~ Period, Analysis)
a <- order(Analysis$Period)
Analysis.lo <- loess(Value ~ Period, Analysis, weights = Total1)
pred <- predict(Analysis.lo, se = TRUE)
lines(Analysis$Period[a], pred$fit[a], col = "red", lwd = 3)
lines(Analysis$Period[a], pred$fit[a] - qt(0.975, pred$df) * pred$se.fit[a], lty = 2)
lines(Analysis$Period[a], pred$fit[a] + qt(0.975, pred$df) * pred$se.fit[a], lty = 2)

Increasing the span parameter widens each local neighborhood, which in effect "squashes out", along the Period axis, the piles of repeated values where they occur; with small data sets you need a lot of squashing to compensate for the piling up of repeated Period values.
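
One way to see how the warnings depend on span is to refit over a grid of spans and count the warnings each fit raises. This is only a sketch: the helper count_loess_warnings and the span grid are illustrative, not part of loess, and which spans warn depends on your data and R version.

```r
Analysis <- read.table(header = TRUE, text = "Period Value Total1 Total2
-2950 0.104938272 32.4 3.4
-2715 0.054347826 46 2.5
-2715 0.128378378 37 4.75
-2715 0.188679245 39.75 7.5
-3500 0.245014245 39 9.555555556
-3500 0.163120567 105.75 17.25
-3500 0.086956522 28.75 2.5
-4350 0.171038825 31.76666667 5.433333333
-3650 0.143798024 30.36666667 4.366666667
-4350 0.235588972 26.6 6.266666667
-3500 0.228840125 79.75 18.25
-4933 0.154931973 70 10.8452381
-4350 0.021428571 35 0.75
-3500 0.0625 28 1.75
-2715 0.160714286 28 4.5
-2715 0.110047847 52.25 5.75
-3500 0.176923077 32.5 5.75
-3500 0.226277372 34.25 7.75
-2715 0.132625995 188.5 25")

# Hypothetical helper: fit once, count warnings, and record outright failures
count_loess_warnings <- function(span, data) {
  n_warn <- 0L
  fit <- tryCatch(
    withCallingHandlers(
      loess(Value ~ Period, data, weights = Total1, span = span),
      warning = function(w) {
        n_warn <<- n_warn + 1L          # tally the warning
        invokeRestart("muffleWarning")  # and keep the fit running
      }
    ),
    error = function(e) NULL            # a failed fit is recorded, not fatal
  )
  c(span = span, warnings = n_warn, failed = is.null(fit))
}

spans <- c(0.3, 0.45, 0.6, 0.75, 0.9)   # illustrative grid
res <- do.call(rbind, lapply(spans, count_loess_warnings, data = Analysis))
print(res)
```

A table like this only tells you where the numerical trouble stops; it is not a criterion for choosing the span.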

From a practical viewpoint, I would generally still trust the regression, ideally after examining the graphical output. But I would definitely not increase span to achieve the squashing: a tiny amount of jitter is a much better tool for that purpose. The span should be dictated by other considerations, such as the overall spread of your Period data.
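
If you want explicit control over how "tiny" the jitter is, jitter() also accepts an amount argument, which bounds the perturbation in the units of Period. The sketch below assumes a bound of 5 units and a fixed seed; both are illustrative choices, not recommendations.

```r
set.seed(1)  # assumption: fixed seed so the jittered values are reproducible

# Period values copied from the question's data set
Period <- c(-2950, -2715, -2715, -2715, -3500, -3500, -3500, -4350, -3650,
            -4350, -3500, -4933, -4350, -3500, -2715, -2715, -3500, -3500,
            -2715)

amount <- 5  # assumption: 5 units, small next to the 2218-unit range of Period
Period_j <- jitter(Period, amount = amount)  # adds uniform noise in [-5, 5]

cat("range of Period:", diff(range(Period)), "\n")
cat("largest shift:  ", max(abs(Period_j - Period)), "\n")
cat("ties remaining: ", sum(duplicated(Period_j)), "\n")
```

Because the shift is bounded and small relative to the spacing between distinct Period values, the points stay in their original piles while the exact ties are broken.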
