M. Beausoleil M. Beausoleil -3 years ago 133
R Question

Why is the sum of the area under density curve always greater than 1 (R)?

I found codes to calculate the sum of the area under a density curve in R. Unfortunately, I don't understand why there is always an extra ~"0.000976" at the area...

nb.data = 500000
y = rnorm(nb.data,10,2)

de = density(y)

require(zoo)
sum(diff(de$x[order(de$x)])*rollmean(de$y[order(de$x)],2))

[1] 1.000976


Why is that so?

It should be equal to 1, right?

Answer Source

This discrepancy is not just due to rounding errors or floating-point arithmetic. You are using linear interpolation (a.k.a. the trapzoidal rule), which means that you are overestimating the area in areas of the density that are concave up and underestimating it in areas that are concave down. Here's an example image from the Wikipedia article demonstrating the systematic error:


Trapezoidal rule illustration

Image by Intégration_num_trapèzes.svg: Scalerderivative work: Cdang (talk) - Intégration_num_trapèzes.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8541370


Since the normal distribution has more concave up areas, the overall estimate is too high. As mentioned in another answer, using a higher resolution (i.e. increasing N) helps to minimize the error. You might also get better results using a different method for numerical integration (e.g. Simpson's rule).

However, there is no numerical integration method that is going to give you an exact answer, and even if there was, the return value of density is only an approximation of the real distribution anyway. (And for real data, the true distribution is unknown.)

If all you want is the satisfaction of seeing a known density function integrating to 1, you can use integrate on the normal density function:

> integrate(dnorm, lower=-Inf, upper=Inf, mean=10, sd=2)
1 with absolute error < 4.9e-06
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download