 M. Beausoleil -3 years ago 133
R Question

# Why is the sum of the area under density curve always greater than 1 (R)?

I found codes to calculate the sum of the area under a density curve in R. Unfortunately, I don't understand why there is always an extra ~"0.000976" at the area...

``````nb.data = 500000
y = rnorm(nb.data,10,2)

de = density(y)

require(zoo)
sum(diff(de\$x[order(de\$x)])*rollmean(de\$y[order(de\$x)],2))

 1.000976
``````

Why is that so?

It should be equal to 1, right? Ryan Thompson

This discrepancy is not just due to rounding errors or floating-point arithmetic. You are using linear interpolation (a.k.a. the trapzoidal rule), which means that you are overestimating the area in areas of the density that are concave up and underestimating it in areas that are concave down. Here's an example image from the Wikipedia article demonstrating the systematic error: Image by Intégration_num_trapèzes.svg: Scalerderivative work: Cdang (talk) - Intégration_num_trapèzes.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8541370

Since the normal distribution has more concave up areas, the overall estimate is too high. As mentioned in another answer, using a higher resolution (i.e. increasing N) helps to minimize the error. You might also get better results using a different method for numerical integration (e.g. Simpson's rule).

However, there is no numerical integration method that is going to give you an exact answer, and even if there was, the return value of `density` is only an approximation of the real distribution anyway. (And for real data, the true distribution is unknown.)

If all you want is the satisfaction of seeing a known density function integrating to 1, you can use `integrate` on the normal density function:

``````> integrate(dnorm, lower=-Inf, upper=Inf, mean=10, sd=2)
1 with absolute error < 4.9e-06
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download