Petr Shypila Petr Shypila - 6 months ago 30
SQL Question

SQL. Unexpected result of calling PERCENTILE_CONT function

Actually I have understanding of how PERCENTILE_CONT works, but I get different result by calculating percentiles by hand and by calling a function.
Here is the dataset:

305.7884804
350
373.3728865
384.2094838
410.8603441
414.9842786
455.3545205
550


To calculate 25% I sum up
350
and
373.3728865
and then divide it by 2. And the result is
361.68644325
.

To calculate 50% I sum up
384.2094838
and
410.8603441
and then divide it by 2. And the result is
397.53491395
.

To calculate 75% I sum up
414.9842786
and
455.3545205
and then divide it by 2. And the result is
435.16939955
.

However, when I run this sql:

select
percentile_cont(0.25) within group(order by YEAR_2_FTE ASC),
percentile_cont(0.5) within group(order by YEAR_2_FTE ASC),
percentile_cont(0.75) within group(order by YEAR_2_FTE ASC) from sr_database
where firm_id=999;


I get such result:

25%: 367.529664875

50%: 397.53491395

75%: 425.076839075

What am I missing?

Answer

You're correctly taking the point halfway between the 4th and 5th value to calculate the 50th percentile. But taking the halfway point between the 2nd and 3rd for the 25th percentile (and similarly for the 75th) isn't correct. This is best seen with an illustration:

enter image description here

This shows 8 points and 4 lines drawn between them to make 4 equally sized segements. The 25th and 75th percentile markers fall closer to the 3rd and 6th values respectively.

The Oracle documentation describes how these are calculated:

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

In your example, for the 25th percentile P = 0.25 and N = 8 so:

RN = (1+ 0.25*(8-1))
   = 2.75

So CRN = 3 and FRN = 2

The documentation goes on to say:

The final result will be:

If (CRN = FRN = RN) then the result is

(value of expression from row at RN)

Otherwise the result is

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

Plugging in the values above, the "Otherwise..." part applies for the calculation i.e:

Result = (3 - 2.75) * 350 + (2.75 - 2) * 373.3728856
       = 367.5296642

And similarly for the 75th percentile:

 RN = (1+ 0.75*(8-1))
    = 6.25
CRN = 7
FRN = 6
Result = (7 - 6.25) * 414.9842786 + (6.25 - 6) * 455.3545205
       = 425.076839075