Petr Shypila - 4 months ago 15x
SQL Question

# SQL. Unexpected result of calling PERCENTILE_CONT function

Actually I have understanding of how PERCENTILE_CONT works, but I get different result by calculating percentiles by hand and by calling a function.
Here is the dataset:

``````305.7884804
350
373.3728865
384.2094838
410.8603441
414.9842786
455.3545205
550
``````

To calculate 25% I sum up
`350`
and
`373.3728865`
and then divide it by 2. And the result is
`361.68644325`
.

To calculate 50% I sum up
`384.2094838`
and
`410.8603441`
and then divide it by 2. And the result is
`397.53491395`
.

To calculate 75% I sum up
`414.9842786`
and
`455.3545205`
and then divide it by 2. And the result is
`435.16939955`
.

However, when I run this sql:

``````select
percentile_cont(0.25) within group(order by YEAR_2_FTE ASC),
percentile_cont(0.5) within group(order by YEAR_2_FTE ASC),
percentile_cont(0.75) within group(order by YEAR_2_FTE ASC) from sr_database
where firm_id=999;
``````

I get such result:

25%: 367.529664875

50%: 397.53491395

75%: 425.076839075

What am I missing?

You're correctly taking the point halfway between the 4th and 5th value to calculate the 50th percentile. But taking the halfway point between the 2nd and 3rd for the 25th percentile (and similarly for the 75th) isn't correct. This is best seen with an illustration:

This shows 8 points and 4 lines drawn between them to make 4 equally sized segements. The 25th and 75th percentile markers fall closer to the 3rd and 6th values respectively.

The Oracle documentation describes how these are calculated:

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

In your example, for the 25th percentile `P = 0.25` and `N = 8` so:

``````RN = (1+ 0.25*(8-1))
= 2.75
``````

So `CRN = 3` and `FRN = 2`

The documentation goes on to say:

The final result will be:

If (CRN = FRN = RN) then the result is

``````(value of expression from row at RN)
``````

Otherwise the result is

``````(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)
``````

Plugging in the values above, the "Otherwise..." part applies for the calculation i.e:

``````Result = (3 - 2.75) * 350 + (2.75 - 2) * 373.3728856
= 367.5296642
``````

And similarly for the 75th percentile:

`````` RN = (1+ 0.75*(8-1))
= 6.25
CRN = 7
FRN = 6
Result = (7 - 6.25) * 414.9842786 + (6.25 - 6) * 455.3545205
= 425.076839075
``````