Petr Shypila - 8 months ago 41

SQL Question

Actually I have understanding of how PERCENTILE_CONT works, but I get different result by calculating percentiles by hand and by calling a function.

Here is the dataset:

`305.7884804`

350

373.3728865

384.2094838

410.8603441

414.9842786

455.3545205

550

To calculate 25% I sum up

`350`

`373.3728865`

`361.68644325`

To calculate 50% I sum up

`384.2094838`

`410.8603441`

`397.53491395`

To calculate 75% I sum up

`414.9842786`

`455.3545205`

`435.16939955`

However, when I run this sql:

`select`

percentile_cont(0.25) within group(order by YEAR_2_FTE ASC),

percentile_cont(0.5) within group(order by YEAR_2_FTE ASC),

percentile_cont(0.75) within group(order by YEAR_2_FTE ASC) from sr_database

where firm_id=999;

I get such result:

What am I missing?

Answer

You're correctly taking the point halfway between the 4th and 5th value to calculate the 50th percentile. But taking the halfway point between the 2nd and 3rd for the 25th percentile (and similarly for the 75th) isn't correct. This is best seen with an illustration:

This shows 8 points and 4 lines drawn between them to make 4 equally sized segements. The 25th and 75th percentile markers fall closer to the 3rd and 6th values respectively.

The Oracle documentation describes how these are calculated:

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

In your example, for the 25th percentile `P = 0.25`

and `N = 8`

so:

```
RN = (1+ 0.25*(8-1))
= 2.75
```

So `CRN = 3`

and `FRN = 2`

The documentation goes on to say:

The final result will be:

If (CRN = FRN = RN) then the result is

`(value of expression from row at RN)`

Otherwise the result is

`(CRN - RN) * (value of expression for row at FRN) + (RN - FRN) * (value of expression for row at CRN)`

Plugging in the values above, the "Otherwise..." part applies for the calculation i.e:

```
Result = (3 - 2.75) * 350 + (2.75 - 2) * 373.3728856
= 367.5296642
```

And similarly for the 75th percentile:

```
RN = (1+ 0.75*(8-1))
= 6.25
CRN = 7
FRN = 6
Result = (7 - 6.25) * 414.9842786 + (6.25 - 6) * 455.3545205
= 425.076839075
```