Manquer - 11 months ago 81

SQL Question

I need to run some statistical analysis on intervals, i.e. the difference between two datetime fields in a table.

According to the aggregate function documentation, the `interval` type is supported as an argument for aggregate functions such as:

`MAX`

`MIN`

`AVG`

However, for the more advanced statistical aggregate functions like

`stddev_pop`

`var_pop`

`var_samp`

`stddev_samp`

the `interval` type is not listed as a supported argument; the documentation only notes "(These are separated out merely to avoid cluttering the listing of more-commonly-used aggregates.)"

Is there an easy way to calculate these parameters? And why is the `interval` type not supported as an argument? Aren't these kinds of statistical aggregate functions unit-invariant?

P.S. I am not sure I can extract the epoch and use it, as some values may be negative.

Answer Source

As I said in a comment, to work out sample standard deviation manually, at some point you multiply an interval by an interval. PostgreSQL doesn't support that.

To work around that issue, reduce the interval to hours or minutes or seconds (or whatever). This turns out to be a lot simpler than working out the calculation manually, and it suggests why PostgreSQL doesn't support this kind of calculation out of the box.

First, a function from the PostgreSQL general mailing list:

```
-- Convert an interval to a number of seconds. Note: this ignores the
-- months and years fields of the interval, which have no fixed length
-- in seconds.
CREATE OR REPLACE FUNCTION interval_to_seconds(interval)
RETURNS double precision AS $$
  SELECT (extract(days from $1) * 86400)
       + (extract(hours from $1) * 3600)
       + (extract(minutes from $1) * 60)
       + extract(seconds from $1);
$$ LANGUAGE SQL;
```
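As a quick sanity check of the function (this example is mine, assuming the usual 86400 seconds per day):

```sql
-- 1 day + 2 hours = 86400 + 7200 seconds
select interval_to_seconds(interval '1 day 2 hours');
-- returns 93600
```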

Now we can take the standard deviation of a simple set of intervals.

```
with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'),
         (interval '4 hour'), (interval '5 hour')
),
intervals_as_seconds as (
  select interval_to_seconds(i) as seconds
  from intervals
)
select stddev(seconds) as in_sec, stddev(seconds) / 60 as in_min
from intervals_as_seconds;
```

Result: `in_sec = 5692.09978830308`, `in_min = 94.8683298050514` (both double precision).

You can verify the results however you like.
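For example, the sample standard deviation can be recomputed directly from its definition, sqrt(sum((x - mean)^2) / (n - 1)), as a cross-check on the built-in `stddev` (this query is mine, using the same five durations expressed in seconds):

```sql
with seconds (s) as (
  values (3600.0), (7200.0), (10800.0), (14400.0), (18000.0)
),
stats as (
  select avg(s) as mean, count(*) as n
  from seconds
)
select sqrt(sum((s - mean) ^ 2) / (n - 1)) as manual_stddev
from seconds cross join stats
group by n;
-- returns approximately 5692.0998, matching stddev(seconds) above
```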

Now let's say you wanted hour granularity instead of seconds. Clearly, the choice of granularity is highly application-dependent. You might define another function, `interval_to_hours(interval)`, and use a very similar query to calculate the standard deviation.
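The answer doesn't show `interval_to_hours` itself; here is one plausible sketch, assuming hours should be truncated toward zero (which is consistent with the 45-minute example later on):

```sql
-- Hypothetical definition: reuse interval_to_seconds and truncate
-- toward zero, so 45 minutes -> 0 hours.
CREATE OR REPLACE FUNCTION interval_to_hours(interval)
RETURNS double precision AS $$
  SELECT trunc(interval_to_seconds($1) / 3600);
$$ LANGUAGE SQL;
```

Whether to truncate, round, or keep the fractional part is exactly the kind of application-dependent choice the answer goes on to discuss.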

```
with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'),
         (interval '4 hour'), (interval '5 hour')
),
intervals_as_hours as (
  select interval_to_hours(i) as hours
  from intervals
)
select stddev(hours) as stddev_in_hrs
from intervals_as_hours;
```

Result: `stddev_in_hrs = 1.58113883008419` (double precision).

The value for standard deviation in hours is clearly different from the value in minutes or in seconds. But they measure exactly the same thing. The point is that the "right" answer depends on the granularity (units) you want to use, and there are a lot of choices. (From microseconds to centuries, I imagine.)

Also, consider this statement.

```
select interval_to_hours(interval '45 minutes')
```

Result: `interval_to_hours = 0` (double precision).

Is that the right answer? You can't say; the right answer is application-dependent. I can imagine applications that would want 45 minutes to be considered as 1 hour. I can also imagine applications that would want 45 minutes to be considered as 1 hour for *some* calculations, and as 0 hours for *other* calculations.

And think about this question: how many seconds are in a month?
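PostgreSQL's own `extract(epoch from ...)` has to pick an answer to that question; it assumes 30-day months and 365.25-day years:

```sql
select extract(epoch from interval '1 month');  -- 2592000  (30 * 86400)
select extract(epoch from interval '1 year');   -- 31557600 (365.25 * 86400)
```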

And I think *that's* why PostgreSQL doesn't support this kind of calculation out of the box. The right way to do it with interval arguments is too application-dependent.

**Later . . .**

I found this discussion on one of the PostgreSQL mailing lists.