Manquer - 1 month ago
SQL Question

Postgres: Statistical functions on date time intervals

I need to run some statistical analysis on intervals i.e. difference between two datetime fields in a table.

According to the aggregate function documentation, the general-purpose aggregate functions such as MAX, MIN, and AVG accept date/time and interval arguments.

However, the more advanced statistical functions such as stddev_pop, var_pop, var_samp, and stddev_samp seem to support only numeric or similar inputs, although the documentation suggests there is no difference between the two kinds of functions:


... (These are separated out merely to avoid cluttering the listing of
more-commonly-used aggregates.)...


Is there an easy way to calculate these statistics? Why is the interval type not supported as an argument? Aren't these statistical aggregate functions unit-invariant?

P.S. I am not sure I can extract epoch and use it, as some values may be negative.

Answer

As I said in a comment, to work out sample standard deviation manually, at some point you multiply an interval by an interval. PostgreSQL doesn't support that.

To work around that issue, reduce the interval to hours or minutes or seconds (or whatever). This turns out to be a lot simpler than working out the calculation manually, and it suggests why PostgreSQL doesn't support this kind of calculation out of the box.
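To make the limitation concrete, here is a small sketch you can run in psql. The inline VALUES list is made-up sample data. The first statement works because averaging only needs addition and division by a number; the second is expected to fail with an "operator does not exist" error, because squaring a deviation would need interval * interval.

```sql
-- avg() accepts intervals directly; this returns 02:00:00.
select avg(i)
from (values (interval '1 hour'), (interval '3 hour')) as t (i);

-- Expected to fail: there is no interval * interval operator,
-- which is what a variance or standard deviation would need.
select interval '1 hour' * interval '1 hour';
```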

First, a function from the PostgreSQL general mailing list:

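-- Converts the day, hour, minute, and second fields of an interval to seconds.
-- Note: month and year fields are ignored, so interval '1 month' yields 0.
CREATE OR REPLACE FUNCTION interval_to_seconds(interval)
RETURNS double precision AS $$
    SELECT (extract(days from $1) * 86400)
         + (extract(hours from $1) * 3600)
         + (extract(minutes from $1) * 60)
         + extract(seconds from $1);
$$ LANGUAGE SQL;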

Now we can take the standard deviation of a simple set of intervals.

with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'), 
         (interval '4 hour'), (interval '5 hour') 
)
, intervals_as_seconds as (
  select interval_to_seconds(i) as seconds
  from intervals
)
select stddev(seconds) as in_sec, stddev(seconds)/60 as in_min
from intervals_as_seconds;

in_sec             in_min
double precision   double precision
--
5692.09978830308   94.8683298050514

You can verify the results however you like.
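One possible cross-check, sketched here as an alternative rather than as part of the approach above, is the built-in extract(epoch from ...), which returns the total number of seconds in an interval. It also handles negative intervals, so the worry in the question's P.S. should not be a problem. The second VALUES list is made-up sample data.

```sql
-- Same five intervals as above, converted with the built-in epoch extraction.
-- The result should be about 5692.1, matching interval_to_seconds().
with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'),
         (interval '4 hour'), (interval '5 hour')
)
select stddev(extract(epoch from i)) as in_sec
from intervals;

-- Negative intervals simply become negative numbers of seconds,
-- which stddev() accepts like any other numeric input.
with intervals (i) as (
  values (interval '-2 hour'), (interval '-1 hour'),
         (interval '1 hour'), (interval '2 hour')
)
select stddev(extract(epoch from i)) as in_sec
from intervals;
```

Be aware that extract(epoch from ...) has its own conventions for months and years (it treats a month as 30 days), so it is not a drop-in replacement if your intervals carry those fields.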

Now let's say you wanted hour granularity instead of seconds. Clearly, the choice of granularity is highly application dependent. You might define another function, interval_to_hours(interval). You can use a very similar query to calculate the standard deviation.
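The definition of interval_to_hours isn't given here. A minimal sketch, assuming it follows the same pattern as interval_to_seconds and truncates to whole hours (which is consistent with the 45-minute result shown further down), might look like this. Treat it as a hypothetical definition, not the original one.

```sql
-- Hypothetical helper: whole hours in an interval, ignoring minutes,
-- seconds, months, and years. Partial hours are dropped, not rounded.
CREATE OR REPLACE FUNCTION interval_to_hours(interval)
RETURNS double precision AS $$
    SELECT ((extract(days from $1) * 24)
          + extract(hours from $1))::double precision;
$$ LANGUAGE SQL;
```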

with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'), 
         (interval '4 hour'), (interval '5 hour') 
)
, intervals_as_hours as (
  select interval_to_hours(i) as hours
  from intervals
)
select stddev(hours) as stddev_in_hrs
from intervals_as_hours;

stddev_in_hrs
double precision
--
1.58113883008419

The value for standard deviation in hours is clearly different from the value in minutes or in seconds. But they measure exactly the same thing. The point is that the "right" answer depends on the granularity (units) you want to use, and there are a lot of choices. (From microseconds to centuries, I imagine.)
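This also answers the "unit invariant" part of the question: these statistics are not unit-invariant, they scale with the unit. As a rough sketch using the same five values expressed in seconds, the standard deviation scales linearly with the conversion factor and the variance scales with its square.

```sql
-- 1 to 5 hours, written as seconds.
with t (seconds) as (
  values (3600.0), (7200.0), (10800.0), (14400.0), (18000.0)
)
select stddev_samp(seconds) / 3600           as stddev_in_hrs,  -- about 1.5811
       var_samp(seconds) / (3600.0 * 3600.0) as var_in_hrs_sq   -- 2.5
from t;
```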

Also, consider this statement.

select interval_to_hours(interval '45 minutes');

interval_to_hours
double precision
--
0

Is that the right answer? You can't say; the right answer is application-dependent. I can imagine applications that would want 45 minutes to be considered as 1 hour. I can also imagine applications that would want 45 minutes to be considered as 1 hour for some calculations, and as 0 hours for other calculations.
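To illustrate how much the answer depends on that choice, here is a small sketch that converts 45 minutes to fractional hours with the built-in extract(epoch from ...) and then applies three different rounding policies. Which one is "right" is up to the application.

```sql
-- 45 minutes is 0.75 hours before any rounding policy is applied.
select floor(h) as truncated,   -- 0
       round(h) as rounded,     -- 1
       ceil(h)  as rounded_up   -- 1
from (select extract(epoch from interval '45 minutes') / 3600.0 as h) as t;
```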

And think about this question. How many seconds are in a month?
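A short sketch shows why there is no single answer. PostgreSQL's extract(epoch from ...) assumes a 30-day month, while adding the same interval to an actual date gives a length that depends on the calendar. The dates are arbitrary examples.

```sql
-- Epoch extraction assumes 30 days per month: 2592000 seconds.
select extract(epoch from interval '1 month') as seconds_assumed;

-- The same interval applied to real dates covers different spans.
select (timestamp '2024-02-01' + interval '1 month')
       - timestamp '2024-02-01' as feb_length;   -- 29 days
select (timestamp '2024-03-01' + interval '1 month')
       - timestamp '2024-03-01' as mar_length;   -- 31 days
```

The interval_to_seconds function above takes yet another position: it ignores the month field entirely.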

And I think that's why PostgreSQL doesn't support this kind of calculation out of the box. The right way to do it with interval arguments is too application-dependent.

Later . . .

I found this discussion on one of the PostgreSQL mailing lists.

No stddev() for interval?