tilt tilt - 2 months ago 9
SQL Question

inconsistent behaviour of set-returning functions in sub-query with random()

I'm often using the WHERE clause

random() > 0.5
to pick a random subset of my data. Now I noticed that when using a set-returning function in a sub-query, I either get the whole set or none (meaning that the WHERE random() > 0.5 clause is interpreted before the set is being generated).
e.g.:

SELECT num
FROM (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num
) AS foo
WHERE random() > 0.5;


This seems inconsistent because the following query does take the whole set into account:

SELECT num
FROM (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num
) AS foo
WHERE random() > 0.1 * num;


Am I correct that this is inconsistent or does it make sense?

Notes:


  • couldn't find another function to test apart from random(), but likely there is some

  • I tested with generate_series as well


Answer

Indeed the postgres mailinglist gave a good response and it is likely a bug.

This is the answer, including workaround, from Tom Lane:


Hmm, I think this is an optimizer bug. There are two legitimate behaviors here:

SELECT * FROM unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should (and does) re-evaluate the WHERE for every row output by unnest().

SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should evaluate WHERE only once, since that happens before expansion of the set-returning function in the targetlist. (If you're an Oracle user and you imagine this query as having an implicit "FROM dual", the WHERE should be evaluated for the single row coming out of the FROM clause.)

In the case you've got here, given the placement of the WHERE in the outer query, you'd certainly expect it to be evaluated for each row coming out of the inner query. But the optimizer is deciding it can push the WHERE clause down to become a WHERE of the sub-select. That is legitimate in a lot of cases, but not when there are SRF(s) in the sub-select's targetlist, because that pushes the WHERE to occur before the SRF(s), analogously to the change between the two queries I wrote.

I'm a bit hesitant to change this in existing releases. Given the lack of previous complaints, it seems more likely to break queries that were behaving as-expected than to make people happy. But we could change it in v10 and up, especially since some other corner-case changes in SRF-in-tlist behavior are afoot.

In the meantime, you could force it to work as you wish by inserting the all-purpose optimization fence "OFFSET 0" in the sub-select:

=# SELECT num FROM (
    SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num OFFSET 0) AS foo WHERE random() > 0.5;
 num
-----
   1
   4
   7
   9
(4 rows)