dsd dsd - 2 months ago 8
SQL Question

Left Join Lateral and array aggregates

I'm using Postgres 9.3.

I have two tables

T1
and
T2
and a
n:m
relation
T1_T2_rel
between them. Now I'd like to create a view that in addition to the columns of T1 provides a column that, for each record in T1, contains an array with the primary key ids of all related records of T2. If there are no related entries in T2, corresponding fields of this column shall contain null-values.

An abstracted version of my schema would look like this:

CREATE TABLE T1 ( t1_id serial primary key, t1_data int );

CREATE TABLE T2 ( t2_id serial primary key );

CREATE TABLE T1_T2_rel (
t1_id int references T1( t1_id )
, t2_id int references T2( t2_id )
);


Corresponding sample data could be generated as follows:

INSERT INTO T1 (t1_data)
SELECT cast(random()*100 as int) FROM generate_series(0,9) c(i);

INSERT INTO T2 (t2_id) SELECT nextval('T2_t2_id_seq') FROM generate_series(0,99);

INSERT INTO T1_T2_rel
SELECT cast(random()*10 as int) % 10 + 1 as t1_id
, cast(random()*99+1 as int) as t2_id
FROM generate_series(0,99);


So far, I've come up with the following query:

SELECT T1.t1_id, T1.t1_data, agg
FROM T1
LEFT JOIN LATERAL (
SELECT t1_id, array_agg(t2_id) as agg
FROM T1_T2_rel
WHERE t1_id=T1.t1_id
GROUP BY t1_id
) as temp ON temp.t1_id=T1.t1_id;


This works. However, can it be simplified?

A corresponding fiddle can be found here: sql-fiddle. Unfortunately, sql-fiddle does not support Postgres 9.3 (yet) which is required for lateral joins.

[Update] As has been pointed out, a simple
left join
using a subquery in principle is enough. However, If I compare the query plans, Postgres resorts to sequential scans on the aggregated tables when using a
left join
whereas index scans are used in the case of the
left join lateral
.

Answer

As @Denis already commented: no need for LATERAL. Also, your subquery selected the wrong column. This works:

SELECT t1.t1_id, t1.t1_data, t2_ids
FROM   t1
LEFT   JOIN (
    SELECT t1_id, array_agg(t2_id) AS t2_ids
    FROM   t1_t2_rel
    GROUP  BY 1
    ) sub USING (t1_id);

-SQL fiddle.

Performance and testing

Concerning the ensuing sequential scan you mention: If you query the whole table, a sequential scan is often faster. Depends on the version you are running, your hardware, your settings and statistics of cardinalities and distribution of your data. Experiment with selective WHERE clauses like WHERE t1.t1_id < 1000 or WHERE t1.t1_id = 1000 and combine with planner settings to learn about choices:

SET enable_seqscan = off;
SET enable_indexscan = off;

To reset:

RESET enable_seqscan;
RESET enable_indexscan;

Only in your local session, mind you! This related answer on dba.SE has more instructions.
Of course, your setting may be off, too: