Paul Paul - 4 months ago 6
SQL Question

Return data aggregated by one column id and data-aligned to another column id

I have a large table with millions of rows like this:

CREATE TABLE mytable (
row_id bigint,
col_id bigint,
value double precision,
timestamp timestamp
);


Given:


  1. list_row
    = a list of
    row_id
    s (can be ordered if needed)

  2. list_col
    = a list of
    col_id
    s (again, can be ordered if needed)

  3. Both lists may be very large (maybe 10s of thousands)

  4. The table above may have many millions of entries



How do I (efficiently) return a resource where:


  1. Columns are all the
    col_id
    s present in
    list_col
    and occur in the same order that the
    col_id
    s occur in
    list_col

  2. Rows are all the
    row_id
    s present in
    list_row
    (they need not occur in the same order)

  3. Each field contains the
    value
    s of the given
    row_id
    and
    col_id
    s.

  4. We are only interested in the most recently recorded
    value
    s for any
    row_id:col_id
    pair i.e. use
    MAX(timestamp
    ) or something similar as a filter

  5. In the result, if there is no recorded
    value
    for a given
    row_id:col_id
    co-ordinate then that field should be
    null
    .



A visual example to clarify. The initial table:

+--------+--------+-------+-----------+
| row_id | col_id | value | timestamp |
+========+========+=======+===========+
| 10 | 20 | 100 | 2016-0... |
| 10 | 21 | 200 | 2015-0... |
| 11 | 20 | 300 | 2016-1... |
| 11 | 22 | 400 | 2016-0... |
+--------+--------+-------+-----------+


becomes:

col_id →
+-----------------+
| 20 | 21 | 22 |
+=====+=====+=====+
row_id (10) | 100 | 200 | |
↓ (11) | 300 | | 400 |
+-----+-----+-----+


I suspect that the correct answer is to start by creating a temporary table with the target
col_id
s as columns and then do some sort of join. I cannot work out how to do this efficiently. Is it possible to do this without needing a temporary table for each
row_id
?

Answer

crosstab() would work for regular queries:

But not for your case because of:

  1. Both lists may be very large (maybe 10s of thousands)

That's too many columns for Postgres. The manual:

There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600. However, defining a table with anywhere near this many columns is highly unusual and often a questionable design.

I suggest to return arrays instead. Something like:

SELECT row_id
     , array_agg(col_id) AS cols
     , array_agg(value)  AS vals
FROM  (
   SELECT DISTINCT ON (row_id, col_id)  --  most recent values for row_id:col_id pair 
          row_id, col_id, value
   FROM   mytable
   WHERE  row_id IN (<long list>)
   AND    col_id IN (<long list>)
   ORDER  BY row_id, col_id, timestamp DESC
   ) sub
GROUP   BY 1;

About DISTINCT ON: