Michael Michael - 2 months ago 6
SQL Question

Get 10 distinct projects with the latest updates in related tasks

I have two tables in a PostgreSQL 9.5 database:

project
- id
- name

task
- id
- project_id
- name
- updated_at


There are ~ 1000 projects (updated very rarely) and ~ 10 million tasks (updated very often).

I want to list those 10 distinct projects that have the latest task updates.

A basic query would be:

SELECT * FROM task ORDER BY updated_at DESC LIMIT 10;


However, there can be many updated tasks per project. So I won't get 10 unique projects.

If I try to add
DISTINCT(project_id)
somewhere in the query, I'm getting an error:


for SELECT DISTINCT, ORDER BY expressions must appear in select list


Problem is, I can't sort (primarily) by
project_id
, because I need to have tasks sorted by time. Sorting by
updated_at DESC, project_id ASC
doesn't work either, because several tasks of the same project can be among the latest.

I can't download all records because there are millions of them.

As a workaround I download 10x needed rows (without distinct) scope, and filter them in the backend. This works for most cases, but it's obviously not reliable: sometimes I don't get 10 unique projects.

Can this be solved efficiently in Postgres 9.5?

Example



id | name
----+-----------
1 | Project 1
2 | Project 2
3 | Project 3

id | project_id | name | updated_at
----+------------+--------+-----------------
1 | 1 | Task 1 | 13:12:43.361387
2 | 1 | Task 2 | 13:12:46.369279
3 | 2 | Task 3 | 13:12:54.680891
4 | 3 | Task 4 | 13:13:00.472579
5 | 3 | Task 5 | 13:13:04.384477


If I query:

SELECT project_id, updated_at FROM task ORDER BY updated_at DESC LIMIT 2


I get:

project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
3 | 13:13:00.472579


But I want to get 2 distinct projects with the respective latest
task.update_at
like this:

project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
2 | 13:12:54.680891 -- from Task 3

Answer

The simple (logically correct) solution is to aggregate tasks to get the latest update per project, and then pick the latest 10, like @Nemeros provided.

However, this incurs a sequential scan on task, which is undesirable (expensive) for big tables.

If you have relatively few projects (many task entries per project), there are faster alternatives using (bitmap) index scans.

SELECT *
FROM   project p
     , LATERAL (
   SELECT updated_at AS last_updated_at
   FROM   task
   WHERE  project_id = p.id
   ORDER  BY updated_at DESC
   LIMIT  1
   ) t
ORDER  BY t.last_updated_at
LIMIT  10;

Key to performance is a matching multicolumn index:

CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);

A setup with 1000 projects and 10 million tasks (like you commented) is a perfect candidate for this.

Background:

NULL and "no row"

Above solution assumes updated_at is defined NOT NULL. Else use ORDER BY updated_at DESCNULLS LAST and ideally make the index match.

Projects without any tasks are eliminated from the result by the implicit CROSS JOIN. NULL values cannot creep in this way. This is subtly different from correlated subqueries like @Nemeros added to his answer: those return NULL values for "no row" (project has no related tasks at all). The outer descending sort order then lists NULL on top unless instructed otherwise. Most probably not what you want.

Related: