Michael Michael - 2 months ago 11
SQL Question

PostgreSQL: distinct scope on joined tabled

I have a problem with building a SQL query. (PostgreSQL 9.5)

I have two tables:

project:
- id
- name

task:
- id
- project_id
- name
- updated_at


I want to list last 10 unique projects that have task updates.

Basic query would be

SELECT * FROM task ORDER BY updated_at DESC LIMIT 10


However it is not correct because you could have many tasks updated per project (so you would not have 10 unique projects).

If I try to add
DISTINCT(project_id)
somewhere in the query, I'm getting an error:

for SELECT DISTINCT, ORDER BY expressions must appear in select list


Problem is, I can't sort (primarily) by project id, because I need to have tasks sorted by time. I could sort by
updated_at DESC, project_id ASC
, but this doesn't satisfy PGSQL.

I can't download all records because there are millions of them.

As a workaround I download 10x needed rows (without distinct) scope, and filter them on the backend. This works for most cases, but is not correct and sometimes I don't get 10 unique projects.

Can this be solved efficiently in PGSQL 9.5?




Example data:

id | name
----+-----------
1 | Project 1
2 | Project 2
3 | Project 3

id | project_id | name | updated_at
----+------------+--------+-----------------
1 | 1 | Task 1 | 13:12:43.361387
2 | 1 | Task 2 | 13:12:46.369279
3 | 2 | Task 3 | 13:12:54.680891
4 | 3 | Task 4 | 13:13:00.472579
5 | 3 | Task 5 | 13:13:04.384477


If I query

SELECT project_id, updated_at FROM task ORDER BY updated_at DESC LIMIT 2


I'll get

project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
3 | 13:13:00.472579


I want to get 2 unique projects, and last
update_at
of relevant task, like this:

project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
2 | 13:12:54.680891 (from Task 3)

Answer

The simple (logically correct) solution is to aggregate tasks to get the latest update per project, and then pick the latest 10, like @Nemeros provided.

However, this incurs a sequential scan on task, which is undesirable (expensive) for big tables.

If you have relatively few projects (many task entries per project), there are faster alternatives using (bitmap) index scans.

SELECT *
FROM   project p
     , LATERAL (
   SELECT updated_at AS last_updated_at
   FROM   task
   WHERE  project_id = p.id
   ORDER  BY updated_at DESC
   LIMIT  1
   ) t
ORDER  BY t.last_updated_at
LIMIT  10;

Key to performance is a matching multicolumn index:

CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);

A setup with 1000 projects and 10 million tasks (like you commented) is a perfect candidate for this.

Assuming updated_at is defined NOT NULL - or make that ORDER BY updated_at DESC NULLS LAST and make the index match.

Background: