CaptainStiggz CaptainStiggz - 2 months ago 7
MySQL Question

mysql - Optimizing ORDER BY COALESCE on joined table column

EDITED: added full query by request.

In essence I have a table of posts linked one to many to a table of reposts, akin to Twitter. I want to load the posts ordered by the time of the repost (if present) or the time of the original post. However, the ordering process is very slow using a single query (probably do the the fact that COALESCE(x, y) doesn't make full use of MySQL indexes). The time column on both relevant tables is indexed.

My query looks something like this.

SELECT * FROM Post p LEFT JOIN p.reposts ON ... WHERE ...
ORDER BY COALESCE(r.time, p.time) LIMIT 0, 10


More precisely (pseudo-ish) since I'm using a DAL:

SELECT * FROM Post p LEFT JOIN p.reposts repost ON (p.id = repost.post_id AND
repost.time = (
SELECT MIN(r.time) FROM Repost r WHERE p.id = r.post_id
AND r.user_id IN (1, 2, 3...) AND r.user_id NOT IN (4, 5, 6...))
))
WHERE (repost IS NOT NULL OR p.author_id IN (1, 2, 3...))
AND p.author_id NOT IN (4, 5, 6...)
ORDER BY COALESCE(repost.time, p.time) LIMIT 0, 10


In the above, the ON clause ensures at most one repost (the one I want) is joined. COALESCE is necessary because r may be NULL if a post has not been reposted. The query behaves as expected - fast when ORDER BY clause is omitted, or used only on an indexed column like p.time. This is to be expected since the Post table is large 100k+ rows.

Query Explanation

EDIT: better explanation of what query should do. It's worth noting the logic here works - I get the data I want. The problem is that applying the ORDER BY clause causes the query to run about 50x slower because MySQL can't use the indexes with COALESCE on a joined table.


  • Load a list of 10 posts that are either authored by a set of users (followed) or reposted by the same set (followed), ordered by most recent.

  • Posts should be ordered by either the time of the post or the first repost.

  • Ignore posts and reposts by users in a different set (blocked)

  • Get posts: SELECT from posts

  • Get the earliest repost by a user in the followed set: LEFT JOIN ON... r.time = (SELECT MIN(r.time)...)

  • Filter out posts not authored or reposted by users in the followed set: WHERE (repost IS NOT NULL...)

  • Order be the first repost (if it exists) or the publication time: ORDER BY COALESCE(repost.time, p.time)

  • Load at most 10 posts: LIMIT 0, 10



UPDATE

I found that:

...ORDER BY repost.time DESC


Produces slow results as well unless I also add:

...WHERE repost.id IS NOT NULL...


In which case the query is fast. This leads me to believe that the real problem is sorting on nullable column indexes. I also tried:

... ORDER BY CASE WHEN repost.id IS NULL p.time ELSE repost.time END DESC


Which didn't help.

UPDATE 2

Due to the fact that MySQL uses b-trees for its indexes, it seems it'll be impossible to leverage the indexes in the way I want. Thus my current best idea is to treat each original post as a "repost" by its author, then perform my select and order on the repost table, e.g.

SELECT * FROM Repost r LEFT JOIN r.post ON ... WHERE ... ORDER BY r.time DESC

Answer

The problem here was as I described in update 2 of my question. MySQL uses indexes to perform ORDER BY operations quickly. More specifically, MySQL uses B-trees to index columns (such as timestamps - p.time/r.time), which use up a bit more space but allow for faster sorting.

The issue with my query was that it was sorting by the time column in two tables, using the timestamp from the repost table if available, and the post table otherwise. Since MySQL can't combine the B-trees from both tables, it can't perform fast index sorting on columns from two different tables.

I modified my query and table structure in two ways to solve this.

1) Perform filtering based on blocked users first, so ordering only has to be done on posts that are accessible by the current user. This was not the root of the problem, but is practical optimization. e.g.

SELECT * FROM (SELECT * FROM Post p WHERE p.author_id NOT IN (4, 5, 6...))...

2) Treat every post as a repost by its author, so every post is guaranteed to have a joinable repost and repost.time on which to index and sort. e.g.

SELECT * FROM (...) LEFT JOIN p.reposts repost ON (p.id = repost.post_id AND 
repost.time = (
  SELECT MIN(r.time) FROM Repost r WHERE p.id = r.post_id
  AND r.user_id IN (1, 2, 3...) AND r.user_id NOT IN (4, 5, 6...))
))
WHERE (repost.id IS NOT NULL) ORDER BY repost.time DESC LIMIT 0, 10

At the end of the day the issue came down to ORDER BY - this approach reduced the query time from about 8 seconds to 20 ms.

Comments