Tal Tal - 13 days ago 8
SQL Question

optimizing a nested query to avoid a tremendous performance hit

I'm breaking my head over the odd performance of the following SQL query (running on PostgreSQL):

WITH temp_song_id_with_all_stemmed_words AS (
    SELECT song_id
    FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai', 'troubl', 'seem', 'awai', 'believ')
    GROUP BY song_id
    HAVING COUNT(*) = 5
)
SELECT *
FROM words
WHERE song_id IN (
    SELECT song_id
    FROM temp_song_id_with_all_stemmed_words
)
ORDER BY song_id, global_position;


It takes about 10 seconds to run with the amount of data I put in the tables.
I tried various approaches to optimize this query:


  • putting the "with" clause inside the query itself

  • putting the "with" clause inside a temp table and then querying on it

  • indexing every possible column



But it was all to no avail. The run time stays in the range of about 10 seconds (assuming everything is already cached in memory; if it's not, it can take up to a minute).

Then I noticed that when I split the query into its component parts, things behave entirely differently:

SELECT song_id
FROM stemmed_words
WHERE stemmed_word IN ('yesterdai', 'troubl', 'seem', 'awai', 'believ')
GROUP BY song_id
HAVING COUNT(*) = 5


This query takes at most about 500 ms and returns 3 IDs.

When I plug these results into the enclosing query:

SELECT *
FROM words
WHERE song_id IN (337409, 328981, 304231)
ORDER BY song_id, global_position;


it takes about 30 ms to complete.

I have no idea what goes on under the hood here, but I would imagine a proper SQL optimizer would do exactly what I did above.

When I look at the EXPLAIN output, I see the following:

"Merge Join (cost=20253.29..706336.00 rows=6312654 width=21)"
" Merge Cond: (words.song_id = temp_song_id_with_all_stemmed_words.song_id)"
" CTE temp_song_id_with_all_stemmed_words"
" -> HashAggregate (cost=19799.62..19936.11 rows=13649 width=4)"
" Group Key: stemmed_words.song_id"
" Filter: (count(*) = 5)"
" -> Bitmap Heap Scan on stemmed_words (cost=474.02..19714.55 rows=17014 width=4)"
" Recheck Cond: (stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
" -> Bitmap Index Scan on stemmed_words_pkey (cost=0.00..469.76 rows=17014 width=0)"
" Index Cond: (stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
" -> Index Scan using words_song_id_global_position_idx on words (cost=0.44..653025.11 rows=12625308 width=21)"
" -> Sort (cost=316.75..317.25 rows=200 width=4)"
" Sort Key: temp_song_id_with_all_stemmed_words.song_id"
" -> HashAggregate (cost=307.10..309.10 rows=200 width=4)"
" Group Key: temp_song_id_with_all_stemmed_words.song_id"
" -> CTE Scan on temp_song_id_with_all_stemmed_words (cost=0.00..272.98 rows=13649 width=4)"


But honestly, I can't make much sense of it.

So, to summarize my question:

I have a feeling this can be optimized as a single query instead of being split into two separate ones.


  • Why does it take so long to complete in its current form?

  • How can I optimize it without splitting the query into two separate ones like I did above?


Answer

The problem here is that PostgreSQL cannot correctly estimate the number of rows the CTE (= WITH query) will return.

PostgreSQL estimates 13649 rows, while you tell us the correct number is 3. Based on that estimate, the planner chooses a merge join that walks the entire index on words (over 12 million rows, per the plan), instead of doing a handful of index probes for the 3 matching song IDs.
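You can make the misestimate visible with EXPLAIN ANALYZE, which actually runs the query and prints estimated row counts next to actual ones (unlike plain EXPLAIN, which only shows estimates). A sketch using your query:

```sql
EXPLAIN ANALYZE
WITH temp_song_id_with_all_stemmed_words AS (
    SELECT song_id
    FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai', 'troubl', 'seem', 'awai', 'believ')
    GROUP BY song_id
    HAVING COUNT(*) = 5
)
SELECT *
FROM words
WHERE song_id IN (SELECT song_id FROM temp_song_id_with_all_stemmed_words)
ORDER BY song_id, global_position;

-- On the HashAggregate node of the CTE, compare the planner's estimate
-- ("rows=13649") with the actual count ("actual ... rows=3").
```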

I'd expect good results from your second technique (putting the "with" clause into a temp table and then querying on it), as long as you run ANALYZE on the temporary table between the two steps, because then PostgreSQL knows exactly how many values it has to deal with.
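A minimal sketch of that approach, with table and column names taken from your question (the temp table name `temp_matching_songs` is just one I'm picking here):

```sql
-- Materialize the CTE into a temporary table.
CREATE TEMP TABLE temp_matching_songs AS
SELECT song_id
FROM stemmed_words
WHERE stemmed_word IN ('yesterdai', 'troubl', 'seem', 'awai', 'believ')
GROUP BY song_id
HAVING COUNT(*) = 5;

-- Collect statistics, so the planner knows the table holds only 3 rows.
ANALYZE temp_matching_songs;

-- With an accurate row count, the planner should prefer a few index
-- probes into "words" over a merge join across the whole table.
SELECT w.*
FROM words w
WHERE w.song_id IN (SELECT song_id FROM temp_matching_songs)
ORDER BY w.song_id, w.global_position;
```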