Alexey Alexey - 6 months ago 72
SQL Question

MySQL sorting with Using temporary; Using filesort

Here is the query I'm trying to launch:

SELECT c.creative_id, c.creative_title, c.creative_image_name, c.gravity, c.ad_strength
FROM creatives AS c
INNER JOIN term_relationships AS tr ON c.creative_id = tr.creative_id
WHERE tr.term_id
IN ( 14, 1, 50, 76, 104 )
GROUP BY c.creative_id
HAVING COUNT(tr.term_id ) =5
ORDER BY c.gravity ASC
LIMIT 30;


Here is what
EXPLAIN
for this query outputs:

enter image description here

Here is the
creatives
table structure:

CREATE TABLE `creatives` (
`creative_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`scraper_id` bigint(20) unsigned DEFAULT NULL,
`creative_title` varchar(255) NOT NULL,
`creative_image_name` varchar(255) DEFAULT NULL,
`image_attrib` varchar(12) DEFAULT NULL,
`original_image_name` varchar(255) DEFAULT NULL,
`creative_subtext` varchar(255) DEFAULT NULL,
`dest_url` varchar(2083) NOT NULL,
`lp_url` varchar(2083) NOT NULL,
`lp_image_name` varchar(255) DEFAULT NULL,
`lp_image_flag` tinyint(1) unsigned NOT NULL DEFAULT '0',
`creative_first_seen` date NOT NULL,
`creative_last_seen` date NOT NULL,
`daily_ad_count` int(5) unsigned NOT NULL,
`ad_strength` int(11) unsigned NOT NULL,
`prev_ad_strength` int(11) unsigned DEFAULT NULL,
`gravity` int(11) unsigned DEFAULT NULL,
PRIMARY KEY (`creative_id`),
KEY `gravity` (`gravity`)
) ENGINE=InnoDB AUTO_INCREMENT=173037591 DEFAULT CHARSET=utf8


I'm concerned about
Using temporary; using filesort
when launching both with
GROUP BY
and
ORDER BY
on another column. If I remove
ORDER BY
, the temporary and filesort go away and the query runs really fast.

What I don't understand, why mysql needs temporary table, why can't it first where filter + sort by
c.gravity
, then group by the resulting table and filter according to
HAVING
clause. The filtered table will be sorted by
c.gravity
correctly as the gravity value remains unchanged after the grouping and having filter.

What I tried:


  1. Selected everything without
    ORDER BY
    , wrapped into a subquery and joined again on
    creatives
    table - same result, using temporary, filesort and slow

  2. tried to add
    FORCE USE INDEX FOR ORDER BY (gravity)
    and it doesn't change anything.
    EXPLAIN
    and execution time remain the same.



UPDATE: the question has been answered by @Rick and it's really much faster with his correlated subquery and not using
GROUP BY
. I'm adding here an
EXPLAIN
output for the query:

enter image description here

And the output of
SHOW CREATE TABLE term_relationships
with the newly created index:

enter image description here

And one more question to @Rick: why do we need the outer query with
c3
? It seems just to join
creatives
on its own one more just to get the values from other columns and order the records by gravity. However, they are already sorted with the inner query and we can easily add missing columns in
c1
making it:

SELECT c1.creative_id,c1.creative_title,c1.creative_image_name,c1.gravity, c1.ad_strength
FROM creatives AS c1
WHERE
( SELECT COUNT(*)
FROM term_relationships
WHERE c1.creative_id = creative_id
AND term_id IN ( 14, 1, 50, 76, 104 )
) = 5
ORDER BY c1.gravity ASC
LIMIT 30;


Is my understanding correct or am I missing something in your query?

Answer

Temp table and filesort are not the villains, per se. It's how bulky they are.

This may look more complex, but it may be faster:

SELECT  c3.creative_id,
        c3.creative_title, c3.creative_image_name,
        c3.gravity, c3.ad_strength
    FROM  
      ( SELECT  creative_id
            FROM  creatives AS c1
            WHERE  
              ( SELECT  COUNT(*)
                    FROM  term_relationships
                    WHERE  c1.creative_id = creative_id
                      AND  term_id IN ( 14, 1, 50, 76, 104 )
              ) = 5 
            ORDER BY  c1.gravity ASC
            LIMIT  30
      ) AS c2
    JOIN  creatives c3 USING (creative_id)
    ORDER BY  c3.gravity 

If it happens to use INDEX(gravity) for the inner query, then it will stop after finding 30 rows that have all 5 transactions. If it generates a tmp table, it will be only 30 rows -- much better than with your original query. Note also, that the tmp table will be narrower -- only creative_id will be in it. Finally it reaches back into creatives to get the rest of the desired columns. Finally, there will be another sort, but with only 30 rows.

Furthermore, "filesort" is often a very fast sort in RAM, not really a "file" sort. I'm pretty sure my query will not be on disk.

term_relationships needs this composite index: INDEX(creative_id, term_id).