neverlastn neverlastn - 7 months ago 22
SQL Question

`MySQL GROUP BY is slower when using index

I'm running on an AWS m4.large (2 vCPUs, 8 GB ram) and I'm seeing a slightly surprising behaviour regarding MySQL and GROUPBY's. I have this test database:

CREATE TABLE demo (
time INT,
word VARCHAR(30),
count INT
);
CREATE INDEX timeword_idx ON demo(time, word);


I insert 4,000,000 records with (uniformly) random words
"t%s" % random.randint(0, 30000)
and times
random.randint(0, 86400)
.

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (1 min 28.29 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| 1 | SIMPLE | demo | index | NULL | timeword_idx | 38 | NULL | 4002267 | |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+


and then I don't use the index:

SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
3996922 rows in set (34.75 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | demo | ALL | NULL | NULL | NULL | NULL | 4002267 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+


As you can see by using the index the query takes 3 times more time. I'm not that surprised since by using the index the query might have to avoid reading the
time
and
word
columns but unfortunately by the index being so sparse it shouldn't gain much. On the contrary it turns a direct scan to a random access pattern when it comes to retrieving
count
.

I just would like to confirm that this is the reason and wonder if there's a "compact rule" on when and index will bring eventually worse performance when used for GROUP BY.

EDIT:

I followed Gordon Linoff's answer and used:

CREATE INDEX timeword_idx ON demo(time, word, count);


The "covering index" computes the results 10 times faster when compared with the full scan:

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (3.36 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | demo | index | NULL | timeword_idx | 43 | NULL | 4002267 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+


Very impressive!

Answer

You have a reasonably sized table, so the problem might be sequential access of the data or thrashing. Using the index requires going through the index and then looking up the data in the data pages to get the count.

This can actually be worse than just reading the pages and doing a sort, because the pages are not read in order. Sequential reads are considerably more optimized than random reads. In the worst case, the page cache is full and the random reads require flushing pages. If this happens, a single page might need to be read multiple times. With only 4 million relatively small rows, thrashing is unlikely unless you are severely memory constrained.

If this interpretation is correct, then including count in the index should speed the query:

CREATE INDEX timeword_idx ON demo(time, word, count);