Josh Josh - 7 months ago 9
SQL Question

Adding limit clause to MySQL query slows it down dramatically

I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to astronomical (42 minutes).

mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| 1002395119 | 2012-05-14 |
...
| 1002395157 | 2012-05-14 |
| 1002395187 | 2012-05-14 |
| 1002395475 | 2012-05-14 |
+------------+-------------+
105776 rows in set (2.19 sec)

mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id limit 1;
+------------+-------------+
| player_id | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14 |
+------------+-------------+
1 row in set (42 min 23.26 sec)

mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id | int(10) unsigned | NO | PRI | NULL | |
| insert_date | date | NO | PRI | NULL | |
| xp | int(10) unsigned | YES | | NULL | |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)


There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.

CREATE temporary TABLE date_curr
(
player_id INT UNSIGNED NOT NULL,
insert_date DATE,
PRIMARY KEY player_id (player_id, insert_date)
) ENGINE=MEMORY;
INSERT into date_curr
SELECT player_id,
MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;

CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT into date_prev
SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER join date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id limit 0,20000;


date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.

These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.

Here is the explain output:

mysql> explain SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | pr | range | PRIMARY,insert_date | insert_date | 3 | NULL | 396828 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 216825 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)


This is on a system with 24G RAM dedicated to the database, and currently is pretty much idle. This specific database is the test so it is completely static. I did a mysql restart and it still has the same behavior.

Here is the 'show profile all' output, with most time being spent on copying to tmp table.

| Status | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function | Source_file | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 | 0.383941 | 110240 | 18983 | 16160 | 448 | 0 | 0 | 0 | 43 | 0 | exec | sql_select.cc | 1976 |

Answer

A bit of a long answer but I hope you can learn something from this.

So based on the evidence in the explain statement you can see that there was two possible indexes that the MySQL query optimizer could have used they are as follows:

possible_keys
PRIMARY,insert_date 

However the MySQL query optimizer decided to use the following index:

key
insert_date

This is a rare occasion where MySQL query optimizer used the wrong index. Now there is a probable cause for this. You are working on a static development database. You probably restored this from production to do development against.

When the MySQL optimizer needs to make a decision on which index to use in a query it looks at the statistics around all the possible indexes. You can read more about statistics here http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html for a starter.

So when you update, insert and delete from a table you change the index statistics. It might be that the MySQL server because of the static data had the wrong statistics and chose the wrong index. This however is just a guess at this point as a possible root cause.

Now lets dive into the indexes. There was two possible indexes to use the primary key index and the index on insert_date. MySQL used the insert_date one. Remember during a query execution MySQL can only use one index always. Lets look at the difference between the primary key index and the insert_date index.

Simple fact about a primary key index(aka clustered):

  1. A primary key index is normally a btree structure that contains the data rows i.e. it is the table as it contains the date.

Simple fact about secondary index(aka non-clustered):

  1. A secondary index is normally a btree structure that contains the data being indexed(the columns in the index) and a pointer to the location of the data row on the primary key index.

This is a subtle but big difference.

Let me explain when you read a primary key index you are reading the table. The table is in order of the primary index as well. Thus to find a value I would search the index read the data which is 1 operation.

When you read a secondary index you search the index find the pointer then read the primary key index to find the data based on the pointer. This is essentially 2 operations making the operation of reading a secondary index twice as costly as reading the primary key index.

In your case since it chose the insert_date as the index to use it was doing double the work just to do the join. That is problem one.

Now when you LIMIT a recordset it is the last piece of execution of the query. MySQL has to take the entire recordset sort it (if not sorted allready) based on ORDER BY and GROUP BY conditions then take the number of records you want and send it back based on the LIMIT BY section. MySQL has to do a lot of work to keep track of records to send and where it is in the record set etc. LIMIT BY does have a performance hit but I suspect there might be a contributing factor read on.

Look at your GROUP BY it is by player_id. The index that is used is insert_date. GROUP BY essentially orders your record set, however since it had no index to use for ordering (remember a index is sorted in the order of the column(s) contained in it). Essentially you were asking sort/order on player_id and the index used was sorted on insert_date.

This step caused the filesort problem which essentially takes the data that is returned from reading the secondary index and primary index(remember the 2 operations) and then has to sort them. Sorting is normally done on disk as it is a very very expensive operation to do in memory. Thus the entire query result was written to disk and sorted painfully slow to get you your results.

By removing the insert_date index MySQL will now use the primary key index which means the data is ordered(ORDER BY/GROUP BY) player_id and insert_date. This will eliminate the need to read the secondary index and then use the pointer to read the primary key index i.e. the table, and since the data is already sorted MySQL has very little work when applying the GROUP BY piece of the query.

Now the following is a bit of a educated guess again if you could post the results of the explain statement after the index was dropped I would probably be able to confirm my thinking. So by using the wrong index the results were sorted on disk to apply the LIMIT BY properly. Removing the LIMIT BY allows MySQL to probably sort in Memory as it does not have to apply the LIMIT BY and keep track of what is being returned. The LIMIT BY probably caused the temporary table to be created. Once again difficult to say without seeing the difference between the statements i.e. output of explain.

Hopefully this gives you a better understanding of indexes and why they are a double edged sword.