fcoleads fcoleads - 1 month ago 7
MySQL Question

Optimizing large MySQL query (73MM rows) for most recent event group by

I'm trying to grab the most recent event for each 'lead'. I have created indexes and this query still will take well over 30 minutes.

SELECT l.id,
l.home_number,
l.mobile_number,
CASE WHEN l.soldprice < 2 THEN 0 ELSE 1 END as sold,
l.lead_date
FROM (
SELECT l.home_number, MAX(l.id) as id
FROM lead l
WHERE l.lead_date >= DATE_SUB(NOW(), INTERVAL 52 WEEK)
AND l.state NOT IN ('NY','AR','VT','WV','GA','CT','DC','SD')
GROUP BY l.home_number) a
JOIN lead l ON l.id=a.id;


My table indexes are below:

Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_typ
lead 0 PRIMARY 1 id A 63123648 NULL NULL BTREE
lead 1 id 1 id A 63266540 NULL NULL BTREE
lead 1 soldprice 1 soldprice A 14715 NULL NULL YES BTREE
lead 1 lead_date 1 lead_date A 15351477 NULL NULL YES BTREE


And my table schema:

CREATE TABLE lead
(
id BIGINT unsigned NOT NULL,
lead_date DATETIME NULL,
first_name VARCHAR(50) NULL,
last_name VARCHAR(50) NULL,
hashed_ssn VARCHAR(34) NULL,
city VARCHAR(50) NULL,
state VARCHAR(2) NULL,
home_number VARCHAR(10) NULL,
mobile_number VARCHAR(10) NULL,
email VARCHAR(255) NULL,
soldprice DECIMAL(5,2) NULL,
requested_amount INT NULL,
time_zone VARCHAR(5),
camp_id VARCHAR(9),
leadtype_id VARCHAR(3),
hittype_id VARCHAR(3),
PRIMARY KEY (id)
);


Any suggestions would be greatly appreciated.

EDIT: I'm using MySQL version 5.7.19-0ubuntu0.16.04.1

Answer Source

Tl;dr You need a compound (multi-column) index.

Pro tip: Don't create lots of single column indexes unless you know you need them. They rarely help much in complex queries, and they slow down insertions and updates.

You've done a good job of using a subquery to winnow down the id values for the rows you want to fetch. Still, surely most of the time is going into your subquery, this:

SELECT  l.home_number, MAX(l.id) as id
FROM lead l
WHERE l.lead_date >= DATE_SUB(NOW(), INTERVAL 52 WEEK)
AND l.state NOT IN ('NY','AR','VT','WV','GA','CT','DC','SD')
GROUP BY l.home_number

It's usually smart to debug subqueries, then join them into main queries.

The first thing to do is this: Create a compound index on (lead_date, home_number, id). Then run this simplified subquery, omitting the exclusion on states. This should be fast, because it can random access the dates, then use the index to handle the grouping, and a loose index scan to grab the max id values.

SELECT  l.home_number, MAX(l.id) as id
FROM lead l
WHERE l.lead_date >= DATE_SUB(NOW(), INTERVAL 52 WEEK)
GROUP BY l.home_number

Next, try creating a compound index on (lead_date, state, home_number, id) and try your original query. If it is reasonably fast, you're done. Your query will be much quicker. Drop the first compound index.

But it might not be, because MySQL doesn't really do well with NOT IN clauses in high volume.

In that case, keep the first compound index and drop the second one, and move your state exclusion to your outer query.

That will look like this:

SELECT  l.id,
        l.home_number,
        l.mobile_number,
        CASE WHEN l.soldprice < 2 THEN 0 ELSE 1 END as sold,
        l.lead_date
FROM (
    SELECT  l.home_number, MAX(l.id) as id
    FROM lead l
    WHERE l.lead_date >= DATE_SUB(NOW(), INTERVAL 52 WEEK)
    GROUP BY l.home_number) a 
JOIN lead l ON l.id=a.id
WHERE l.state NOT IN ('NY','AR','VT','WV','GA','CT','DC','SD')

That should help.

http://use-the-index-luke.com/ is a good reference for this kind of work.