Secto Kia - 3 months ago
MySQL Question

What indexes to create to speed up my heavy filter and grouping query?

I have a table like this, it has millions of records:

CREATE TABLE `myTable` (
  `DateTime` DATETIME NOT NULL,
  `Col1` MEDIUMINT UNSIGNED NOT NULL,
  `Col2` MEDIUMINT UNSIGNED NOT NULL,
  `Col3` MEDIUMINT UNSIGNED NOT NULL,
  `Col4` MEDIUMINT UNSIGNED NOT NULL,
  `Event` MEDIUMINT UNSIGNED NOT NULL,
  `State` MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (`DateTime`,`Col4`,`Event`,`State`)
);


I run a query that counts the number of records, grouped by a time period, over a time range, keeping only rows that match filters on the Col1/Col2/Col3/Col4 values. For example, with a 3-minute period:

select
  FROM_UNIXTIME(UNIX_TIMESTAMP(MIN(`DateTime`))-(UNIX_TIMESTAMP(MIN(`DateTime`)) % (3*60))) as 'Period',
  count(*) as 'NumberOfRecords'
from
  `myTable`
where
  `DateTime` > '2016-09-01' and `DateTime` < '2016-09-09'
  AND `Col1` IN (3, 6, 11, 14, etc... )
  AND `Col2` IN (5, 25, 325, 293, 294, etc... )
  AND `Col3` IN (3, 9, 95, 395, 435, etc... )
  AND `Col4` IN (124, 125, 135, 325, etc... )
group by
  UNIX_TIMESTAMP(`DateTime`) DIV (3*60);


What index should I have to speed up this query? I don't care how slow insertion gets, I want to have the query run extremely fast.

In general there are around 1,000 unique values for each of Col1, Col2, Col3, and Col4, but there are millions of records that fall within the date range.

I was thinking something like:

CREATE INDEX `myIndex` ON `myTable` (`DateTime`, `Col1`, `Col2`, `Col3`, `Col4`);


But I am not sure I have the ordering right. Or is it better to make 4 indexes, one for each (DateTime, ColX)?

Answer

This query is very hard to optimize given your table structure, because the WHERE clause contains range predicates on five different columns.

A range predicate includes operations >, <, <>, BETWEEN, LIKE, or IN(). Basically, any type of search other than =.

A range predicate potentially matches many values in the column.

An equality predicate matches exactly one value in the column (there may be many rows with that value, but it's one value).

When defining indexes, put the columns referenced in equality comparisons first, followed by at most one column referenced in a range predicate. Any columns in the index after the first one referenced in a range predicate cannot be used for the lookup.

For example, suppose you have an index on (col1, col2, col3) and the following conditions:

WHERE col1=123 AND col2 IN (4, 5, 6) AND col3=789

This query can make use of the first two columns of the index. The condition on col3 cannot use the index: the query examines all rows matched by the first two terms and evaluates the third term against those rows one by one.

Whereas the same index will use all three columns for the lookup given the following conditions:

WHERE col1=123 AND col2=789 AND col3 IN (4, 5, 6)

That is, equality predicates for the first two columns, and a range predicate for the last column in the index.
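As a sketch, the (col1, col2, col3) example above corresponds to something like the following (table and index names are hypothetical):

```sql
CREATE TABLE t (
  col1 INT UNSIGNED NOT NULL,
  col2 INT UNSIGNED NOT NULL,
  col3 INT UNSIGNED NOT NULL,
  INDEX ix_eq_first (col1, col2, col3)
);

-- Lookup uses only (col1, col2): the IN() on col2 is a range predicate,
-- so col3 cannot participate in the index lookup.
SELECT COUNT(*) FROM t WHERE col1 = 123 AND col2 IN (4, 5, 6) AND col3 = 789;

-- Lookup uses all three columns: equality on col1 and col2,
-- with the range predicate on the last indexed column.
SELECT COUNT(*) FROM t WHERE col1 = 123 AND col2 = 789 AND col3 IN (4, 5, 6);
```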

When you use EXPLAIN, the key_len column reports the number of bytes of the index entry used for the lookup. In the above example, suppose all three columns are 32-bit integer columns. EXPLAIN for the first query will report that it uses 8 bytes (two integers' worth), and EXPLAIN for the second query will report 12 bytes (three integers' worth).
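Continuing the (col1, col2, col3) example, with a hypothetical table t carrying that index, you could verify which prefix is used by checking key_len:

```sql
-- Each NOT NULL INT column used for the lookup adds 4 bytes to key_len.

EXPLAIN SELECT * FROM t
WHERE col1 = 123 AND col2 IN (4, 5, 6) AND col3 = 789;
-- key_len = 8: only (col1, col2) are used for the lookup

EXPLAIN SELECT * FROM t
WHERE col1 = 123 AND col2 = 789 AND col3 IN (4, 5, 6);
-- key_len = 12: all three columns are used
```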

In your case, you have range predicates on every filtered column in your conditions. This is not optimizable with a B-tree index; it can use an index for only one of the columns. So you might create five individual indexes, one per column, and hope that the optimizer picks the one that narrows down the search most effectively. Or you can use index hints to pick the best index yourself.
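Applied to the table in the question, that fallback might look like this (the index names are made up; use EXPLAIN to see which single index the optimizer chooses):

```sql
-- One single-column index per filtered column. An index on DateTime alone
-- may be redundant here, since the primary key already begins with DateTime.
CREATE INDEX ix_col1 ON `myTable` (`Col1`);
CREATE INDEX ix_col2 ON `myTable` (`Col2`);
CREATE INDEX ix_col3 ON `myTable` (`Col3`);
CREATE INDEX ix_col4 ON `myTable` (`Col4`);

-- If the optimizer chooses poorly, hint the most selective index yourself:
SELECT COUNT(*)
FROM `myTable` FORCE INDEX (ix_col2)
WHERE `DateTime` > '2016-09-01' AND `DateTime` < '2016-09-09'
  AND `Col2` IN (5, 25, 325);
```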

I wrote a presentation you may find interesting called How to Design Indexes, Really. Here's a recording of my presentation: https://www.youtube.com/watch?v=ELR7-RdU9XU

There are other types of indexes besides B-tree indexes. A range predicate search over multiple columns may require an R-tree index. So you may find that to really optimize this query, you need to load a copy of the data into Apache Solr or Crate or some other search engine like that.