Hubert Schölnast Hubert Schölnast - 13 days ago 11
MySQL Question

SQL: Select only 1st row from lexical ordered table

In a nutshell



How can you speed up this statement (running on a table with very much rows)?:

select * from mytable where val2=4 order by key1, key2, key3 limit 1;





In detail



This is my table (here displayed sorted lexically by its three keyfields) from which I want to select the one row that I have marked with an arrow. There are 3 fields in the primary index: key1, then key2, then key3.

Know that my real table has more columns and about 100,000 rows (and an index on column val2).

key1 | key2 | key3 | val1 | val2
-----+------+------+------+------
2 | 1 | 0 | 1 | 1
3 | 1 | 0 | 2 | 2
3 | 2 | 0 | 3 | 3
3 | 2 | 1 | 1 | 4 <==
4 | 1 | 0 | 2 | 5
4 | 2 | 0 | 3 | 1
4 | 2 | 1 | 1 | 2
4 | 3 | 0 | 2 | 3
4 | 3 | 1 | 3 | 4
4 | 3 | 2 | 1 | 5
5 | 1 | 0 | 2 | 1
5 | 2 | 0 | 3 | 2
5 | 2 | 1 | 1 | 3
5 | 3 | 0 | 2 | 4
5 | 3 | 1 | 3 | 5
5 | 3 | 2 | 1 | 1
5 | 4 | 0 | 2 | 2
5 | 4 | 1 | 3 | 3
5 | 4 | 2 | 1 | 4
5 | 4 | 3 | 2 | 5


This is the statement that exactly delivers the wanted row, and also explains what I want in detail:

select * from mytable where val2=4 order by key1, key2, key3 limit 1;


I want to do this (in sequential pseudocode):

1. Select all rows which have the value 4 in field val2.
2. Sort those rows by key1, then by key2, then by key3
3. Return only the first single row of this sorted set of rows


My select statement needs to read the whole table, and then has to sort a huge amount of rows before it can find the one row that I want.

I think this could be done quicker with nested subselects (i know this syntax is wrong, but I hope you understand what i want to do):

select * from mytable where key1+key2+key3 = (
select key1, key2, min(key3) from mytable where val2=4 and key1+key2 = (
select key1, min(key2) from mytable where val2=4 and key1 = (
select min(key1) from mytable where val2=4
)
)
)


But I don't know how to write this in a correct sql syntax, and I'm not sure if this really is a better way. I think, there must be an elegant solution using joins (joining a table with itself), but I can't find such an solution.

Can you help, please?




EDIT (after comments)



Ok, let's talk about my real table:

At the moment, there is only one row in this table, and it has not 3 but 2 key-fields. But this table will grow in an iterative process, where one row has to be selected using the statement we are discussing about now. This row will be processed, and as a result of this process, this row will be updated. Plus: Between 0 and 2 new rows will be inserted. Then it repeats: A new row will be selected, analyzed and updated, and again between 0 and 2 new rows will be inserted.

At the beginning this process will add lots of new rows, that need to be read later. At the end hopefully this process stops, because there are no more rows that match to the WHERE-clause. Then the remaining rows have to be analyzed.

So, this are the statements that create the table and insert the starting-row:

CREATE TABLE `numbers` (
`a0` int(10) UNSIGNED NOT NULL DEFAULT '0',
`b0` int(10) UNSIGNED NOT NULL DEFAULT '0',
`n` int(10) UNSIGNED NOT NULL DEFAULT '0',
`an` int(10) UNSIGNED NOT NULL DEFAULT '0',
`bn` int(10) UNSIGNED NOT NULL DEFAULT '0',
`m` double NOT NULL DEFAULT '0',
`gele` char(1) NOT NULL DEFAULT '?'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `numbers` (`a0`, `b0`, `n`, `an`, `bn`, `m`, `gele`) VALUES
(1, 0, 0, 0, 0, 0, '?');

ALTER TABLE `numbers`
ADD PRIMARY KEY (`a0`,`b0`),
ADD KEY `gele` (`gele`);


Here is my statement:

SELECT `a0`, `b0`, `n`, `an`, `bn`, `m`, `gele`
FROM `numbers`
WHERE `gele` = '?' OR `gele` = '='
ORDER BY `a0`, `b0`
LIMIT 1;


And this is the result of
EXPLAIN SELECT ....
:

id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | numbers | NULL | index | gele | PRIMARY | 8 | NULL | 1 | 100.00 | Using where


Since there is only 1 row in the table at the moment, the result of the explain statement is not very helpful, sorry.

But anyway: I want a more generic answer to this problem, because it occurs very often.

Answer

First of all, regardless of how the records are laid out on disk, you must use ORDER BY to guarantee the order of records from a SELECT. The Optimizer will (usually) notice the order of the records and can decide to do 'nothing' for the ORDER BY.

In InnoDB, records are arranged according to the PRIMARY KEY. So, given PRIMARY KEY (a0,b0) and ORDER BY a0, b0, the Optimizer may simply read the rows in order without having to do a sort.

But... If you have a WHERE clause that, say, says WHERE c0 > 3 and you have INDEX(c0, b0), the Optimizer is likely to use the index for filtering, then have to sort, even if you say ORDER BY a0, b0. This is likely to be faster than doing a table scan (to avoid the sort) and filter as it steps through all the rows (to perform the WHERE).

Your

  1. Select all rows which have the value 4 in field val2.
  2. Sort those rows by key1, then by key2, then by key3
  3. Return only the first single row of this sorted set of rows

is very simply, and very efficiently, done via

INDEX(val2, key1, key2, key3)

SELECT ...
    WHERE val2 = 4                -- filter column goes first
    ORDER BY key1, key2, key3     -- sort columns next
    LIMIT 1

It will read exactly one 'row' from that composite index, then look up the row in the data (using the PRIMARY KEY). Both are "point queries", using a BTree index. We are talking a few milliseconds, even if nothing is cached, regardless of table size.

See my cookbook on building indexes.

But your 'real' query is not the same pattern; it has an 'OR'

SELECT  `a0`, `b0`, `n`, `an`, `bn`, `m`, `gele`
    FROM  `numbers`
    WHERE  `gele` = '?'
       OR  `gele` = '='
    ORDER BY  `a0`, `b0`
    LIMIT  1;

INDEX(gele, a0, b0) is tempting, but it won't work. All the '?' values are nicely ordered according to a0, b0, and so are the '=' values. But you want both sets. This involves "merging" two sorted lists. The Optimizer has a way to do it, but it is rarely worth the effort. It turns out that there are two possibly 'best' indexes, and the Optimizer cannot always decide correctly between them:

INDEX(gele)  -- do all the filtering; sort later
INDEX(a0,b0) -- avoids sorting, but requires reading an indeterminate number of rows

Since the latter is your PK, and there is some advantage in using the PK, that is what the Optimizer picked. If no '?' nor '=' occurs until the 'last' row in the table, the query will read the entire table. :(

One trick that is sometimes worth doing is to turn OR into UNION:

    (  SELECT  `a0`, `b0`, `n`, `an`, `bn`, `m`, `gele`
            FROM  `numbers`
            WHERE  `gele` = '?'
            ORDER BY  `a0`, `b0`
            LIMIT  1 )            -- Step 1, below
UNION ALL
    (  SELECT  `a0`, `b0`, `n`, `an`, `bn`, `m`, `gele`
            FROM  `numbers`
            WHERE  `gele` = '='
            ORDER BY  `a0`, `b0`
            LIMIT  1 )            -- Step 2
ORDER BY  a0, b0 -- yes repeated  -- Step 3
LIMIT  1;                         -- Step 4

INDEX(gele, a0, b0)

This is guaranteed to be fast, but it has some overhead:

  1. Search for '?' -- find the row promptly. Write to tmp table.
  2. Search for '=' -- find the row promptly. Append to tmp table.
  3. Sort the tmp table.
  4. Peel off 1 row.

Yes, there is a 'temp' table and 'filesort', but with only 2 rows, it is very fast. This particular formulation works fast regardless of the table size.