SebastianRiemer - 27 days ago
MySQL Question

Get 2 rows for each distinct value in column

I have a table containing queue items which are periodically (every 500 ms) processed in batches of 10.

select * from tbl_queue order by prio asc;


----------------------------
client_id |object_key |prio
----------------------------
1 |4711 | 10
1 |2831 | 10
1 |1912 | 10
1 |1913 | 10
1 |1914 | 10
1 |1915 | 10
1 |1922 | 10
1 |1933 | 10
1 |1944 | 10
1 |1955 | 10
1 |1966 | 10
2 |7861 | 10
2 |1234 | 10
3 |5463 | 10
3 |5464 | 10
4 |7341 | 10
4 |7342 | 10
5 |9425 | 10
5 |9426 | 10
5 |9427 | 10


Each time I select from the table, I limit the number of rows I want to work on:

select * from tbl_queue order by prio asc limit 10;





----------------------------
client_id |object_key |prio
----------------------------
1 |4711 | 10
1 |2831 | 10
1 |1912 | 10
1 |1913 | 10
1 |1914 | 10
1 |1915 | 10
1 |1922 | 10
1 |1933 | 10
1 |1944 | 10
1 |1955 | 10


However, I would like to treat each client equally; thus my desired result would be:

----------------------------
client_id |object_key |prio
----------------------------
1 |1913 | 10
1 |1966 | 10
2 |7861 | 10
2 |1234 | 10
3 |5463 | 10
3 |5464 | 10
4 |7341 | 10
4 |7342 | 10
5 |9425 | 10
5 |9426 | 10


Note that I do not care which object_key is chosen. There are two important requirements:


  1. if items from different clients are present, items from each
    client must be selected (preferably in equal shares).

  2. the query must always return 10 rows,
    unless fewer than 10 rows exist.



The proposed solution must work in both MySQL and PostgreSQL, and it must not be too expensive to insert into and select from the table.
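For databases that do support window functions (PostgreSQL, and MySQL since 8.0), the fair round-robin pick can be expressed with ROW_NUMBER() partitioned by client. A minimal sketch of that idea, run against an in-memory SQLite database (3.25+) purely for illustration; the table name and sample data mirror the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl_queue (client_id int, object_key int, prio int)")
conn.executemany(
    "insert into tbl_queue values (?, ?, ?)",
    [(1, 4711, 10), (1, 2831, 10), (1, 1912, 10),
     (2, 7861, 10), (2, 1234, 10),
     (3, 5463, 10), (3, 5464, 10)],
)

# Number the rows per client, then take the lowest numbers first:
# every client contributes its first row before any client gets a second one.
rows = conn.execute("""
    select client_id, object_key, prio, rn
    from (
        select client_id, object_key, prio,
               row_number() over (partition by client_id
                                  order by object_key) as rn
        from tbl_queue
    ) t
    order by prio, rn
    limit 4
""").fetchall()
```

With equal priorities, the first three rows returned are one per client, and only then does a second row of any client appear.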

I've already thought of adding a column like sort_idx and inserting the per-client row count into each row, so I'd be able to do this:

--------------------------------------
client_id |object_key |prio| sort_idx
--------------------------------------
1 |4711 | 10 | 1
1 |2831 | 10 | 2
1 |1912 | 10 | 3
1 |1913 | 10 | 4
1 |1914 | 10 | 5
1 |1915 | 10 | 6
1 |1922 | 10 | 7
1 |1933 | 10 | 8
1 |1944 | 10 | 9
1 |1955 | 10 | 10
1 |1966 | 10 | 11
2 |7861 | 10 | 1
2 |1234 | 10 | 2
3 |5463 | 10 | 1
3 |5464 | 10 | 2
4 |7341 | 10 | 1
4 |7342 | 10 | 2
5 |9425 | 10 | 1
5 |9426 | 10 | 2
5 |9427 | 10 | 3

select * from tbl_queue order by prio, sort_idx asc limit 10;

--------------------------------------
client_id |object_key |prio| sort_idx
--------------------------------------
1 |4711 | 10 | 1
2 |7861 | 10 | 1
3 |5463 | 10 | 1
4 |7341 | 10 | 1
5 |9425 | 10 | 1
1 |2831 | 10 | 2
2 |1234 | 10 | 2
3 |5464 | 10 | 2
4 |7342 | 10 | 2
5 |9426 | 10 | 2


However, I am not too confident about efficiently calculating a sort index on each insert. There might be situations where 20,000 rows are inserted at once (or in rapid succession).

I am not looking for solutions that introduce variables or functions, nor do I want to introduce database triggers; just plain SQL.

EDIT 13.09.2017 17:13: If using user-defined variables is possible within the context of Spring's JdbcTemplate, I guess it'd be an acceptable solution.

However, I'd like to keep things as simple as possible on the database layer (i.e. not use database-specific/exclusive commands).

EDIT 26.09.2017 11:18
So I've tested the suggested solution against a real-life data set with about 40,000 records, and I had to find out that it does not perform well. In fact, I canceled query execution after about a minute.

An explain on this query gives the following output:

explain select c1.client_id,
  c1.object_key,
  c1.prio,
  count(*)-1 as sort_idx
from clients as c1
  left join clients as c2
    on c1.client_id = c2.client_id
      and c1.object_key >= c2.object_key
      and c1.prio >= c2.prio
group by c1.client_id,
        c1.object_key,
        c1.prio
order by
  c1.prio,
  sort_idx
limit 10;



Limit  (cost=512.53..512.56 rows=10 width=12)
  ->  Sort  (cost=512.53..513.03 rows=200 width=12)
        Sort Key: c1.prio, ((count(*) - 1))
        ->  HashAggregate  (cost=505.71..508.21 rows=200 width=12)
              Group Key: c1.client_id, c1.object_key, c1.prio
              ->  Nested Loop Left Join  (cost=0.15..484.80 rows=2091 width=12)
                    ->  Seq Scan on clients c1  (cost=0.00..29.40 rows=1940 width=12)
                    ->  Index Only Scan using clients_idx on clients c2  (cost=0.15..0.22 rows=1 width=12)
                          Index Cond: ((client_id = c1.client_id) AND (object_key <= c1.object_key) AND (prio <= c1.prio))


I'm not an expert on interpreting explain plans, but whenever the cost number is high or I see a nested loop join, I sense danger.

The statement using the DENSE_RANK function works a lot faster; unfortunately, I need to support both MySQL and PostgreSQL.

Answer Source

Usually (MSSQL, Oracle, PostgreSQL, etc.) this problem can be solved with the help of the DENSE_RANK function. But since MySQL (prior to 8.0) doesn't have window functions, you can do something like this:

select c1.client_id,
  c1.object_key,
  c1.prio,
  count(*)-1 as sort_idx
from clients as c1
  left join clients as c2
    on c1.client_id = c2.client_id
      and c1.object_key >= c2.object_key
      and c1.prio >= c2.prio
group by c1.client_id,
        c1.object_key,
        c1.prio
order by
  c1.prio,
  sort_idx
limit 10;

I've made a fiddle; you're welcome to test it.
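The self-join emulation can be sanity-checked on a small sample. The sketch below runs the answer's query against an in-memory SQLite database (table and column names as in the answer, data trimmed to two clients); SQLite stands in here only because it is easy to embed in a test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table clients (client_id int, object_key int, prio int)")
conn.executemany(
    "insert into clients values (?, ?, ?)",
    [(1, 4711, 10), (1, 2831, 10), (1, 1912, 10),
     (2, 7861, 10), (2, 1234, 10)],
)

# count(*)-1 over the self-join counts, for each row, how many rows of the
# same client sort at-or-before it -- i.e. a zero-based per-client rank.
rows = conn.execute("""
    select c1.client_id, c1.object_key, c1.prio, count(*) - 1 as sort_idx
    from clients as c1
      left join clients as c2
        on c1.client_id = c2.client_id
       and c1.object_key >= c2.object_key
       and c1.prio >= c2.prio
    group by c1.client_id, c1.object_key, c1.prio
    order by c1.prio, sort_idx
    limit 4
""").fetchall()
```

Note the quadratic per-client cost: every row is joined against all rows of the same client, which is exactly why the approach broke down on the 40,000-record data set mentioned in the question's edit.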

UPDATE: Also made a solution for PostgreSQL with DENSE_RANK, in case someone else isn't restricted by MySQL's limitations:

select c.client_id,
  c.object_key,
  c.prio,
  DENSE_RANK() OVER   
    (PARTITION BY c.client_id ORDER BY c.object_key) as sort_idx
from clients as c
order by
  c.prio,
  sort_idx
limit 10;

And the fiddle
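The DENSE_RANK variant can be exercised the same way. SQLite (3.25+) also ships window functions, so the following sketch runs essentially the answer's PostgreSQL query against the same trimmed sample (SQLite is an assumption here, used only as a convenient stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table clients (client_id int, object_key int, prio int)")
conn.executemany(
    "insert into clients values (?, ?, ?)",
    [(1, 4711, 10), (1, 2831, 10), (1, 1912, 10),
     (2, 7861, 10), (2, 1234, 10)],
)

# dense_rank() numbers each client's rows 1, 2, 3, ... by object_key;
# ordering by (prio, sort_idx) then interleaves the clients fairly.
rows = conn.execute("""
    select c.client_id,
           c.object_key,
           c.prio,
           dense_rank() over
             (partition by c.client_id order by c.object_key) as sort_idx
    from clients as c
    order by c.prio, sort_idx
    limit 4
""").fetchall()
```

Unlike the self-join, this is a single pass plus a sort, which matches the question's observation that the DENSE_RANK statement "works a lot faster".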