John Robertson John Robertson - 2 months ago 18
SQL Question

Random table sampling query in SQLite

I can write your typical "SELECT A,C FROM B WHERE B.id = 10" query, but have otherwise made very little use of SQL, and what I need to do with this query is well beyond my SQL experience at this point.

I have a table T with primary key Tid and a special column S in SQLite. I want to create another table which is a single column made up of values of Tid. I need my new table to have 1000 random values of Tid for EACH possible value of S.

So if I was to group T by S then my new table would represent 1000 random principle keys from each group.

What SQLlite query will create such a table?

Note: Table T is huge, so better performing queries are better.

Answer

Update: I came up with a simpler solution than my first one.

First, we create a table to put the result in:

CREATE TABLE result(tid integer primary key, s);

At the end of this process, result will contain 1000 rows from each group of each unique s value from the original table t.

The basic idea is to create a "magic" special view with one column s, that has the property that each time we try to insert any value of s into it, instead, 1000 random rows from t that have the same value for s will be selected and inserted into result. We can do that with a trigger:

CREATE TEMP VIEW result_view AS SELECT null AS s WHERE 0;

CREATE TEMP TRIGGER result_ins_trigger
INSTEAD OF INSERT ON result_view FOR EACH ROW
BEGIN
    INSERT INTO result(tid, s)
    SELECT * FROM t WHERE t.s=NEW.s ORDER BY random() LIMIT 1000;
END;

Now that we have that view, we try to insert all distinct values of s into it and the trigger will do the rest:

INSERT INTO result_view SELECT DISTINCT s FROM t;

Note that the process will be significantly faster if you have an index on t(s), so might want to create that first (and then drop it if you don't need it).

Essentially, the above solution is a poor man's implementation of LATERAL subqueries, using triggers.


Original answer: Here's a solution that uses a temporary trigger. First, we create a table to put the result in:

CREATE TABLE result(tid integer primary key, s, num NOT NULL);

At the end of this process, result will contain 1000 rows from each group of each unique s value from the original table t. The table also has an auxiliary column called num. Then we do the following:

CREATE INDEX result_s_idx ON result(s, num);

CREATE TEMP VIEW result_view AS SELECT null AS TID, null AS s, NULL AS num WHERE 0;

CREATE TEMP TRIGGER result_ins_trigger
INSTEAD OF INSERT ON result_view FOR EACH ROW
BEGIN
    INSERT INTO result(tid, s, num)
    SELECT NEW.tid, NEW.s,
           COALESCE((SELECT max(num) FROM result WHERE s=NEW.s),0)+1 AS num
    WHERE num<=1000;
END;

INSERT INTO result_view(tid, s, num)
SELECT tid,s,null FROM t ORDER BY random();

DROP INDEX result_s_idx;

The idea is to try insert everything from t into result, in a random order, assigning to each row a number (stored in the column num) that indicates what is the position (in the random order) of that element within that group. So e.g. if num is 11, then that means that row is the 11th row of that group. Additionally, if the number becomes larger than 1000 then we discard the row, because we only need 1000 rows per group. In order to impelemt the previous logic, we use a trigger so that for each row that we try to insert, it runs a subquery to calculate the right value of num and discards the row if num is greater than 1000. Because sqlite does not allow INSTEAD OF triggers on regular tables, we create that trigger on a dummy temporary view. Finally, we insert all the row from t into the dummy view (so that the trigger kicks in) in a random order.

Essentially, the above solution is a poor man's implementation of the row_number window function, using triggers.

One last issue, is the auxiliary column num of result. If you are bothered by it, it can be removed by creating a new table without it and copying all data from result.