Anthony Geoghegan - 1 year ago
MySQL Question

Query to check for commonality (Intersection) between each possible pair combination

I wrote a program to generate tests composed of a combination of questions taken from a large pool of questions. There were a number of criteria for each test, and the program saved a test to the database only if it satisfied these criteria.

My program was written to ensure as even a distribution of questions as possible, i.e., when generating combinations of questions, the algorithm prioritises questions from the pool that have been asked the fewest times in previous iterations.

I created one table, tests, to essentially store the test_id for each test and another, test_questions, to store test_ids and their corresponding question_ids using n rows per test (where n is the number of questions in each test).
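For readers following along, the two tables described above might be declared roughly as follows. This is only a sketch: the column types, the composite primary key and the foreign key are assumptions, not the original DDL.

```sql
-- Hypothetical schema: types and keys are assumptions, not the original DDL
CREATE TABLE tests (
    test_id INT NOT NULL PRIMARY KEY
);

CREATE TABLE test_questions (
    test_id     INT NOT NULL,
    question_id INT NOT NULL,
    -- One row per (test, question) pair, so a test with n questions has n rows
    PRIMARY KEY (test_id, question_id),
    FOREIGN KEY (test_id) REFERENCES tests (test_id)
);
```

The composite key is a design choice for the sketch: it guarantees a question appears at most once per test, so counting matching rows gives the number of shared questions.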

Now that I have the tests stored in a database, I’d like to check that the overlap of questions between different pairs of tests is within certain bounds, and I thought I should be able to do this using SQL.

Using a self-join, I was able to use this query to count the questions common to Test 3 and Test 5:

-- Get the number of questions that are common to tests 3 and 5
SELECT count(tq1.question_id) AS Overlap
FROM test_questions AS tq1
JOIN test_questions AS tq2
ON tq1.question_id = tq2.question_id
WHERE tq1.test_id = 5
AND tq2.test_id = 3;

I was able to generate each possible combination of test pairs from the first n (5) tests:

-- Get all combinations of pairs of tests from 1 to 5
SELECT t1.test_id AS Test1, t2.test_id AS Test2
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
WHERE t1.test_id <= 5
AND t2.test_id <= 5;

What I’d like to do, but have so far failed to do, is combine the above two queries to show each possible pair of the first 5 tests, along with the number of questions that are common to both tests.

-- This doesn't work
SELECT t1.test_id AS Test1, t2.test_id AS Test2, count(tq1.question_id) AS Overlap
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
JOIN test_questions AS tq1
ON t1.test_id = tq1.test_id
JOIN test_questions AS tq2
ON t2.test_id = tq2.test_id
WHERE t1.test_id <= 11
AND t2.test_id <= 11
GROUP BY t1.test_id, t2.test_id;

I’ve created a simplified version (with randomised data) of the two tables at this SQL Fiddle

Note: I’m using MySQL as my DBMS but the SQL should be compatible with the ANSI standard.

Edit: The program I wrote to generate the tests actually generated more than the number of tests I needed and I only want to compare the first n tests. In the example, I added a <= 5 WHERE condition to ignore the extra tests.

To clarify what I’m looking for as per Thorsten Kettner’s example data:

test 1: a, b and c
test 2: a, b and d
test 3: d, e and f

The results would be:

Test   Test   Overlap
Test1  Test2  2 (a and b in common)
Test1  Test3  0 (no questions in common)
Test2  Test3  1 (d is common to both)
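The three-test example above could be loaded into a scratch database to try candidate queries against it. This is a sketch under assumptions: the column types are guesses, questions a to f are encoded as the integers 1 to 6, and the tables are assumed to be empty or not yet created.

```sql
-- Hypothetical minimal tables; skipped if they already exist
CREATE TABLE IF NOT EXISTS tests (
    test_id INT NOT NULL PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS test_questions (
    test_id     INT NOT NULL,
    question_id INT NOT NULL,
    PRIMARY KEY (test_id, question_id)
);

INSERT INTO tests (test_id) VALUES (1), (2), (3);

-- Questions a..f are encoded as 1..6 (an assumption made for this sketch)
INSERT INTO test_questions (test_id, question_id) VALUES
    (1, 1), (1, 2), (1, 3),   -- test 1: a, b, c
    (2, 1), (2, 2), (2, 4),   -- test 2: a, b, d
    (3, 4), (3, 5), (3, 6);   -- test 3: d, e, f
```

A query that solves the problem should report the overlaps listed in the table above for this data.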

Answer

You just need to add a GROUP BY to your first query (basically). I also added another condition, so that the test ids in each pair are produced in order:

SELECT tq1.test_id as test_id1, tq2.test_id as test_id2, count(tq1.question_id) AS Overlap
FROM test_questions tq1 JOIN   -- inner join: only questions that appear in both tests are counted
     test_questions tq2
     ON tq1.question_id = tq2.question_id and
        tq1.test_id < tq2.test_id
GROUP BY tq1.test_id, tq2.test_id;

This is standard SQL.
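The question also asks to compare only the first n tests and to check the overlap against a bound. One way this might be layered onto the pairwise self-join is sketched below, using an inner join; the limit of 5 comes from the question, while the bound of 2 in the HAVING clause is purely hypothetical.

```sql
-- Pairs among the first 5 tests whose overlap exceeds a hypothetical bound
SELECT tq1.test_id AS test_id1,
       tq2.test_id AS test_id2,
       COUNT(*)    AS Overlap
FROM test_questions AS tq1
JOIN test_questions AS tq2
  ON tq1.question_id = tq2.question_id
 AND tq1.test_id < tq2.test_id
WHERE tq1.test_id <= 5
  AND tq2.test_id <= 5
GROUP BY tq1.test_id, tq2.test_id
HAVING COUNT(*) > 2;   -- 2 is a made-up upper bound; adjust or drop as needed
```

Because this form only returns pairs that share at least one question, it can flag pairs that overlap too much but cannot list pairs with zero overlap.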

If you want to get all pairs of tests, even those that have no questions in common, here is another approach:

SELECT t1.test_id as test_id1, t2.test_id as test_id2, count(tq2.question_id) AS Overlap
FROM tests t1 JOIN             -- FROM line reconstructed; t1 < t2 lists each pair once
     tests t2
     ON t1.test_id < t2.test_id LEFT JOIN
     test_questions tq1
     on t1.test_id = tq1.test_id LEFT JOIN
     test_questions tq2
     ON t2.test_id = tq2.test_id and tq1.question_id = tq2.question_id
GROUP BY t1.test_id, t2.test_id;

This assumes that you have a table with one row per test. If not, replace tests with (select distinct test_id from test_questions).
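Spelled out, that substitution might look like the sketch below. It assumes the column is named test_id, as in the question's queries, and that each pair should be listed once (t1.test_id < t2.test_id), which is an assumption about the desired output.

```sql
-- All pairs of tests, including pairs with no questions in common,
-- without relying on a separate tests table
SELECT t1.test_id AS test_id1,
       t2.test_id AS test_id2,
       COUNT(tq2.question_id) AS Overlap   -- counts only matched questions; 0 when none match
FROM (SELECT DISTINCT test_id FROM test_questions) AS t1
JOIN (SELECT DISTINCT test_id FROM test_questions) AS t2
  ON t1.test_id < t2.test_id
LEFT JOIN test_questions AS tq1
  ON tq1.test_id = t1.test_id
LEFT JOIN test_questions AS tq2
  ON tq2.test_id = t2.test_id
 AND tq2.question_id = tq1.question_id
GROUP BY t1.test_id, t2.test_id;
```

Note that a test with no rows in test_questions would not appear in the derived tables, so it would be missing from the result.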
