Tudor Ciotlos Tudor Ciotlos - 7 months ago 15
SQL Question

Recursive SQL - count number of descendants in hierarchical structure

Consider a database table with the following columns:


  • mathematician_id

  • name

  • advisor1

  • advisor2



The database represents data from the Math Genealogy Project, where each mathematician usually has one single advisor, but there are situations when there are two advisors.

Visual aid to make things clearer:

enter image description here

How do I count the number of descendants for each of the mathematicians?

I should probably use Common Table Expressions (WITH RECURSIVE), but I am pretty much stuck at the moment. All the similar examples I found deal with hierarchies having only one parent, not two.

Update:

I adapted the solution for SQL Server provided by Vladimir Baranov to also work in PostgreSQL:

WITH RECURSIVE cte AS (
SELECT m.id as start_id,
m.id,
m.name,
m.advisor1,
m.advisor2,
1 AS level
FROM public.mathematicians AS m

UNION ALL

SELECT cte.start_id,
m.id,
m.name,
m.advisor1,
m.advisor2,
cte.level + 1 AS level
FROM public.mathematicians AS m
INNER JOIN cte ON cte.id = m.advisor1
OR cte.id = m.advisor2
),
cte_distinct AS (
SELECT DISTINCT start_id, id
FROM cte
)
SELECT cte_distinct.start_id,
m.name,
COUNT(*)-1 AS descendants_count
FROM cte_distinct
INNER JOIN public.mathematicians AS m ON m.id = cte_distinct.start_id
GROUP BY cte_distinct.start_id, m.name
ORDER BY cte_distinct.start_id

Answer

You didn't say what DBMS you use. I'll use SQL Server for this example, but it will work in other databases that support recursive queries as well.

Sample data

I entered only the right part of your tree, starting from Euler.

The most interesting part is the multiple paths between Lagrange and Dirichlet.

DECLARE @T TABLE (ID int, name nvarchar(50), Advisor1ID int, Advisor2ID int);

INSERT INTO @T (ID, name, Advisor1ID, Advisor2ID) VALUES
(1,  'Euler', NULL, NULL),
(2,  'Lagrange', 1, NULL),
(3,  'Laplace', NULL, NULL),
(4,  'Fourier', 2, NULL),
(5,  'Poisson', 2, 3),
(6,  'Dirichlet', 4, 5),
(7,  'Lipschitz', 6, NULL),
(8,  'Klein', NULL, 7),
(9,  'Lindemann', 8, NULL),
(10, 'Furtwangler', 8, NULL),
(11, 'Hilbert', 9, NULL),
(12, 'Taussky-Todd', 10, NULL);

This is how it looks like:

SELECT * FROM @T;

+----+--------------+------------+------------+
| ID |     name     | Advisor1ID | Advisor2ID |
+----+--------------+------------+------------+
|  1 | Euler        | NULL       | NULL       |
|  2 | Lagrange     | 1          | NULL       |
|  3 | Laplace      | NULL       | NULL       |
|  4 | Fourier      | 2          | NULL       |
|  5 | Poisson      | 2          | 3          |
|  6 | Dirichlet    | 4          | 5          |
|  7 | Lipschitz    | 6          | NULL       |
|  8 | Klein        | NULL       | 7          |
|  9 | Lindemann    | 8          | NULL       |
| 10 | Furtwangler  | 8          | NULL       |
| 11 | Hilbert      | 9          | NULL       |
| 12 | Taussky-Todd | 10         | NULL       |
+----+--------------+------------+------------+

Query

It is a classic recursive query with two interesting points.

1) The recursive part of the CTE joins to the anchor part using both Advisor1ID and Advisor2ID:

        INNER JOIN CTE 
            ON CTE.ID = T.Advisor1ID
            OR CTE.ID = T.Advisor2ID

2) Since it is possible to have multiple paths to the descendant, recursive query may output the node several times. To eliminate these duplicates I used DISTINCT in CTE_Distinct. It may be possible to solve it more efficiently.

To understand better how the query works run each CTE separately and examine intermediate results.

WITH
CTE
AS
(
    SELECT
        T.ID AS StartID
        ,T.ID
        ,T.name
        ,T.Advisor1ID
        ,T.Advisor2ID
        ,1 AS Lvl
    FROM @T AS T

    UNION ALL

    SELECT
        CTE.StartID
        ,T.ID
        ,T.name
        ,T.Advisor1ID
        ,T.Advisor2ID
        ,CTE.Lvl + 1 AS Lvl
    FROM
        @T AS T
        INNER JOIN CTE 
            ON CTE.ID = T.Advisor1ID
            OR CTE.ID = T.Advisor2ID
)
,CTE_Distinct
AS
(
    SELECT DISTINCT
        StartID
        ,ID
    FROM CTE
)
SELECT
    CTE_Distinct.StartID
    ,T.name
    ,COUNT(*) AS DescendantCount
FROM
    CTE_Distinct
    INNER JOIN @T AS T ON T.ID = CTE_Distinct.StartID
GROUP BY
    CTE_Distinct.StartID
    ,T.name
ORDER BY CTE_Distinct.StartID;

Result

+---------+--------------+-----------------+
| StartID |     name     | DescendantCount |
+---------+--------------+-----------------+
|       1 | Euler        |              11 |
|       2 | Lagrange     |              10 |
|       3 | Laplace      |               9 |
|       4 | Fourier      |               8 |
|       5 | Poisson      |               8 |
|       6 | Dirichlet    |               7 |
|       7 | Lipschitz    |               6 |
|       8 | Klein        |               5 |
|       9 | Lindemann    |               2 |
|      10 | Furtwangler  |               2 |
|      11 | Hilbert      |               1 |
|      12 | Taussky-Todd |               1 |
+---------+--------------+-----------------+

Here DescendantCount counts the node itself as a descendant. You can subtract 1 from this result if you want to see 0 instead of 1 for the leaf nodes.

Here is SQL Fiddle.

Comments