Tudor Ciotlos - 1 year ago 79

SQL Question

Consider a database table with the following columns:

- mathematician_id
- name
- advisor1
- advisor2

The database represents data from the Math Genealogy Project, where each mathematician usually has one single advisor, but there are situations when there are two advisors.

Visual aid to make things clearer:

How do I count the number of descendants for each of the mathematicians?

I should probably use Common Table Expressions (

I adapted the solution for SQL Server provided by Vladimir Baranov to also work in PostgreSQL:

`WITH RECURSIVE cte AS (`

SELECT m.id as start_id,

m.id,

m.name,

m.advisor1,

m.advisor2,

1 AS level

FROM public.mathematicians AS m

UNION ALL

SELECT cte.start_id,

m.id,

m.name,

m.advisor1,

m.advisor2,

cte.level + 1 AS level

FROM public.mathematicians AS m

INNER JOIN cte ON cte.id = m.advisor1

OR cte.id = m.advisor2

),

cte_distinct AS (

SELECT DISTINCT start_id, id

FROM cte

)

SELECT cte_distinct.start_id,

m.name,

COUNT(*)-1 AS descendants_count

FROM cte_distinct

INNER JOIN public.mathematicians AS m ON m.id = cte_distinct.start_id

GROUP BY cte_distinct.start_id, m.name

ORDER BY cte_distinct.start_id

Answer Source

You didn't say what DBMS you use. I'll use SQL Server for this example, but it will work in other databases that support recursive queries as well.

**Sample data**

I entered only the right part of your tree, starting from Euler.

The most interesting part is the multiple paths between Lagrange and Dirichlet.

```
DECLARE @T TABLE (ID int, name nvarchar(50), Advisor1ID int, Advisor2ID int);
INSERT INTO @T (ID, name, Advisor1ID, Advisor2ID) VALUES
(1, 'Euler', NULL, NULL),
(2, 'Lagrange', 1, NULL),
(3, 'Laplace', NULL, NULL),
(4, 'Fourier', 2, NULL),
(5, 'Poisson', 2, 3),
(6, 'Dirichlet', 4, 5),
(7, 'Lipschitz', 6, NULL),
(8, 'Klein', NULL, 7),
(9, 'Lindemann', 8, NULL),
(10, 'Furtwangler', 8, NULL),
(11, 'Hilbert', 9, NULL),
(12, 'Taussky-Todd', 10, NULL);
```

This is how it looks like:

```
SELECT * FROM @T;
+----+--------------+------------+------------+
| ID | name | Advisor1ID | Advisor2ID |
+----+--------------+------------+------------+
| 1 | Euler | NULL | NULL |
| 2 | Lagrange | 1 | NULL |
| 3 | Laplace | NULL | NULL |
| 4 | Fourier | 2 | NULL |
| 5 | Poisson | 2 | 3 |
| 6 | Dirichlet | 4 | 5 |
| 7 | Lipschitz | 6 | NULL |
| 8 | Klein | NULL | 7 |
| 9 | Lindemann | 8 | NULL |
| 10 | Furtwangler | 8 | NULL |
| 11 | Hilbert | 9 | NULL |
| 12 | Taussky-Todd | 10 | NULL |
+----+--------------+------------+------------+
```

**Query**

It is a classic recursive query with two interesting points.

1) The recursive part of the `CTE`

joins to the anchor part using both `Advisor1ID`

and `Advisor2ID`

:

```
INNER JOIN CTE
ON CTE.ID = T.Advisor1ID
OR CTE.ID = T.Advisor2ID
```

2) Since it is possible to have multiple paths to the descendant, recursive query may output the node several times. To eliminate these duplicates I used `DISTINCT`

in `CTE_Distinct`

. It may be possible to solve it more efficiently.

To understand better how the query works run each CTE separately and examine intermediate results.

```
WITH
CTE
AS
(
SELECT
T.ID AS StartID
,T.ID
,T.name
,T.Advisor1ID
,T.Advisor2ID
,1 AS Lvl
FROM @T AS T
UNION ALL
SELECT
CTE.StartID
,T.ID
,T.name
,T.Advisor1ID
,T.Advisor2ID
,CTE.Lvl + 1 AS Lvl
FROM
@T AS T
INNER JOIN CTE
ON CTE.ID = T.Advisor1ID
OR CTE.ID = T.Advisor2ID
)
,CTE_Distinct
AS
(
SELECT DISTINCT
StartID
,ID
FROM CTE
)
SELECT
CTE_Distinct.StartID
,T.name
,COUNT(*) AS DescendantCount
FROM
CTE_Distinct
INNER JOIN @T AS T ON T.ID = CTE_Distinct.StartID
GROUP BY
CTE_Distinct.StartID
,T.name
ORDER BY CTE_Distinct.StartID;
```

**Result**

```
+---------+--------------+-----------------+
| StartID | name | DescendantCount |
+---------+--------------+-----------------+
| 1 | Euler | 11 |
| 2 | Lagrange | 10 |
| 3 | Laplace | 9 |
| 4 | Fourier | 8 |
| 5 | Poisson | 8 |
| 6 | Dirichlet | 7 |
| 7 | Lipschitz | 6 |
| 8 | Klein | 5 |
| 9 | Lindemann | 2 |
| 10 | Furtwangler | 2 |
| 11 | Hilbert | 1 |
| 12 | Taussky-Todd | 1 |
+---------+--------------+-----------------+
```

Here `DescendantCount`

counts the node itself as a descendant. You can subtract 1 from this result if you want to see 0 instead of 1 for the leaf nodes.

Here is SQL Fiddle.