transistor1 transistor1 - 14 days ago 9
SQL Question

Connected Components

I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now, the resultant categories must be matched in such a way that groups related categories together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through rows.

I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.

Please note:


  • Due to company restrictions, I cannot use stored procedures.

  • I can't use user defined functions (UDFs)

  • I can't use user defined types (UDTs)



A correct answer will


  • Be written for Teradata or compatible

  • Be more efficient than my answer

  • Respect the restrictions I mention above




Sample Input:

Sample Input

Desired Output:

Desired Output

Answer

You need a recursive apporach, but your WITH RECURSIVE creates a humongous intermediate result, which leads to no more spool.

For a similar process I used following approach (originally USING A WHILE-loop in a Stored Procedure):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
 (
  SELECT group_id, category_1, category_2, 
     -- assign a unique number to 
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk,

  -- remove when you source data is unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
  FROM match_detail 
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

Now repeat the following update until 0 rows processed:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;

To get the related categories you finally need:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

For an exact match of your expected result you need to add a DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt