AspiringSoftwareDeveloper AspiringSoftwareDeveloper - 6 months ago 131
SQL Question

SQL- jaccard similarity

My table looks as follows:

author | group

daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...


I want my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam | harry| 0 | 10


THANK YOU

Answer

Try below (for BigQuery)

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added solution for BigQuery Standard SQL

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

What I like about this one:

  • much simpler / friendlier code
  • no CROSS JOIN and extra GROUP BY needed

When/If try - make sure to uncheck Use Legacy SQL checkbox under Show Options