Manuel G - 1 year ago 90

SQL Question

I am trying to query Google BigQuery public Reddit Dataset. My goal is to compute the similarity of subreddits using Jaccards' Index , which is defined by:

My plan is to select the top N=1000 subreddits in terms of number of comments in August 2016. Then compute their cartesian product to get the combinations of all the subreddits in a

`subreddit1, subreddit2`

Then use those rows of combinations to query the union of users between subreddit1 and subreddit 2 as well as the intersection.

The query I have so far is this:

`SELECT`

subreddit1,

subreddit2,

(SELECT

COUNT(DISTINCT author)

FROM `fh-bigquery.reddit_comments.2016_08`

WHERE subreddit = subreddit1

OR subreddit = subreddit2

LIMIT 1

) as subreddits_union,

(

SELECT

COUNT(DISTINCT author)

FROM `fh-bigquery.reddit_comments.2016_08`

WHERE subreddit = subreddit1

AND author IN (

SELECT author

FROM `fh-bigquery.reddit_comments.2016_08`

WHERE subreddit= subreddit2

GROUP BY author

) as subreddits_intersection

FROM

(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2

FROM (

SELECT subreddit, count(*) as n_comments

FROM `fh-bigquery.reddit_comments.2016_08`

GROUP BY subreddit

ORDER BY n_comments DESC

LIMIT 1000

) a

CROSS JOIN (

SELECT subreddit, count(*) as n_comments

FROM `fh-bigquery.reddit_comments.2016_08`

GROUP BY subreddit

ORDER BY n_comments DESC

LIMIT 1000

) b

WHERE a.subreddit < b.subreddit

)

Which ideally would give the results:

`subreddit1, subreddit2, subreddits_union, subreddits_interception`

-----------------------------------------------------------------

Art | Politics | 50000 | 21000

Art | Science | 92320 | 15000

... | ... | ... | ...

However, this query gives me the following BigQuery error:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Which I understand. However I don't think this query can be translated into an efficient join. Given that BQ does not have an apply method, is there any way this query could be set up without resorting to individual queries? Maybe with a

`PARTITION BY`

Answer Source

Thanks for your answer. This one works pretty well in returning the subreddit union , however, how would you implement the intersection ?

Maybe something along the lines of

```
WITH top_most AS (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 20
),
authors AS (
SELECT DISTINCT author, subreddit
FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT
count(DISTINCT a1.author),
subreddit1, subreddit2
FROM
(
SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
)
INNER JOIN authors a1 on a1.subreddit = subreddit1
INNER JOIN authors a2 on a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2
```