cgn cgn - 7 months ago 31
SQL Question

Google Bigquery inconsistent when variable names changes in ORDER BY clause

My goal is to test if the grp's generated by one query, are the same grp's as the output of the same query. However, when I change a single variable name, I get different results.

Below I show an example of the same query where we know the results are the same. However, if you run this group, you will find one query produces different results than another.

SELECT grp
FROM
(
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM (
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test1 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test1
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN
(
SELECT grp FROM (
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM
(
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test2
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
)


Far worse... now if you try running the exact same query, but simply change the variable name test1 to test3, you will get completely different results.

SELECT grp
FROM
(
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM (
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test3 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test3
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN
(
SELECT grp FROM (
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM
(
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test2
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
)


I can think of no explanation that satisfies both of these bizarre behaviors and this is preventing me from being able to validate my data. Any ideas?

EDIT:

I've updated the BigQuery SQL in the way the responses would suggest, and the same inconsistencies occur.

Answer

The problem is nondeterminism in your row numbering.

There are many examples in this table where (word_count * word_count * corpus_date) is the same for several corpuses. So when you partition by word and order by test2, the ordering you use for assigning row numbers is nondeterministic.

When you run the same subquery twice within the same top-level query, BigQuery actually executes that subquery twice and may yield different results between the two runs due to that nondeterminism.

Changing the alias might have just caused your query to not hit in the cache, resulting in a different set of nondeterministic choices and different amount of overlap between the results.

You can confirm this by changing the ORDER BY clause in your analytic functions to include corpus. For example, change ORDER BY test2 to ORDER BY test2, corpus. Then the row numbering will be deterministic, and the queries will return zero results regardless of what aliases you use.