Dovy Dovy - 4 months ago 88
SQL Question

BigQuery DeDuplication on two columns as unique key

We use BigQuery religiously and have two tables that essentially were updated in parallel by different process. The problem I have we don't have a unique identifier for tables and the goal is to combine the two tables with zero duplication if possible.. The unique identifier is two columns combined.

I've tried various MySQL-based queries, but none seem to work in BigQuery. So I am posting here for some assistance. :)

Step 1. Copy the "clean" table into a new merged table.

Step 2. Query the "dirty" (old) table and insert any missing entries.

Query Attempt 1:

SELECT
COUNT(c.*)
FROM
[flash-student-96619:device_data.device_datav3_20160530] AS old
WHERE NOT EXISTS (
SELECT
1
FROM
[flash-student-96619:device_data_v7_merged.20160530] AS new
WHERE
new.dsn = old.dsn
AND new.timestamp = old.timestamp
)


Error: error at: 6.1 - 10.65. Only one query can be executed at a time.

Query Attempt 2:

SELECT
*
FROM
[flash-student-96619:device_data.device_datav3_20160530]
WHERE
(dsn, timestamp) NOT IN (
SELECT
dsn,
timestamp
FROM
[flash-student-96619:device_data_v7_merged.20160530]
)


Error: Encountered " "," ", "" at line 6, column 7. Was expecting: ")" ...

Honestly, if I could do this in one query I would be happy. I need to fetch from two tables, and make a new one with unique data.

Any assistance?

Answer

Something like below should work

SELECT * 
FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY dsn, timestamp) AS dup
  FROM
    [flash-student-96619:device_data.device_datav3_20160530],
    [flash-student-96619:device_data_v7_merged.20160530] 
) 
WHERE dup = 1  

I recommend using explicit list of fields instead of * in outer SELECT so you can omit dup from actual output