S.Mohsen sh S.Mohsen sh - 3 months ago 13
SQL Question

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field

ID
and another field as
collection_time
. I want to select latest record for each ID. Unfortunately combination of
(ID, collection_time)
time is not unique together in my data. I want just one of records with the maximum
collection time
. I have tried two solutions but none of them has worked for me:

First: using query

SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1


This results in
Resources exceeded
error that I guess is because of
ORDER BY
in the query.

Second
Using join between table and latest time:

(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)


this solution does not work for me because
(ID, collection_time)
are not unique together so in
JOIN
result there would be multiple rows for each
ID
.

I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?

Answer

Quick and dirty option - combine your both queries into one - first get all records with latest collection_time (using your second query) and then dedup them using your first query:

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn 
  FROM (
    SELECT tab1.* 
    FROM mytable AS tab1
    INNER JOIN (
      SELECT ID, MAX(collection_time) AS second_time 
      FROM mytable GROUP BY ID
    ) AS tab2
    ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
  )
)
WHERE rn = 1  

And with Standard SQL (proposed by S.Mohsen sh)

WITH myTable AS (
  SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
  SELECT ID,
  MAX(collection_time) AS second_time 
  FROM myTable GROUP BY ID
),
tab2 AS (
  SELECT * FROM myTable
),
joint AS (
  SELECT tab2.* 
  FROM tab2 INNER JOIN tab1
  ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time 
)
SELECT * EXCEPT(rn) 
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn 
  FROM joint
)
WHERE rn=1
Comments