dnaeye dnaeye - 3 months ago 9
SQL Question

How can I find nearest matching time from the same table?

I'm trying to match the start and end times a user watches a TV show from a single data table in Google BigQuery, but I'm not sure how to do this because I keep getting an error saying, "Table name cannot be resolved: dataset name is missing."

Events table

user_id show_id event_type logtime
------- -------- ---------- -----------------------
john 123 start 2016-08-01 06:00:00 UTC
john 123 start 2016-08-01 06:15:00 UTC
john 123 end 2016-08-01 06:10:00 UTC
john 123 end 2016-08-01 06:16:00 UTC


Desired results

user_id show_id start_time end_time
------- -------- ----------------------- -----------------------
john 123 2016-08-01 06:00:00 UTC 2016-08-01 06:10:00 UTC
john 123 2016-08-01 06:15:00 UTC 2016-08-01 06:16:00 UTC


This is my query so far:

SELECT user_id, show_id, st.logtime AS start_time, et.logtime AS end_time
FROM
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'start') AS st
JOIN
(SELECT user_id, show_id, logtime FROM events WHERE event_type = 'end') AS et
ON
st.logtime = (SELECT min(logtime) FROM events WHERE event_type = 'end')
AND st.user_id = et.user_id AND st.show_id = et.show_id


Mikhail's answer seems to work the best after validating a few examples, but...

SELECT
user_id, show_id,
logtime AS start_time,
next_logtime AS end_time
FROM (
SELECT
user_id, show_id, event_type, logtime,
LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
FROM events
)
WHERE event_type = 'start'
AND next_event_type = 'end'


I don't know how to incorporate logic to handle consecutive instances of the same event_type. For example:

event_type logtime
---------- ------------------------
start 2016-08-01 09:20:00 UTC
start 2016-08-01 09:23:00 UTC
start 2016-08-01 09:24:00 UTC
end 2016-08-01 09:24:00 UTC
end 2016-08-01 09:24:00 UTC


In this scenario, I'd like to keep the earliest start time, 09:20, and the earliest end time, 09:24 (I think this makes sense...).

Answer

try below

SELECT 
  user_id, show_id,  
  logtime AS start_time,
  next_logtime AS end_time
FROM (
  SELECT 
    user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime
  FROM events 
)
WHERE event_type = 'start'  

Unfortunately, the data is pretty dirty, so there are events that may have a start time but no end time and vice versa

below example ignores start without end and vice versa
can be ajusted to whatever logic you have in mind though

SELECT 
  user_id, show_id,  
  logtime AS start_time,
  next_logtime AS end_time
FROM (
  SELECT 
    user_id, show_id, event_type, logtime,
    LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
    LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type
  FROM events 
)
WHERE event_type = 'start'
AND next_event_type = 'end' 

I'd like to keep the earliest start time, 09:20, and the earliest end time

SELECT 
  user_id, show_id, 
  MIN(start_time) AS start_time,
  MAX(end_time) AS end_time
FROM (
  SELECT 
    user_id, show_id,  
    logtime AS start_time,
    next_logtime AS end_time,
    SUM(event_type <> next_event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING  ) AS grp
  FROM (
    SELECT 
      user_id, show_id, event_type, logtime,
      LEAD(logtime) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_logtime,
      LEAD(event_type) OVER(PARTITION BY user_id, show_id ORDER BY logtime) AS next_event_type,
    FROM events 
  )
  WHERE event_type = 'start'
)
GROUP BY user_id, show_id, grp