The Traveling Coder The Traveling Coder - 1 month ago 19
SQL Question

BigQuery: how to perform rolling timestamp window group count that produces row for each day

this is an extension to a question that I asked and resolved on StackOverflow here.

I'm a BigQuery and SQL novice and I wanted to construct a Standard SQL query that would group and count events over a rolling time window of X days. My data table looks like this:

event_id | url | timestamp
-----------------------------------------------------------
xx a.html 2016-10-18 15:55:16 UTC
xx a.html 2016-10-19 16:68:55 UTC
xx a.html 2016-10-25 20:55:57 UTC
yy b.html 2016-10-18 15:58:09 UTC
yy a.html 2016-10-18 08:32:43 UTC
zz a.html 2016-10-20 04:44:22 UTC
zz c.html 2016-10-21 02:12:34 UTC


I'm tracking events that occur on urls. I want to know how many times each event occurred on each url during a rolling time period of X days. When I asked this question, I got a great answer:

WITH dailyAggregations AS (
SELECT
DATE(ts) AS day,
url,
event_id,
UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
COUNT(1) AS events
FROM yourTable
GROUP BY day, url, event_id, sec
)
SELECT
url, event_id, day, events,
SUM(events)
OVER(PARTITION BY url, event_id ORDER BY sec
RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
) AS rolling4daysEvents
FROM dailyAggregations


where 259200 is 3 days in seconds (3x24x3600). As I understand it, this query creates an intermediate table that groups and counts events by day. It also converts the timestamp field into its unix second equivalent. Then it sums up the events using a window that is measured in seconds.

Now this will produce a table with correct running totals, but it does not guarantee a row for every date, url, and event. In other words, there will be dates missing from the resultant table if there were dates when a given event never occurred on a given url. Bottom line, can I modify the above query (or construct a different query) that will correctly produce values for rolling4daysEvents for each date in an interval? eg: like an interval defined as:

SELECT *
FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
ORDER BY day ASC


Thanks!

Answer
WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT 
  c.day, url, event_id, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day