I'd like to use a window function to determine, for each row, the total number of preceding records meeting a certain criteria.
A specific example:
clone=# \d test
Column | Type | Modifiers
id | bigint |
date | timestamp without time zone |
SELECT id, date, count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))
SELECT a.id, a.date, count(b.*)-1
FROM test a, test b
WHERE (b.date >= a.date - '1 hour'::interval AND b.date < a.date)
GROUP BY 1,2
ORDER BY 2;
I don't think you can do this cheaply with a plain query, CTEs and window functions - their frame definition is static, but you need a dynamic frame.
Generally, you'll have to define lower and upper border of your window carefully: The following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...
ts instead of reserved word
date as column name.
CREATE TEMP TABLE test ( id bigint ,ts timestamp );
Use CTEs, aggregate timestamps into an array, unnest, count ...
While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.
I took Roman's query and tried to streamline it a bit:
- Remove 2nd CTE which is not necessary.
- Transform 1st CTE into subquery, which is faster.
count() instead of re-aggregating into an array and counting with
But array handling is expensive, and performance still deteriorates badly with more rows.
SELECT id, ts ,(SELECT count(*)::int - 1 FROM unnest(dates) x WHERE x >= sub.ts - interval '1h') AS ct FROM ( SELECT id, ts ,array_agg(ts) OVER(ORDER BY ts) AS dates FROM test ) sub;
You could solve this with a plain and simple, ugly correlated subquery. A lot faster, but still ...
SELECT id, ts ,(SELECT count(*) FROM test t1 WHERE t1.ts >= t.ts - interval '1h' AND t1.ts < t.ts) AS ct FROM test t ORDER BY ts;
Loop over rows in chronological order with a
row_number() in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers. Should perform nicely.
CREATE OR REPLACE FUNCTION running_window_ct() RETURNS TABLE (id bigint, ts timestamp, ct int) AS $func$ DECLARE i CONSTANT interval = '1h'; -- given interval for time frame cur CURSOR FOR SELECT t.ts + i AS ts1 -- incremented by given interval , row_number() OVER (ORDER BY t.ts) AS rn FROM test t ORDER BY t.ts; -- in chronological order rec record; -- for current row from cursor rn int; BEGIN OPEN cur; FETCH cur INTO rec; -- open cursor, fetch first row ct := -1; -- init; -1 covers special case at start FOR id, ts, rn IN SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts) FROM test t ORDER BY t.ts -- in same chronological order as cursor LOOP IF rec.ts1 >= ts THEN -- still in range ... ct := ct + 1; -- ... just increment ELSE -- out of range ... LOOP -- ... advance cursor FETCH cur INTO rec; EXIT WHEN rec.ts1 >= ts; -- earliest row within time frame END LOOP; ct := rn - rec.rn; -- new count END IF; RETURN NEXT; END LOOP; END $func$ LANGUAGE plpgsql;
SELECT * FROM running_window_ct();
With the table from above I ran a quick benchmark on my old test server: PostgreSQL 9.1.9 on Debian).
-- TRUNCATE test; INSERT INTO test SELECT g, '2013-08-08'::timestamp + g * interval '5 min' + random() * 300 * interval '1 min' -- halfway realistic values FROM generate_series(1, 10000) g; CREATE INDEX test_ts_idx ON test (ts); ANALYZE test; -- temp table needs manual analyze
I varied the bold part for each run and took the best of 5 with
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms
COR: 6760 ms
FNC: 828 ms
The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.