Robert Robert - 3 months ago
SQL Question

How to increase the performance of a query that counts records by a specific time interval?

I use PostgreSQL 9.5. Below you can find the structure of my table, my query, and the result of the query. I would like to increase the performance of my query. The query counts the records by a specific time interval, for example:

250 milliseconds, 1 second, 22 minutes, 2 days and 30 minutes, etc.

The query is fast for large intervals like 60 minutes, but for small intervals like 4 seconds it's very slow.

The most important things:


  • I work with a large database (20 million rows and more), but the query operates only on a part of it, selected by the WHERE clause (for example, 1 million rows or more).

  • The id_user_table and sip columns are always present in the WHERE clause. In some cases, the WHERE clause could include all of the columns of the table; it depends on the user's choice.

  • At the moment I've created a B-Tree index on the starttime column:

    CREATE INDEX starttime_interval ON data_store (starttime);



Do you know of some ways to increase the performance of my query?

For example, by means of:


  • creating some indexes on the columns (which indexes? and how to create them?),

  • improving my query,

  • changing some settings in the PostgreSQL,

  • or something else.



Here's the structure of my table:

column_name   | udt_name    | length | is_nullable | key
--------------+-------------+--------+-------------+-----
id            | int8        |        | NO          | PK
id_user_table | int4        |        | NO          | FK
starttime     | timestamptz |        | NO          |
time          | float8      |        | NO          |
sip           | varchar     | 100    | NO          |
dip           | varchar     | 100    | NO          |
sport         | int4        |        | YES         |
dport         | int4        |        | YES         |
proto         | varchar     | 50     | NO          |
totbytes      | int8        |        | YES         |
info          | text        |        | YES         |
label         | varchar     | 10     | NO          |


A simple SELECT * FROM data_store WHERE id_user_table=1 AND sip='147.32.84.138' ORDER BY starttime returns this:

 id  | id_user_table | starttime                  | sip           | other columns...
-----+---------------+----------------------------+---------------+------------------
 185 | 1             | 2011-09-12 15:24:03.248+02 | 147.32.84.138 | ...
 189 | 1             | 2011-09-12 15:24:03.256+02 | 147.32.84.138 | ...
 312 | 1             | 2011-09-12 15:24:06.112+02 | 147.32.84.138 | ...
 313 | 1             | 2011-09-12 15:24:06.119+02 | 147.32.84.138 | ...
 450 | 1             | 2011-09-12 15:24:09.196+02 | 147.32.84.138 | ...
 451 | 1             | 2011-09-12 15:24:09.203+02 | 147.32.84.138 | ...
 452 | 1             | 2011-09-12 15:24:09.21+02  | 147.32.84.138 | ...


Here's my query for the 4 seconds time intervals:

WITH generate_period AS (
    SELECT generate_series(date_trunc('second', min(starttime)),
                           date_trunc('second', max(starttime)),
                           interval '4 second') AS tp
    FROM data_store
    WHERE id_user_table=1 AND sip='147.32.84.138' --other restrictions
), data_series AS (
    SELECT date_trunc('second', starttime) AS starttime, count(*) AS ct
    FROM data_store
    WHERE id_user_table=1 AND sip='147.32.84.138' --other restrictions
    GROUP BY 1
)
SELECT gp.tp AS starttime_from,
       gp.tp + interval '4 second' AS starttime_to,
       COALESCE(sum(ds.ct), 0) AS ct
FROM generate_period gp
LEFT JOIN data_series ds ON ds.starttime >= gp.tp
                        AND ds.starttime < gp.tp + interval '4 second'
GROUP BY 1
ORDER BY 1;


Here's the result of the query:

 starttime_from         | starttime_to           | ct
------------------------+------------------------+----
 2011-09-12 15:24:03+02 | 2011-09-12 15:24:07+02 | 4
 2011-09-12 15:24:07+02 | 2011-09-12 15:24:11+02 | 3
 2011-09-12 15:24:11+02 | 2011-09-12 15:24:15+02 | 0
 ...                    | ...                    | ...
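As a sanity check, the bucketing the query performs can be sketched in Python. This is a simplified model, not the production path: timezone handling is omitted, and bucket_counts is a hypothetical helper mirroring the SQL logic of truncating min(starttime) to a whole second and counting rows per fixed-width bucket.

```python
from datetime import datetime, timedelta

def bucket_counts(timestamps, interval):
    """Count timestamps per fixed-width bucket, starting from the earliest
    timestamp truncated to a whole second -- mirrors
    date_trunc('second', min(starttime)) in the query."""
    start = min(timestamps).replace(microsecond=0)
    end = max(timestamps).replace(microsecond=0)
    # Pre-create every bucket so empty intervals show up with count 0,
    # like the LEFT JOIN against generate_series().
    counts = {}
    t = start
    while t <= end:
        counts[t] = 0
        t += interval
    # Assign each row to its bucket via integer division of the offset.
    for ts in timestamps:
        bucket = start + ((ts - start) // interval) * interval
        counts[bucket] += 1
    return counts

# The sample starttime values from the question (timezone dropped).
rows = [
    datetime(2011, 9, 12, 15, 24, 3, 248000),
    datetime(2011, 9, 12, 15, 24, 3, 256000),
    datetime(2011, 9, 12, 15, 24, 6, 112000),
    datetime(2011, 9, 12, 15, 24, 6, 119000),
    datetime(2011, 9, 12, 15, 24, 9, 196000),
    datetime(2011, 9, 12, 15, 24, 9, 203000),
    datetime(2011, 9, 12, 15, 24, 9, 210000),
]
counts = bucket_counts(rows, timedelta(seconds=4))
# -> {15:24:03: 4, 15:24:07: 3}, matching the first two result rows
```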


Here's the result of the EXPLAIN ANALYZE which I received in pgAdmin for the 4 seconds time intervals:

Sort  (cost=7477837.88..7477838.38 rows=200 width=16) (actual time=1537280.238..1537289.519 rows=60141 loops=1)
  Sort Key: gp.tp
  Sort Method: external merge  Disk: 1792kB
  CTE generate_period
    ->  Aggregate  (cost=166919.73..166924.74 rows=1000 width=8) (actual time=752.301..823.022 rows=60141 loops=1)
          ->  Seq Scan on data_store  (cost=0.00..163427.57 rows=698431 width=8) (actual time=0.034..703.845 rows=679951 loops=1)
                Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                Rows Removed by Filter: 4030687
  CTE data_series
    ->  GroupAggregate  (cost=242521.00..250085.18 rows=186076 width=8) (actual time=1233.414..1341.701 rows=57555 loops=1)
          Group Key: (date_trunc('second'::text, data_store_1.starttime))
          ->  Sort  (cost=242521.00..244267.08 rows=698431 width=8) (actual time=1233.407..1284.110 rows=679951 loops=1)
                Sort Key: (date_trunc('second'::text, data_store_1.starttime))
                Sort Method: external sort  Disk: 11960kB
                ->  Seq Scan on data_store data_store_1  (cost=0.00..165173.65 rows=698431 width=8) (actual time=0.043..886.224 rows=679951 loops=1)
                      Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                      Rows Removed by Filter: 4030687
  ->  HashAggregate  (cost=7060817.31..7060820.31 rows=200 width=16) (actual time=1537215.586..1537240.698 rows=60141 loops=1)
        Group Key: gp.tp
        ->  Nested Loop Left Join  (cost=0.00..6957441.76 rows=20675111 width=16) (actual time=1985.731..1536921.862 rows=74443 loops=1)
              Join Filter: ((date_trunc('second'::text, ds.starttime) >= gp.tp) AND (date_trunc('second'::text, ds.starttime) < (gp.tp + '00:00:04'::interval)))
              Rows Removed by Join Filter: 3461357700
              ->  CTE Scan on generate_period gp  (cost=0.00..20.00 rows=1000 width=8) (actual time=752.303..910.810 rows=60141 loops=1)
              ->  CTE Scan on data_series ds  (cost=0.00..3721.52 rows=186076 width=16) (actual time=0.021..3.716 rows=57555 loops=60141)
Planning time: 0.258 ms
Execution time: 1537389.102 ms


Update

Here's another query, this time without the WITH CTE and the date_trunc() expression, so maybe it will be easier to optimize:

SELECT gp.tp AS starttime_from,
       gp.tp + interval '4 second' AS starttime_to,
       count(ds.id)
FROM (SELECT generate_series(min(starttime), max(starttime), interval '4 second') AS tp
      FROM data_store
      WHERE id_user_table=1 AND sip='147.32.84.138' --other restrictions
     ) gp
LEFT JOIN data_store ds
       ON ds.starttime >= gp.tp AND ds.starttime < gp.tp + interval '4 second'
      AND ds.id_user_table=1 AND ds.sip='147.32.84.138' --other restrictions
GROUP BY gp.tp
ORDER BY gp.tp;


The above query is much faster than the first one. The B-Tree index on the starttime column is now used, but it's still not enough. If I set the time interval to 100 milliseconds, I still have to wait too long. The 100 milliseconds range is the minimal time interval which the user can set. I've just added a B-Tree index on the sip column, but it doesn't help.

Here's the result of the EXPLAIN ANALYZE which I received in pgAdmin for the 100 ms time intervals:

Sort  (cost=14672356.96..14672357.46 rows=200 width=16) (actual time=9380.768..9951.074 rows=2405621 loops=1)
  Sort Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
  Sort Method: external merge  Disk: 79880kB
  ->  HashAggregate  (cost=14672346.81..14672349.31 rows=200 width=16) (actual time=6199.538..7232.962 rows=2405621 loops=1)
        Group Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
        ->  Nested Loop Left Join  (cost=2.02..14284329.59 rows=77603444 width=16) (actual time=0.321..4764.648 rows=3006226 loops=1)
              ->  Result  (cost=1.58..6.59 rows=1000 width=0) (actual time=0.295..159.147 rows=2405621 loops=1)
                    InitPlan 1 (returns $0)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.208..0.208 rows=1 loops=1)
                            ->  Index Scan using starttime_interval on data_store  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.204..0.204 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 144
                    InitPlan 2 (returns $1)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.050..0.050 rows=1 loops=1)
                            ->  Index Scan Backward using starttime_interval on data_store data_store_1  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.049..0.049 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 23
              ->  Index Scan using starttime_interval on data_store ds  (cost=0.44..13508.28 rows=77603 width=16) (actual time=0.002..0.002 rows=0 loops=2405621)
                    Index Cond: ((starttime >= (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))) AND (starttime < ((generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00 (...)
                    Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                    Rows Removed by Filter: 2
Planning time: 1.299 ms
Execution time: 11641.154 ms

Answer

Based on the comments from @pozs and @RadekPostołowicz, the final query is as follows (for 4-second time intervals):

SELECT gp.tp AS starttime_from, gp.tp + interval '4 second' AS starttime_to, count(ds.id)
FROM (SELECT generate_series(min(starttime),max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138'
      ORDER BY 1
     ) gp 
     LEFT JOIN data_store ds 
     ON ds.id_user_table=1 and ds.sip='147.32.84.138' 
        and ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
GROUP BY starttime_from

As @pozs noticed, for very small time intervals the result of the query includes a lot of zero-count rows, and these rows eat up space. In that case the query should include a HAVING count(ds.id) > 0 restriction, but then you have to handle the missing zero-count intervals on the client side. Here's the second version of the query, which includes the HAVING restriction:

SELECT gp.tp AS starttime_from, gp.tp + interval '4 second' AS starttime_to, count(ds.id)
FROM (SELECT generate_series(min(starttime),max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138'
      ORDER BY 1
     ) gp 
     LEFT JOIN data_store ds 
     ON ds.id_user_table=1 and ds.sip='147.32.84.138' 
        and ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
GROUP BY starttime_from
HAVING count(ds.id) > 0

The most important thing is creating the multicolumn index which @RadekPostołowicz has created in his comment/answer:

CREATE INDEX my_index ON data_store (id_user_table, sip, starttime);

Why these columns? Because the id_user_table, sip and starttime columns appear in the WHERE clause of every query.
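As a further sketch, not part of the accepted answer: when empty intervals are not needed (as in the HAVING variant above), the join can be avoided entirely on PostgreSQL 9.5 by computing each row's bucket with epoch arithmetic, so the whole result comes from a single filtered scan. Note that these buckets align to multiples of 4 seconds since the Unix epoch rather than to min(starttime), so the boundaries may differ slightly from the original query's.

```sql
-- Sketch: single scan, no join; each row is assigned to its 4-second
-- bucket directly. Empty intervals are simply absent from the output,
-- as with the HAVING variant. The multicolumn index still serves the
-- WHERE clause.
SELECT to_timestamp(floor(extract(epoch FROM starttime) / 4) * 4)  AS starttime_from,
       to_timestamp(floor(extract(epoch FROM starttime) / 4) * 4)
           + interval '4 second'                                   AS starttime_to,
       count(*) AS ct
FROM data_store
WHERE id_user_table = 1 AND sip = '147.32.84.138'
GROUP BY 1
ORDER BY 1;
```

For other interval widths, replace both occurrences of 4 with the interval length in seconds (e.g. 0.1 for 100 milliseconds).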
