JohnnyQ - 3 months ago
SQL Question

Huge performance differences between sum(column_name), sum(1) and count(*) on a large dataset

EDIT:

Since you guys suggested creating separate tables for player/tournament names and replacing strings with foreign keys I did the following:

SELECT DISTINCT tournament INTO tournaments FROM chess_data2
ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY


I repeated that for namew and nameb, then went about replacing the strings with foreign keys. Here is where it got tricky: I am not able to do the replacement in a reasonable amount of time.
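The analogous step for the player names could look like this (a sketch; the white and black name columns have to be merged into one deduplicated list first, and `players` is the table referenced in the queries below):

```sql
-- Merge white and black names into one list; UNION removes duplicates
SELECT name INTO players
FROM (SELECT namew AS name FROM chess_data2
      UNION
      SELECT nameb FROM chess_data2) AS all_names

ALTER TABLE players ADD COLUMN id SERIAL PRIMARY KEY
```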

My first approach was the following:

1) Delete the existing indexes

2) Create individual indexes on namew, nameb and tournament separately

3) Run a query inserting the data I want into a new table:

SELECT date, whiterank, blackrank, t_round, result,
(SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep,
(SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp,
(SELECT t.id FROM tournaments t WHERE c_d2.tournament = t.t_name) AS tournament
INTO final_chess FROM chess_data2 c_d2


Unfortunately it was very slow, so I went back to user Boris Shchegolev's suggestion: in a comment, he proposed creating a new column in the existing table chess_data2 and updating it in place. So I did:

ALTER TABLE chess_data2 ADD COLUMN namew_id INTEGER
UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew)


I started those queries half an hour ago; the first one was instant, but the second one is still running.

What should I do now about it?
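In PostgreSQL, a set-based `UPDATE ... FROM` join is usually far faster than a correlated subquery that is re-evaluated once per row, because the planner can join players to chess_data2 in a single pass. A sketch of that rewrite:

```sql
-- Join-based update: one scan of each table instead of one subquery per row
UPDATE chess_data2 cd2
SET namew_id = p.id
FROM players p
WHERE p.name = cd2.namew
```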

INITIAL QUESTION:

Database schema:

date DATE
namew TEXT
nameb TEXT
whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
result REAL
id BIGINT

Indexes:

chess_data2_pkey (id) UNIQUE
white_index (namew, tournament, date)
black_index (nameb, tournament, date)
w_b_t_d_index (namew, nameb, tournament, date)
Problem:

The performance of the following statement is very good (~60-70 seconds on a table with 3 million rows):

# Number of points that the white player has so far accrued throughout the tournament
SELECT
(SELECT coalesce(SUM(result),0) from chess_data2 t2
where (t1.namew = t2.namew) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(SUM(1-result),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS result_in_t_w
from chess_data2 t1


Meanwhile, the following select (which has EXACTLY the same where clauses) is taking forever to compute.

# Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(count(*),0) from chess_data2 t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(count(*),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1


I tried a different approach (with sum) and it didn't fare any better:

# Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(sum(1),0) from chess_data2 t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(sum(1),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1


Any idea what's going on here and how to fix it? I'm using Python 3.5 and psycopg2 in PyCharm to run these queries. I will be happy to provide any additional information, as this is a very important project for me.

EXPLAIN ANALYZE (Used for the last query):

Seq Scan on chess_data2 t1  (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
          ->  Index Only Scan using white_index on chess_data2 t2  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185)
                Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
                Filter: (t1.date < (date + 90))
                Rows Removed by Filter: 1
                Heap Fetches: 6009767
  SubPlan 2
    ->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
          ->  Index Only Scan using black_index on chess_data2 t2_1  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185)
                Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
                Filter: (t1.date < (date + 90))
                Rows Removed by Filter: 1
                Heap Fetches: 5303160
Planning time: 0.161 ms
Execution time: 81883.716 ms

Answer

The queries are performing poorly due to poor table design. From the EXPLAIN output it is obvious that the database uses the indexes, but the indexed fields are all TEXT and the indexes are huge.

To fix it:

  • create table names
  • replace namew and nameb with namew_id and nameb_id, both referencing names
  • create table tournaments
  • replace tournament with tournament_id referencing tournaments
  • reindex black_index as (nameb_id, tournament_id, date)
  • reindex white_index as (namew_id, tournament_id, date)
  • drop w_b_t_d_index unless you use it in some other query
  • remove the useless coalesce from the count(*) query
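Taken together, the bullet points above could be sketched as the following migration (PostgreSQL; the exact layouts of the names and tournaments lookup tables are assumptions consistent with those steps):

```sql
-- Lookup tables: one row per distinct player name / tournament name
CREATE TABLE names (id SERIAL PRIMARY KEY, name TEXT UNIQUE NOT NULL);
INSERT INTO names (name)
SELECT namew FROM chess_data2
UNION
SELECT nameb FROM chess_data2;

CREATE TABLE tournaments (id SERIAL PRIMARY KEY, t_name TEXT UNIQUE NOT NULL);
INSERT INTO tournaments (t_name) SELECT DISTINCT tournament FROM chess_data2;

-- Integer foreign-key columns, filled with join-based updates
ALTER TABLE chess_data2
    ADD COLUMN namew_id INTEGER REFERENCES names (id),
    ADD COLUMN nameb_id INTEGER REFERENCES names (id),
    ADD COLUMN tournament_id INTEGER REFERENCES tournaments (id);

UPDATE chess_data2 c SET namew_id = n.id FROM names n WHERE n.name = c.namew;
UPDATE chess_data2 c SET nameb_id = n.id FROM names n WHERE n.name = c.nameb;
UPDATE chess_data2 c SET tournament_id = t.id FROM tournaments t WHERE t.t_name = c.tournament;

-- Rebuild the two useful indexes on the integer columns; drop the rest
DROP INDEX white_index, black_index, w_b_t_d_index;
CREATE INDEX white_index ON chess_data2 (namew_id, tournament_id, date);
CREATE INDEX black_index ON chess_data2 (nameb_id, tournament_id, date);

-- Once verified, the old TEXT columns can be dropped:
-- ALTER TABLE chess_data2 DROP COLUMN namew, DROP COLUMN nameb, DROP COLUMN tournament;
```

The join-based UPDATEs touch each table once rather than running one subquery per row, which matters at 3 million rows.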

Your query should then look like this:

SELECT
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE
            t1.namew_id = t2.namew_id AND
            t1.tournament_id = t2.tournament_id AND
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    )
    +
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE 
            t1.namew_id = t2.nameb_id AND
            t1.tournament_id = t2.tournament_id AND 
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    ) AS games_t_w
FROM chess_data2 t1