Martin AJ Martin AJ - 5 months ago 6
SQL Question

How can I implement a viewed system for my website's posts?

Here is my current structure:

// posts
+----+--------+----------+-----------+------------+
| id | title | content | author_id | date_time |
+----+--------+----------+-----------+------------+
| 1 | title1 | content1 | 435 | 1468111492 |
| 2 | title2 | content2 | 657 | 1468113910 |
| 3 | title3 | content3 | 712 | 1468113791 |
+----+--------+----------+-----------+------------+

// viewed
+----+---------------+---------+------------+
| id | user_id_or_ip | post_id | date_tiem |
+----+---------------+---------+------------+
| 1 | 324 | 1 | 1468111493 |
| 2 | 546 | 3 | 1468111661 |
| 3 | 135.54.12.1 | 1 | 1468111691 |
| 5 | 75 | 1 | 1468112342 |
| 6 | 56.26.32.1 | 2 | 1468113190 |
| 7 | 56.26.32.1 | 3 | 1468113194 |
| 5 | 75 | 2 | 1468112612 |
+----+---------------+---------+------------+


Here is my query:

SELECT p.*,
(SELECT count(*) FROM viewed WHERE post_id = :id) AS total_viewed
FROM posts p
WHERE id = :id


Currently I've faced with a huge date for
viewed
table. Well what's wrong with my table structure (or database design)? In other word how can I improve it?

A website like stackoverflow has almost 12 million posts. Each post has (on average) 500 viewed. So the number of
viewed
's rows should be:

12000000 * 500 = 6,000,000,000 rows


Hah
:-)
.. Honestly I cannot even read that number (btw that number will grow up per sec). Well how stackoverflow handles the number of viewed for each post? Will it always calculate
count(*)
from
viewed
per post showing?

Answer

You are not likely to need partitioning, redis, nosql, etc, until you have many millions of rows. Meanwhile, let's see what we can do with what you do have.

Let's start by dissecting your query. I see WHERE id=... but no LIMIT or ORDER BY. Let's add to your table

INDEX(id, timestamp)

and use

WHERE id = :id
ORDER BY timestamp DESC
LIMIT 10

Any index is sorted by what is indexed. That is the 10 rows you are looking for are adjacent to each other. Even if the data is pushed out of cached, there will probably be only one block to provide those 10 rows.

But a "row" in a secondary index in InnoDB does not contain the data to satisfy SELECT *. The index "row" contains a pointer to the actual 'data' row. So, there will be 10 lookups to get them.

As for view count, let's implement that a different way:

CREATE TABLE ViewCounts (
    post_id ...,
    ct MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY post_id
) ENGINE=InnoDB;

Now, given a post_id, it is very efficient to drill down the BTree to find the count. JOINing this table to the other, we get the individual counts with another 10 lookups.

So, you say, "why not put them in the same table"? The reason is that ViewCounts is changing so frequently that those actions will clash with other activity on Postings. It is better to keep them separate.

Even though we hit a couple dozen blocks, that is not bad compared to scanning millions of rows. And, this kind of data is somewhat "cacheable". Recent postings are more frequently accessed. Popular users are more frequently accessed. So, 100GB of data can be adequately cached in 10GB of RAM. Scaling is all about "counting the disk hits".