Caique Monteiro Araujo Caique Monteiro Araujo - 1 year ago 56
MySQL Question

What is a good pratice to store row counts in another table?

I’m designing a huge database, which in a few months will have a huge amount of data. The database will be projected in MySQL and all tables will use InnoDB engine. So, since now, I’ve been trying to develop the best model possible. So, let’s begin!

A simplified schema of tables can be seen below:

| users |
| PK user_id int unsigned auto_increment |
| (all other columns with user data) |

| products |
| PK product_id int unsigned auto_increment |
| (all other columns with product data) |

(table) user’s favorite products
| user_favproducts |
| PK FK user int unsigned (on delete cascade) |
| PK FK product int unsigned (on delete cascade) |

The user can only favorite a product once, and that’s why user and product are primary keys in user_favproducts. If user deletes account, all their favorite products will be also deleted; and if a product was deleted, all their favorites will be deleted.

Let’s suppose a scenario where user_favproducts has millions of rows with many users and many products. So, I’ve tried thinking about the best solution to get optimal performance: Imagine that every time a user opens a profile, the user will be shown a count of favorite products to the respective profile.

Alternative A

Simple: Get the count using SQL, as below:

SELECT count(product) WHERE user = 1;

For InnoDB engine, I’m not sure how it will work when a table has millions of rows or more.

Every time a profile is shown, the count of favorite products will also be shown. This means I would have to execute the SQL above every time. Thinking about performance, I don’t know if is the better way.

Alternative B

Store the count of favorite products in a metadata table, as below:

| user_metadata |
| PK FK user int unsigned |
| favproducts int unsigned |
| last_update timestamp on update current_timestamp |

To maintain this table updated, I considered three alternatives:

Alternative B.1

Schedule a cron job task that will run a script at periodic time. Then this script will update the count, as below:

* Get count
result = SELECT count(product) WHERE user = 1;
* Update count
UPDATE user_metadata SET favproducts = result WHERE user = 1;

Alternative B.2

Make a trigger where always, when a new insertion happens in user_favproducts, will update user_metadata by using an SQL, as below:

UPDATE user_metadata SET favproducts = favproducts + 1 WHERE user = 1;

Alternative B.3

Instead of using a trigger, use the same sql above, but this time it will be executed by a script, the same script that do a insertion.

Main question

My question is: Which approach from above will perform the best in counting how many user favorited products exist?

Answer Source

The best approach is a rather subjective matter.

Firstly, "huge" numbers for humans are not necessarily "huge" numbers for relational databases. Millions of rows are not a problem, as long as you can use indexes for searching. So, your simplified example SELECT count(product) WHERE user = 1; will almost certainly return in milliseconds as long as there is an index on the user column.

The alternative you're considering - adding the number of favourite products to the users table - is generally known as denormalization. There are many problems with denormalization - once you have two sources of truth for a fact, you have to make sure you keep them in sync, or you end up with hilarious bugs. Cron jobs mean that there will be time when the two sources of truth disagree; triggers are hard to maintain, and often introduce their own performance problems.

My general approach is a bit of effort, but if performance is you concern, it's the only way to stay sane...

Firstly, work out what your data size is going to be, create a test environment, create the most normalized schema on that environment, and populate it with test data (there are tools like DbMonster that help with this).

Once you have a test environment, figure out the queries your application will need to run, what the response time target is, and test the queries. You might use a unit testing framework for this.

Optimize those queries. Put LOTS of energy into that - it's the best time you can spend.

If you really cannot get your queries to fit in the target response times, see if you can fix the problem with hardware. Memory, RAM and SSDs are much cheaper than developer time.

If that doesn't work, consider denormalization. If you do this, write unit tests to protect yourself against bugs, or unexpected side-effects. For instance, your simplified trigger would slow down inserts into the user_favproducts table if accessing the user is slow.

Whatever you do, do not optimize for performance until you know, and can measure, that you have a problem. The maintainability impact can be severe, and it's much easier to make a slow, correct application faster than make a fast, incorrect application work properly.