Kohei Nozaki - 3 months ago
MySQL Question

How to select consistent data from multiple tables efficiently

I'm using MySQL 5.6. Let's say we have the following two tables:

[a class diagram showing the DataSet and DataEntry tables]


  - A DataSet has a huge amount of child DataEntry records: the number would be 10000 or 100000 or more.
  - A DataSet's md5sum and version get updated when its child records are inserted or deleted, in one transaction. The md5sum is calculated against all of its children.

In this situation, what's the most efficient way to fetch consistent data from those two tables?

If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERTs / UPDATEs:

  1. SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000

  2. SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000
    -- I think the result of this query may be inconsistent with the md5sum fetched by the former query

I think I can get consistent data with one query as follows:

SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000

But it produces a redundant result set in which the same md5sum and version values are duplicated 10000 or 100000 times, so I guess it's not efficient (EDIT: my concerns are high network bandwidth and memory consumption).

I think using a pessimistic read / write lock (SELECT ... LOCK IN SHARE MODE / SELECT ... FOR UPDATE) would be another option, but it seems overkill. Are there any other better approaches?


The join will ensure that the data returned is not affected by any updates that would have occurred between the two separate selects, since they are being executed as a single query.

When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
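For instance, I would imagine something like the following hypothetical AFTER INSERT trigger (the trigger name is made up for illustration; a similar one would exist for DELETE):

```sql
DELIMITER //
-- Hypothetical trigger: bump the parent's version whenever a child row
-- is inserted. The md5sum recalculation could happen here or elsewhere.
CREATE TRIGGER dataentry_after_insert
AFTER INSERT ON DataEntry
FOR EACH ROW
BEGIN
  UPDATE DataSet
  SET version = version + 1
  WHERE dataset_id = NEW.dataset_id;
END//
DELIMITER ;
```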

When you join the tables, you will get a "duplicate md5sum and version" because you are pulling the matching record for each item in the DataEntry table. It is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but depending upon the frequency of inserts/updates, without a transaction, you run the very slight risk of getting data that may be slightly off.
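If you do want the two individual selects, a middle ground (assuming the tables are InnoDB) is to issue both inside one transaction; under MySQL's default REPEATABLE READ isolation level, both reads then see the same snapshot without taking any row locks. A sketch:

```sql
-- Both SELECTs read from the same consistent snapshot (InnoDB,
-- REPEATABLE READ), so they are mutually consistent without locking.
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000;
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000;
COMMIT;
```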

I would just go with the join. You can run EXPLAIN on your queries from within MySQL, look at how each query is executed, and see any differences between the two approaches based upon your data, your indexes, etc.
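For example (the exact plan output depends on your schema and indexes):

```sql
-- Compare the execution plan of the join against the two single-table SELECTs.
EXPLAIN SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000;
```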

Perhaps it would be more beneficial to run these groups of records into a staging table of sorts. Before processing, you could call a pre-processor that takes a "snapshot" of the data about to be processed, putting a copy into a staging table. Then you could select the version and md5sum alone, and then all of the records, as two different selects. Since these are copied into a separate staging table, you won't have to worry about concurrent updates corrupting your processing session.

You could set up timed jobs to do this or have it as an on-demand call. Again though, this is something you would need to research given the hardware/network setup you are working with, and any job scheduling software available to you.
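A minimal sketch of the staging idea (the DataEntry_staging table and its column types are assumptions for illustration; adjust them to your real schema):

```sql
-- Hypothetical staging table: mirrors DataEntry plus the parent's
-- md5sum/version captured at snapshot time.
CREATE TABLE DataEntry_staging (
  dataentry_id INT NOT NULL,
  dataset_id   INT NOT NULL,
  content      TEXT,
  md5sum       CHAR(32),
  version      INT,
  PRIMARY KEY (dataentry_id)
);

-- "Snapshot" step: a single INSERT ... SELECT copies a consistent view,
-- since it executes as one statement.
INSERT INTO DataEntry_staging
SELECT e.dataentry_id, e.dataset_id, e.content, s.md5sum, s.version
FROM DataSet s
JOIN DataEntry e ON s.dataset_id = e.dataset_id
WHERE s.dataset_id = 1000;

-- Processing then reads from the staging copy, unaffected by
-- concurrent updates to the live tables.
SELECT DISTINCT md5sum, version FROM DataEntry_staging WHERE dataset_id = 1000;
SELECT dataentry_id, content FROM DataEntry_staging WHERE dataset_id = 1000;
```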