peter.rando peter.rando - 1 year ago 57
SQL Question

Worth de-normalizing our data?

There are lots of posts and discussions around normalizing data. Most of the time I see people sticking pretty hard to normalization, but not always, and it seems to be case-by-case, so I'll describe our case. It does not seem complicated, but I feel like maybe I'm just missing something elegant. I'd love if someone could either:

  • give me or point me at a specific solution or type of solution, or

  • support the de-normalized idea I'm considering.

The main thing is that we'll be doing is near-real time searches, filtering the results character-by-character as users enters text in a search field, so things need to be very responsive. But very low-powered hardware - think IoT. The searches need to return individual item names, bundle names, and the lists of individual items within the found bundles. The items and bundles have a many-to-many relationship, though the number of items in any bundle are limited, so there are bounds, for what that's worth.

Ex DB:
[ items ]
int: item_id
string: name
[ bundles ]
int: bundle_id
string: bundle_name
[ items_x_bundles ]
int: item_id
int: bundle_id

Imagine different bundles of food in gift baskets, with typically no more than, say, 10 items in a given basket combination, but there is no absolute fixed limit. New bundles are created rarely, and never change.

Lets say there are various individual items, such as:

apple, orange, pear, banana, saltines, cheez-its, ritz,
potato chips, carrots, peas, beans, oreos, gummies,
hershey bars, coke, gatorade, milk, etc.

And bundles, such as:

special : [ apple, saltines, peas, gummies, coke ],
deluxe: [ pear, carrots, potato chips, oreos ],
fancy: [ orange, ritz, beans, gummies, milk ],
mondo: [ banana, pear, saltines, carrots, peas, oreos, coke, milk ]

A search for "delu" would return:

[ deluxe: [ pear, carrots, potato chips, oreos ]

A search for "appl" would return:

[ apple ]
[ special : [ apple, saltines, peas, gummies, coke ] ]

A search for "milk" would return:

[ milk ]
[ fancy: [ orange, ritz, beans, gummies, milk ]
[ mondo: [banana, pear, saltines, carrots, peas, oreos, coke, milk ]

If we keep the data fully normalized, it's easy to find individual item names, but much more complex to return the list of individual items in every basket that contains the search string. Efficiency is important, because again, this will be running on low-powered IoT hardware. Using sqlite3, if that matters.

A potential solution would be to add a field to the Bundle table when creating bundles. Something like:

string: bundle_items

Which for [special] could look like:

"apple / saltines / peas / gummies / coke".

This makes everything much faster/easier to search at the expense of redundancy. It feels like a "hack" to me, but I'm not seeing an obvious elegant, efficient solution.


I'm compressing the 5 updates/iterations into just this one.

Perhaps I wasn't as clear above as I could have been, but the performance issue is inherent. Low powered IoT-grade hardware, and a user-facing real-time filter that requires searching the data with each character entered. We anticipate that no matter how we structure it, it will not be as fast as we would like, because any delays will be directly noticeable to the user, even fractions of a second. I don't have hard numbers because while performing benchmarking simulations on a dev machine is fairly easy, not so much the case on the real hardware yet. Does this mean that we'll need to de-normalize/optimize No Matter What? Perhaps, but I don't really know this for a fact yet, hence the question here. Plus, I'm wondering if there are any glaring concerns with the particular de-normalization method we're considering (above).

I know how I'd query the de-normalized data, but I don't know how to structure a smart, reasonably optimized query on the normalized data. That could help guide us with the decision. So:

Question #1) what would a smart (fast) query on the normalized data look like, to achieve the results listed above?

Question #2) does anyone see any glaring problems with the de-normalization method I've described. Within the context described, does it make sense and/or is there a different, better solution?

After a couple passes, Bill Karwin's query below works, so that answers part one, thanks. Part 2 may end up in another question eventually.

If anyone is following along, the real-world percentage differences on the different types of queries varied so much (depending on the number of records) that frankly we need to dig in deeper. That it differs is no surprise, but the amount was staggering. Varied from around 15x to over 35,000x, with not unreasonable numbers of records. Even at 15x, which is probably closer to real-world, I’d say we’re leaning toward de-normalizing, but this gave a working normalized query to test with.

Answer Source

If you keep the data in normalized tables, you can do a query like:

After a couple of edits and testing this query (SQLFiddle):

SELECT CONCAT(b1.bundle_name, ' : ', GROUP_CONCAT(
FROM bundles b1 
JOIN items_x_bundles bi1 USING (bundle_id)
JOIN items i1 USING (item_id)
WHERE b1.bundle_name LIKE CONCAT('milk', '%')
GROUP BY b1.bundle_id
SELECT CONCAT(b2.bundle_name, ' : ', GROUP_CONCAT(
FROM bundles b2
JOIN items_x_bundles bi2 ON (b2.bundle_id=bi2.bundle_id)
JOIN items i2 ON (bi2.item_id=i2.item_id)
JOIN items_x_bundles bi2b ON (b2.bundle_id=bi2b.bundle_id)
JOIN items i2b ON (bi2b.item_id=i2b.item_id)
WHERE LIKE CONCAT('milk', '%')
GROUP BY b2.bundle_id
FROM items i3
WHERE LIKE CONCAT('milk', '%')

The ? placeholder is where you bind your search word. Yeah, you'll have to bind it three times.

Put indexes on items(name), bundles(bundle_name), items_x_bundles(item_id,bundle_id) and items_x_bundles(bundle_id,item_id).

Then use EXPLAIN to confirm that the query uses indexes effectively.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download