There are lots of posts and discussions around normalizing data. Most of the time I see people sticking pretty hard to normalization, but not always, and it seems to be case by case, so I'll describe our case. It does not seem complicated, but I feel like maybe I'm just missing something elegant. I'd love it if someone could either:
- give me or point me at a specific solution or type of solution, or
- support the de-normalized idea I'm considering.
The main thing we'll be doing is near-real-time searches, filtering the results character by character as the user enters text in a search field, so things need to be very responsive. But this runs on very low-powered hardware (think IoT). The searches need to return individual item names, bundle names, and the lists of individual items within the found bundles. The items and bundles have a many-to-many relationship, though the number of items in any bundle is limited, so there are bounds, for what that's worth.
[ items ]
[ bundles ]
[ items_x_bundles ]
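In SQLite terms, a minimal sketch of that schema (shown via Python's stdlib sqlite3 module; the column names are illustrative assumptions, not our actual ones):

```python
import sqlite3

# In-memory DB just for illustration; the real one lives on the device.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        item_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE bundles (
        bundle_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    -- Junction table for the many-to-many relationship.
    CREATE TABLE items_x_bundles (
        item_id   INTEGER NOT NULL REFERENCES items(item_id),
        bundle_id INTEGER NOT NULL REFERENCES bundles(bundle_id),
        PRIMARY KEY (item_id, bundle_id)
    );
""")
```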
Imagine different bundles of food in gift baskets, with typically no more than, say, 10 items in a given basket combination, but there is no absolute fixed limit. New bundles are created rarely, and never change.
Let's say there are various individual items, such as:
apple, orange, pear, banana, saltines, cheez-its, ritz,
potato chips, carrots, peas, beans, oreos, gummies,
hershey bars, coke, gatorade, milk, etc.
And bundles, such as:
special : [ apple, saltines, peas, gummies, coke ],
deluxe: [ pear, carrots, potato chips, oreos ],
fancy: [ orange, ritz, beans, gummies, milk ],
mondo: [ banana, pear, saltines, carrots, peas, oreos, coke, milk ]
A search for "delu" would return:
[ deluxe: [ pear, carrots, potato chips, oreos ] ]
A search for "appl" would return:
[ apple ]
[ special : [ apple, saltines, peas, gummies, coke ] ]
A search for "milk" would return:
[ milk ]
[ fancy: [ orange, ritz, beans, gummies, milk ] ]
[ mondo: [ banana, pear, saltines, carrots, peas, oreos, coke, milk ] ]
If we keep the data fully normalized, it's easy to find individual item names, but much more complex to return the list of individual items in every basket that contains the search string. Efficiency is important, because again, this will be running on low-powered IoT hardware. Using sqlite3, if that matters.
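For concreteness, here's my naive attempt at the bundle half of the normalized search (Python plus the stdlib sqlite3 module, with a tiny subset of the example data inlined; table and column names are illustrative). I assume there's a smarter formulation, which is part of what I'm asking:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bundles (bundle_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items_x_bundles (item_id INTEGER, bundle_id INTEGER);
    INSERT INTO items VALUES
        (1,'apple'),(2,'saltines'),(3,'peas'),(4,'gummies'),(5,'coke');
    INSERT INTO bundles VALUES (1,'special');
    INSERT INTO items_x_bundles VALUES (1,1),(2,1),(3,1),(4,1),(5,1);
""")

# Find bundles whose name, or any contained item's name, starts with the
# typed prefix -- then return EVERY item in each matching bundle, not just
# the item that matched.  (group_concat order is unspecified in SQLite.)
query = """
    SELECT b.name, group_concat(i.name, ' / ')
    FROM bundles b
    JOIN items_x_bundles x ON x.bundle_id = b.bundle_id
    JOIN items i           ON i.item_id = x.item_id
    WHERE b.name LIKE ?1 || '%'
       OR b.bundle_id IN (
            SELECT x2.bundle_id
            FROM items_x_bundles x2
            JOIN items i2 ON i2.item_id = x2.item_id
            WHERE i2.name LIKE ?1 || '%')
    GROUP BY b.bundle_id
"""
rows = conn.execute(query, ("appl",)).fetchall()
```

The individual-item half of the result is just `SELECT name FROM items WHERE name LIKE ? || '%'`, which is the easy part.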
A potential solution would be to add a text field to the Bundle table when creating bundles, holding a delimited list of the bundle's item names. For [special] it could look like:
"apple / saltines / peas / gummies / coke".
This makes everything much faster/easier to search at the expense of redundancy. It feels like a "hack" to me, but I'm not seeing an obvious elegant, efficient solution.
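A sketch of what I mean (the column name `items_list` is hypothetical; since bundles never change after creation, the field would be written exactly once):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical denormalized column `items_list` on the bundles table,
# populated once when the bundle is created.
conn.execute(
    "CREATE TABLE bundles (bundle_id INTEGER PRIMARY KEY, name TEXT, items_list TEXT)"
)
conn.execute(
    "INSERT INTO bundles (name, items_list) VALUES (?, ?)",
    ("special", "apple / saltines / peas / gummies / coke"),
)

# One LIKE over two columns replaces all the joins.
hits = conn.execute(
    "SELECT name, items_list FROM bundles "
    "WHERE name LIKE ?1 || '%' OR items_list LIKE '%' || ?1 || '%'",
    ("appl",),
).fetchall()
```

One thing I'd want to benchmark: the leading-wildcard LIKE on `items_list` can't use an ordinary index, so it's always a full scan of the bundles table (small here, but worth measuring).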
Edit: I'm compressing the five previous updates/iterations into just this one.
Perhaps I wasn't as clear above as I could have been, but the performance issue is inherent: low-powered, IoT-grade hardware and a user-facing real-time filter that requires searching the data with each character entered. We anticipate that no matter how we structure it, it will not be as fast as we would like, because any delay, even a fraction of a second, is directly noticeable to the user. I don't have hard numbers yet: running benchmarking simulations on a dev machine is fairly easy, but doing the same on the real hardware isn't, at least not yet. Does this mean that we'll need to de-normalize/optimize No Matter What? Perhaps, but I don't know that for a fact yet, hence the question here. Plus, I'm wondering if there are any glaring concerns with the particular de-normalization method we're considering (above).
I know how I'd query the de-normalized data, but I don't know how to structure a smart, reasonably optimized query on the normalized data. That could help guide us with the decision. So:
Question #1) what would a smart (fast) query on the normalized data look like, to achieve the results listed above?
Question #2) does anyone see any glaring problems with the de-normalization method I've described? Within the context described, does it make sense, and/or is there a different, better solution?
After a couple of passes, Bill Karwin's query below works, so that answers part one; thanks. Part 2 may end up in another question eventually.
If anyone is following along: the real-world performance differences between the query types varied so much (depending on the number of records) that frankly we need to dig in deeper. That it differs is no surprise, but the amount was staggering: anywhere from around 15x to over 35,000x, with not-unreasonable numbers of records. Even at 15x, which is probably closer to real-world, I'd say we're leaning toward de-normalizing, but this gave us a working normalized query to test with.