Jesse Jesse - 2 months ago 7
Javascript Question

Fastest way to find relevant results in array from an input array

As a mostly front-end developer, this is in the realm of computer science that I don't often delve into, but here's my scenario:

I've got an input of a string, split on spaces, say

"pinto beans"


I've got a array of results to search, that contains results like:
["beans, mung","beans, pinto","beans, yellow","beans, fava"]


what might be the quickest way (preferably in javascript or php) to find the most "relevant" results, aka most matches, for instance, in the above case, I would like to sort the return array so that
"beans, pinto"
is put at the top, and the rest come below, and any other results would go below those.

My first attempt at this would be to do something like matching each result item against each input item, and incrementing matches on each one, then sorting by most matches to least.

This approach would require me to iterate through the entire result array a ton of times though, and I feel that my lack of CS knowledge is leaving me without the best solution here.

/* EDIT: Here's how I ended up dealing with the problem: */

Based on crazedfred's suggestion and the blog post he mentioned (which was VERY helpful), I wrote some php that basically uses a combination of the trie method and the boyer-moore method, except searching from the beginning of the string (as I don't want to match "bean" in "superbean").

I chose php for the ranking based on the fact that I'm using js libraries, and getting real benchmarks while using convenience functions and library overhead wouldn't produce the testable results I'm after, and I can't guarantee that it won't explode in one browser or another.

Here's the test data:

Search String:
lima beans


Result array (from db):
["Beans, kidney","Beans, lima","Beans, navy","Beans, pinto","Beans, shellie","Beans, snap","Beans, mung","Beans, fava","Beans, adzuki","Beans, baked","Beans, black","Beans, black turtle soup","Beans, cranberry (roman)","Beans, french","Beans, great northern","Beans, pink","Beans, small white","Beans, yellow","Beans, white","Beans, chili","Beans, liquid from stewed kidney beans","Stew, pinto bean and hominy"]


First, I drop both the search string and the result array into php variables, after
explode()
ing the string on spaces.

then, I precompile my patterns to compare the results to:

$max = max(array_map('strlen',$input));
$reg = array();
for($m = 0; $m < $max; $m++) {
$reg[$m] = "";
for($ia = 0; $ia < count($input); $ia++) {
$reg[$m]. = $input[$ia][$m];
}
}


this gives me something like :
["lb","ie","ma","an","s"]


then, I basically take each result string (split on spaces), and match a case insensitive character class with the corresponding character number to it. If at any point during that comparison process I don't get any matches, I skip the word. This means if only 1 result starts with "b" or "l", I'll only run one comparison per WORD, which is really fast. Basically I'm taking the part of trie that compiles the searches together, and the constant speedup of the Boyer-Moore stuff.

Here's the php - I tried
while
s, but got SIGNIFICANTLY better results with
foreach
es:

$sort = array();
foreach($results as $result) {
$matches = 0;
$resultStrs = explode(' ', $result);
foreach($resultStrs as $r) {
$strlen = strlen($r);
for($p = 0; $p < $strlen; $p++) {
if($reg[$p])
preg_match('/^['.$reg[$p].']/i',$r[$p],$match);
if($match==true) {
$matches++;
} else {
break 2;
}
}
}
$sort[$result] = $matches;
}


That outputs an array with the results on the keys, and how many character matches we got in total on the values.

The reason I put it that way is to avoid key collisions that would ruin my data, and more importantly, so I can do a quick
asort
and get my results in order.

That order is in reverse, and on the keys, so after the above code block, I run:

asort($sort);
$sort = array_reverse(array_keys($sort));


That gives me a properly indexed array of results, sorted most to least relevant. I can now just drop that in my autocomplete box.

Because speed is the whole point of this experiment, here's my results - obviously, they depend partially on my computer.

2 input words, 40 results: ~5ms
2 input words, (one single character, one whole) 126 results: ~9ms

Obviously there's too many variables at stake for these results to mean much to YOU, but as an example, I think it's pretty impressive.

If anyone sees something wrong with the above example, or can think of a better way than that, I'd love to hear about it. The only thing I can think of maybe causing problems right now, is if I were to search for the term
lean bimas
, I would get the same result score as
lima beans
, because the pattern isn't conditional based on the previous matches. Because the results I'm looking for and the input strings I'm expecting shouldn't make this happen very often, I've decided to leave it how it is for now, to avoid adding any overhead to this quick little script. However, if I end up feeling like my results are being skewed by it, I'll come back here and post about how I sorted that part.

Answer

Since you specifically noted that it could be in several languages, I'll leave my answer in pseudocode so you can adapt to the language of your choice.

Since you are matching an array-to-array, performance is going to vary a lot based on your implementation, so trying several ways and considering exactly when/how/how-often this will be used in a good idea.

The simple way is to leave the data as-is and run an O(n^2) search:

for (every element in array X)
    for (every element in array Y)
        if (current X == current Y)
            add current X to results

return results

If you sorted the arrays first (a sorting algorithm such a squicksort is implemented for you in many languages, check your documentation!), then the actual matching is quicker. Use whatever string-comparison your language has:

Sort array X
Sort array Y

Let A = first element of X
Let B = first element of Y

while (A and B are in array)
    if (A > B)
        Next B
    else if (A < B)
        Next A
    else  //match!
        Add A to results
        Next A
        Next B

//handle case where one array is larger (trivial loop)

return results

Now the important part to the above solution is if the sorting of the arrays saved time versus just an ordinary O(n^2) sort. Usually, moving elements around in arrays is fast whereas string comparisons are not, so it may be worth it. Again, try both.

Finally, there's this crazy algorithm the Mailinator guy dreamed up for doing huge string comparisons in constant time using some awesome data structures. Never tried it myself, but it must work since he runs his whole site on very low-end hardware. He blogged about it here if you're interested. (Note: the blog post is about filtering spam, so some words in the post may be slightly NSFW.)