Mikk3lRo Mikk3lRo - 3 years ago 135
Javascript Question

How to match and highlight all terms in any order from an array of strings?

The requirements are:


  • Find strings from an array (from here on called options) that include ALL the terms in ANY order

  • Correctly highlight the matching terms - ie. insert a string before and after each matching term - I'm using
    <u>
    and
    </u>
    here

  • Both the search query and options can be "anything"



For the sake of simplicity answers can concentrate on case insensitive search through a list including only ascii characters and assume that the term delimiter is a plain space - ie. a query entered as "Foo bar baz" means the search terms are
['foo', 'bar', 'baz']
.

To clarify:


  • All terms must exist separately from each other in matched options - ie. shorter terms should not exist only as substrings of longer terms and no two terms should overlap

  • Repeated search terms must exist in the option at least as many times as in the query



The final application is (perhaps not surprisingly) an autocomplete of sorts.


TL;DR



Most recent fiddle comparing the proposed algorithms side by side:

https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/

(feel free to update this link if you add a new algorithm)

jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:

https://jsperf.com/comparison-of-algorithms-to-search-and-highlight

At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. It's significance has been minimized as much as possible in the fiddle.

There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperforms the others consistently in any realistic scenario I try though.

Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.

Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.

Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.





Viral Permutations



In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms separated by a capturing group to catch anything between terms - a search for "foo bar" results in
(foo)(.*?)(bar)|(bar)(.*?)(foo)
.

The highlighting is then done in one pass with
string.replace()
. If there is any change in the string we have a match.

Demo:



var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';

t0 = performance.now();

//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';

//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
});

//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});

//Find all permutations
permuted = permutate_array(terms_esc);

//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}

try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';

regex = new RegExp(regex_string, 'i');


//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];

//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);

//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}

//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}

//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';

//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);

//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];

//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;

//Lowercase the match
m_cased = m.toLowerCase();

//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}

//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();

<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>





Thanks to trincot for pointing out that my original version would occasionally highlight a recurring term twice - which is fixed in this snippet.

Fails because:


  • The regex grows exponentially as terms are added. At 7 terms (even single letters) it is past 250kb and my browser gives up with
    Error: regexp too big
    ...






A few other noteworthy strategies that didn't work:



Capturing groups with all terms in each group:



(foo|bar)(.*)(foo|bar)


Fails because:


  • Will match options that include repeated terms - fx.
    The food in the food court
    would match though it obviously shouldn't.

  • If we "double-check" that all terms were in fact found it will fail to find valid matches - fx.
    The food in the food bar
    would find
    foo
    twice and never get to
    bar
    .



Negative lookaheads and backreferences:



(foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))


Fails because:


  • Whenever terms are repeated in the query it will reach impossible conditions like "Find me a
    foo
    ,
    bar
    or
    bar
    which is not a
    foo
    nor a
    bar
    ".

  • I'm fairly certain it has other issues, but I stopped pursuing it when I realized it was logically flawed.



Positive lookaheads



(?=.*foo)(?=.*bar)(?=.*baz)


Fails because:


  • It is very difficult (if not impossible) to reliably highlight the found matches.

  • I haven't found any way to ensure all terms actually exist - ie. they may all be individually present in the option, but shorter terms may only exist as substrings of longer terms - or terms may overlap.


Answer Source

Divide and Conquer

Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.

Each time a match is found it divides that "bite" into three (unless at the beginning or end), marking the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".

When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").

If it comes back to the longest term and cannot find another position to match it at, it will return false.

This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.

Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.

Demo:

Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice() that could be avoided completely.

function divide_and_conquer_replace(query, options, separator) {
    var terms, terms_esc;

    //The inner replacement function
    function divide_and_conquer_inner(bites, depth) {
        var this_term, i, bite, match, new_bites, found_all_others;

        depth = depth ? depth : 1;

        //Get the longest remaining term
        this_term = terms_esc[terms_esc.length - depth];

        //Loop all the bites
        for (i = 0; i < bites.length; i++) {
            bite = bites[i];

            //Reset the lastIndex since we're reusing the RegExp objects
            this_term.lastIndex = 0;

            //Check that we have a string (ie. do not attempt to match bites
            //that are already consumed)
            if (typeof bite === 'string') {
                //Find the next matching position (if any)
                while (match = this_term.exec(bite)) {
                    new_bites = (i > 0) ? bites.slice(0, i) : [];
                    if (match.index > 0) {
                        new_bites.push(bite.slice(0, match.index));
                    }
                    new_bites.push(['<u>' + match[0] + '</u>']);
                    if (this_term.lastIndex < bite.length) {
                        new_bites.push(bite.slice(this_term.lastIndex));
                    }
                    if (i < bites.length - 1) {
                        new_bites = new_bites.concat(bites.slice(i + 1));
                    }

                    if (terms_esc.length > depth) {
                        //Attempt to find all other terms
                        found_all_others = divide_and_conquer_inner(new_bites, depth + 1);

                        //If we found all terms we'll pass the modified string all the
                        //way up to the original callee
                        if (found_all_others) {
                            return found_all_others;
                        }
                        //Otherwise try to match current term somewhere else
                        this_term.lastIndex = match.index + 1;
                    } else {
                        //If no terms remain we have a match
                        return new_bites.join('');
                    }
                }
            }
        }
        //If we reach this point at least one term was not found
        return null;
    };

    // Split query in terms at delimiter
    terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;


    //Sort terms according to length - longest term last
    terms.sort(function(a, b) {
        return a.length - b.length;
    });

    //Escape terms
    //And store RegExp's instead of strings
    terms_esc = terms.map(function (term) {
        return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
    }).map(function (term) {
        return new RegExp(term, 'gi');
    });

    //Loop through each option
    return options.map(function(option){
        return divide_and_conquer_inner([option]);
    }).filter(Boolean);
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';

function divide_and_conquer(){
    var query = document.getElementById('divide_and_conquer').value;
    var res_elm = document.getElementById('divide_and_conquer_result');
    var t0 = performance.now();
    var results = divide_and_conquer_replace(query, options, separator);
    var t1 = performance.now();
    document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
    res_elm.innerHTML = '';
    for (var result of results) {
        res_elm.innerHTML += '<li>' + result + '</li>';
    }
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>

This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all / most present in many of the options - such as a a a a a a a a...

In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download