Misha Moroshko Misha Moroshko - 1 month ago 11
Javascript Question

What's the most appropriate data structure for autosuggest?

I'd like to implement an autosuggest component. For every user input, the component should provide zero or more suggestions.

For example, if user types

'park'
, suggestions could be:
['Parkville', 'Parkwood', 'Bell Park']
.

The requirements are simple:


  • It should be case insensitive (user should get the same suggestions for
    'park'
    ,
    'PARK'
    , and
    'PaRk'
    )

  • It should match at the beginning of every word (
    'pa'
    would match
    'Parkville'
    ,
    'Bell Park'
    , and
    'Very cool park'
    , but not
    'Carpark'
    )



What data structure would you choose to implement this in Javascript?

Are there any Javascript/Node.js libraries that can help?

Answer

I think the best data structure for such task is a trie. About case insetivity - just lowercase each word before adding to the trie and perform searching on lower cased words.

When you reach some point of the trie, there are number of children nodes which are satisfying strings - string with have the prefix from root to current point.

Output suggestions - recursively walk from current point(which is reached from root to user typed prefix) and print suggestion on nodes which marked as leafs. Stop printing after ~10 outputs because trie may have many satisfying words.

Here is js implementations: trie-js, trie and many others. Just search js+trie. May be trie+autosuggest+js will also work)

UPDATE

If you want to output all variants in O(1) (O(1) for each suggestion, of course), without recursive walk, you may store arraylist of references in each node. Arraylist stores indices of all words which belongs to node and each value is an index in other dictionary araylist.

Something like that:

Add word to dict:

  1. Check in O(word_len) is it in a trie (already added). If not, add to dictionary and remember index in "storage"

     if(!inTrie(word)){
        dict.push(word);
        index = dict.length-1; //zero based indices
        // now adding to trie
        for each node visited while addition to trie : node.references.push(index)
     }
    
  2. Search:

    Go to node with prefix==typed word;
    for(int i=0;i<min(curNode.references.length(),10);i++)
    print(dict[curNode.references[i]];
    

UPDATE

About 'pa' --> 'Very cool park'

You definetily should split phrase to separate words in order each word be "searchable" in a trie. BUT! As you treat phrase as one word, you should store it one cell.

I mean:

String phrase = "Very cool parl";
dict.push(phrase);
index = dict.length-1;

parts[] = split(phrase);
for part in parts{
 add part - for each node visited while addition of part perform node.references.push(index);
}

In other words, code of phrase the same with code of single word. And reference is the same, because we store all part together in one cell as a phrase. Difference is in splitting and adding pharse by parts. Simple, you see.

UPDATE

BTW, references storing is not so "expensive" in memory consumption - each letter in word is some node in a trie which means 1 entry in some arraylist(one integer index of this word in a global storage array). So, you have to only extra O(dictionary_length) memory, which is ~ 50000*20 = 1 000 000 integers ~ 4 MB, assuming that each word has at most 20 letters. So, upper bound of needed memory is 4 MB.

UPDATE

About 'e e' --> East Eagle.

OK, before posting any idea I would like to warn that this is very strange autosuggest behaviour and usually autosuggest matches one prefix but not all prefixes.

There is very simple idea which will increase searhcing complexity for such several prefixes parallel search for some delta where delta complexity = complexity of finding sets intersection.

  1. Now global storage contains not just indices but pairs <a,b> where a = index in storage, b = index in pharse. For simple word b=0 or -1 or any special value.
  2. Now each trie node reference arraylist contains pairs. When user types prefixes phrase, for example "ea ri", you shoud find 'ea' node as usual, iterate over refernces but take to account only those entries, where a=any, b=1, because ea index in typed phrase = 1. Put all such a indices where b=1 to some set. Find ri node as usual, iterate over references and put those a indices to other set where b=2 and so on. Find intersection of indices sets. Ouput storage words by index, where index belongs to intersection of sets.

When you searching not phrase but simple word you iterate over all reference items.