deadfish deadfish - 1 month ago 14
C# Question

What algorithms are used for seaching subwords in words?

I know, maybe this question is stupid but I am stuck and I need real help. I need in my project use algorithm that could find all words what starts by another and return all words' tails. For example:
Find all words that start by:

dad


In dictionary we have:

dada, dadaism, daddled, daddling


Result:

a, aism, dled, dling


I have dictionary with all words, so all what I need is only algorithm. Someone suggested me to use patricia algorithm but I couln't find any sample for C#. My dictionary is very big so I need find also very fast algorithm.




More information:
  • Dictionary is sorted.

  • Answer

    How you make this work will depend on how your dictionary is arranged. If it's a sorted list of words, then you can use binary search to find the first word that starts with "dad", and then loop through just those using StartsWith and Substring. That is:

    List<string> Words = LoadWords(); // however you load them
    Words.Sort();
    
    // Now, search for "dad" (or whatever)
    string prefix = "dad";
    
    int index = Words.BinarySearch(prefix);
    
    // If the returned index is negative, the word wasn't found.
    // The index is the one's compliment of the the place where it would be in the list.
    if (index < 0)
    {
        index = ~index;
    }
    
    for (int i = index; i < Count && Words[i].StartsWith(prefix))
    {
        Console.WriteLine(Words[i].Substring(prefix.Length));
    }
    

    This should be very fast. The sort is a one-time cost after loading. And you can eliminate it altogether if you store the dictionary in sorted order. The binary search is O(log n), where n is the number of words in the dictionary.

    If your dictionary is unordered, then you'll have to go through all the words, which is going to take a lot of time.

    There are other organizations for your dictionary, that will make it take a lot less space and that could potentially be faster. Those are somewhat more complicated and take a lot more time to build than creating a sorted list.