Braydon Batungbacal Braydon Batungbacal - 1 month ago 14
C Question

Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurences in string 2?

I've run into an unusual challenge and so far I'm unable to determine the most efficient algorithm to attack this.




Given the following 2 strings as an example, find all commonly shared substrings between the 2 strings of any length, and count the number of occurences of all of those shared substrings in string 2. Your algorithm also needs to be able to compute shared substrings between files containing strings that are up to 100mb or more in size.

Example:

String 1: ABCDE512ABC361EG51D

String 2: ADE5AHDW4131EG1DG5C

Given these 2 strings this algorithm would find the following shared substrings:
A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

And then from these commonly shared substrings, we'd find how many occurences there are of each of them in string 2.

A: 2 occurrences in string 2

C: 1 occurence in string 2

D: 3 occurrences in string 2

etc..




The first approach I took to wrap my head around this problem was brute forcing my way through computing the common shared substrings using 2 nested for loops - obviously the least efficient but it was a quick and dirty way to get an idea of what the expected outputs should be with smaller test input and the slowest possible time to run, which was around 2 minutes to compute all common shared substrings between 2 files containing ascii strings with a size of 50kb. Upping the size to 1mb made this come to a screeching halt due to the massive number of total nested iterations that had to occur to compute this.

The next approach was using trees - seeing how much memory I could trade off to optimize compute time. This approach was much faster. The same two 50kb files that took 2 minute with the brute force method were near instant. Running against 1mb files was very fast too still (seconds) but as I continued to test with larger and larger file sizes, I quickly began running into memory issues due to tree sizes.




Note: The string files will only ever contain ASCII characters!

Answer

Here is some code illustrating the idea I presented in the comments above. Although it is runnable C++ code, it is more pseudo-code in the sense that the utilized data structures are surely not optimal but they allow a clear view on the algorithm.

struct Ocurrence
{
    //The vectors contain indices to the first character of the occurence in ...
    std::vector<size_t> s1;  // ... string 1 and ...
    std::vector<size_t> s2;  // ... string 2.
};

int main()
{
    //If you cannot load the entire strings in memory, a memory-mapped file might be
    //worth considering
    std::string s1 = "ABCDE512ABC361EG51D";
    std::string s2 = "ADE5AHDW4131EG1DG5C";

    //These vectors store the occurences of substrings for the current and next length
    std::vector<Ocurrence> occurences, nextOccurences;
    int length = 1;

    std::map<char, Ocurrence> occurenceMap;
    //Initialize occurences
    for (int i = 0; i < s1.length(); ++i)
        occurenceMap[s1[i]].s1.push_back(i);
    for (int i = 0; i < s2.length(); ++i)
        occurenceMap[s2[i]].s2.push_back(i);

    for (auto& pair : occurenceMap)
    {
        if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
            occurences.push_back(std::move(pair.second));
    }

    do
    {
        nextOccurences.clear();

        std::cout << "Length " << length << std::endl;
        for(auto& o : occurences)
        {
            std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                      << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;

            //Expand the occurence
            occurenceMap.clear();
            for (auto p : o.s1)
            {
                if (p + length < s1.length())
                    occurenceMap[s1[p + length]].s1.push_back(p);
            }                   
            for (auto p : o.s2)
            {
                if (p + length < s2.length())
                occurenceMap[s2[p + length]].s2.push_back(p);
            }
            for (auto& pair : occurenceMap)
            {
                if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                    nextOccurences.push_back(std::move(pair.second));
            }
        }

        ++length;
        std::swap(occurences, nextOccurences);

    } while (!occurences.empty());


    return 0;
}

Output:

Length 1
1 occurred 3 / 3 times.
3 occurred 1 / 1 times.
5 occurred 2 / 2 times.
A occurred 2 / 2 times.
C occurred 2 / 1 times.
D occurred 2 / 3 times.
E occurred 2 / 2 times.
G occurred 1 / 2 times.
Length 2
1D occurred 1 / 1 times.
1E occurred 1 / 1 times.
DE occurred 1 / 1 times.
E5 occurred 1 / 1 times.
EG occurred 1 / 1 times.
G5 occurred 1 / 1 times.
Length 3
1EG occurred 1 / 1 times.
DE5 occurred 1 / 1 times.

The most amount of memory will be used during initialization because there will be an entry for every character of both input strings. If you know the approximate length of the strings, you can choose a more appropriate index data type than size_t. The amount of memory needed is in the order of the input size. So two 100 MB files should be no problem for common computers. After the initialization (more specifically, after the first iteration of the loop), most of these data will be deleted because it is not needed any more.