CodeMonkey CodeMonkey - 27 days ago 6
C++ Question

C++ Fastest text line comparisons with n-Gram? Use strings, char*, vectors?

so I'm currently writing something which would compare columns in a data table. So I basically need Rows, Columns, and all are text read in from a .csv file.

The issue currently is how long it takes to process. I have duplicate code in C# which uses a 'DataTable' with Rows and Columns which takes around 13 seconds to process 1500 lines of text.
My c++ program uses 'vector < vector >' as the table, and processes only 175 lines of text in this amount of time.
Algorithms are the same from what I can see, but I'm guessing it's something related to the data types I"m using in c++ that makes this slower.

Does anyone have ideas on the cause?

//C++
void CheckColumn(int colNum)
{
for (int i = 1; i < RowCount; i++)
{
for (int j = i + 1; j < RowCount; j++)
{
nGram(Data[colNum][i], Data[colNum][j], 3);
}
}
}
double nGram(string one, string two, int count)
{
//Change to uppercase
transform(one.begin(), one.end(), one.begin(), toupper);
transform(two.begin(), two.end(), two.begin(), toupper);

//Set the first string to be the shorter one
if (one.size() > two.size())
{
string temp = one;
one = two;
two = temp;
}

//If nGram number is larger than the shortest string, set the nGram number to that length
if (one.size() < count)
{
count = one.size();
}

//Add matches
double weight = 0;
double possibleMatches = (2 * two.size() - 2 * count + 2) / 2;
for (int i = 0; i < (one.size() - count + 1); i++)
{
for (int j = 0; j < (two.size() - count + 1); j++)
{
if (one.substr(i, count) == two.substr(j, count))
{
weight += 1;
break;
}
}
}

//Check for indivisible situations and otherwise calculate the weight
if ((possibleMatches == 0) || (weight == 0))
{
weight = 0;
}
else
{
weight = weight / possibleMatches;
}

return weight;
}


//C#
void CompareColumn(int colNum)
{
for(int i = 0; i < table.Rows.Count; i++)
{
for (int j = i + 1; j < table.Rows.Count; j++)
{
StringFunctions.nGram(table.Rows[i][colNum].ToString(), table.Rows[j][colNum].ToString(), 3);
}
}
}
public double nGram(string one, string two, int count)
{
//Change to uppercase
one = one.ToUpper();
two = two.ToUpper();

//Set the first string to be the shorter one
if (one.Length > two.Length)
{
string temp = one;
one = two;
two = temp;
}

//If nGram number is larger than the shortest string, set the nGram number to that length
if (one.Length < count)
{
count = one.Length;
}

//Add matches
double doubleMatch = 0;
double possibleMatches = (2 * two.Length - 2 * count + 2)/2;
for (int i = 0; i < (one.Length - count + 1); i++)
{
for (int j = 0; j < (two.Length - count + 1); j++)
{
if (one.Substring(i, count) == two.Substring(j, count))
{
doubleMatch += 1;
break;
}
}
}

//Check for indivisible situations and otherwise calculate the weight
if ((possibleMatches == 0) || (doubleMatch == 0))
{
doubleMatch = 0;
}
else
{
doubleMatch = doubleMatch / possibleMatches;
}
return doubleMatch;
}

Answer Source

double nGram(string one, string two, int count)

Without diving into your code too much, the method above is one of the first issues I noticed. In C++ you can pass a value by copy, reference, or pointer. The method you're using is by copy and creates overhead when you have a method being called a lot.

To illistrate take the following class.

class CopyMe
{

public:
    CopyMe();
    CopyMe(const CopyMe&m);
    ~CopyMe();
};


CopyMe::CopyMe()
{
    std::cout << "Created! Instance @ Location: " << this << std::endl;
}

CopyMe::CopyMe(const CopyMe& m)
{
    std::cout << "Copy! Instance @ Location: " << &m << " New Location: " << this << std::endl;
}

CopyMe::~CopyMe()
{
    std::cout << "Deleted! Instance @ Location: " << this << std::endl;
}

and the following short main.

void SayHelloToMyLittleFriendViaCopy(CopyMe aCopy)
{
    std::cout << "Inside " << __PRETTY_FUNCTION__ << std::endl;
}

void SayHelloToMyLittleFriendViaReference(CopyMe& aReference)
{
    std::cout << "Inside " << __PRETTY_FUNCTION__ << std::endl;
}

int main(int argc, const char * argv[])
{
    CopyMe me;
    SayHelloToMyLittleFriendViaCopy(me);

    SayHelloToMyLittleFriendViaReference(me);
    return 0;
}

If you run the code above you will get the following output. Note Your memory locations will be different!

Created! Instance @ Location: 0x7fff5fbff7d8
Copy! Instance @ Location: 0x7fff5fbff7d8 New Location: 0x7fff5fbff7d0
Inside void SayHelloToMyLittleFriendViaCopy(CopyMe)
Deleted! Instance @ Location: 0x7fff5fbff7d0
Inside void SayHelloToMyLittleFriendViaReference(CopyMe &)
Deleted! Instance @ Location: 0x7fff5fbff7d8

If you compare the two calls you will notice that the call to SayHelloToMyLittleFriendViaCopy has overhead in that it creates a copy, runs the method, and then deletes the temporary object it created. In contrast the call to SayHelloToMyLittleFriendViaReference incurs no such cost in that the function is called an no temporary copy is created. The above may be confusing in that the object destructor is called at the end of the example. however, that is not caused by the function, but the fact that we are exerting the main method and our initial object is destroyed.

If this is the first time you have heard of pass by reference you can think of it as passing pointer. However, instead of you having to dereference the object, the compiler adds the sugar for you.

However, if you pass by reference there is the possibility that the method could changes the state of the object. If that is something you want to avoid you can still pass by reference, but with a const reference. If that is what you're after then your signature should be

double nGram(const string& one,const string& two, int count)