TheJr TheJr - 4 months ago 12
C++ Question

Most efficient way to remove punctuation marks from string in c++

I'm trying to find the most efficient way to remove punctuation marks from a string in c++, this is what I currently have.

#include <iostream>
#include <string>
#include <fstream>
#include <iomanip>
#include <stdlib.h>
#include <algorithm>

using namespace std;

void PARSE(string a);

int main()
{
string f;
PARSE(f);
cout << f;
}

void PARSE(string a)
{
a = "aBc!d:f'a";

a.erase(remove_if(a.begin(), a.end(), ispunct), a.end());

cout << a << endl;
}


Is there a easier/more efficient way to do this?

I was thinking using str.len, get the length of the string, run it through a for loop and check ispunct then remove if it is.

Answer

No string copies. No heap allocation. No heap deallocation.

void strip_punct(string& inp)
{
    auto to = begin(inp);   
    for (auto from : inp)
        if (!ispunct(from)) 
            *to++ = from;
    inp.resize(distance(begin(inp), to));
}

Comparing to:

void strip_punct_re(string& inp)
{
    inp.erase(remove_if(begin(inp), end(inp), ispunct), end(inp));
}

I created a variety of workloads. As a baseline input, I created a string containing all char values between 32 and 127. I appended this string num-times to create my test string. I called both strip_punct and strip_punct_re with a copy of the test string iters-times. I performed these workloads 10 times timing each test. I averaged the timings after dropping the lowest and highest results. I tested using release builds (optimized) from VS2015 on Windows 10 on a Microsoft Surface Book 4 (Skylake). I SetPriorityClass() for the process to HIGH_PRIORITY_CLASS and timed the results using QueryPerformanceFrequency/QueryPerformanceCounter. All timings were performed without a debugger attached.

 num        iters      seconds      seconds (re)    improvement
10000        1000        2.812        2.947             4.78%  
 1000       10000        2.786        2.977             6.85%
  100      100000        2.809        2.952             5.09%

By varying num and iters while keeping the number of processed bytes the same, I was able to see that the cost is primarily influenced by the number of bytes processed rather than per-call overhead. Reading the disassembly confirmed this.

So this version, is ~5% faster and generates 30% of the code.