I was parsing a Stack Overflow dump and came upon this seemingly innocent question with a small, almost invisible detail: it has 22311 spaces at the end of its text.
I'm using std::regex (somehow it works better for me than boost::regex) to replace every run of consecutive whitespace with a single space, like this:
std::regex space_regex("\\s+", std::regex::optimize);
...
std::regex_replace(out, in, in + strlen(in), space_regex, " ");
#include <regex>
...
std::regex r("\\s+", std::regex::optimize);
const char* bomb2 = "Small text\n\nwith several\n\nlines.";
std::string test(bomb2);
for (auto i = 0; i < N; ++i) test += " ";
std::string out = std::regex_replace(test.c_str(), r, " ");
std::cout << out << std::endl;
$ g++ -O3 -std=c++14 regex-test.cpp -o regex-test.out
N
$ g++ -O0 -std=c++14 regex-test.cpp -o regex-test.out
-O0
-O3
-O2
std::__detail::_Executor<char*, std::allocator<std::__cxx11::sub_match<char*> >, std::__cxx11::regex_traits<char>, true>::_M_dfs
Yes, this is a bug.
cout << '"' << regex_replace("Small text\n\nwith several\n\nlines." + string(22311, ' '), regex("\\s+", regex::optimize), " ") << '"' << endl;
But this is just a bug in libstdc++, so feel free to report it here: https://gcc.gnu.org/bugzilla/buglist.cgi?product=gcc&component=libstdc%2B%2B&resolution=---
If you're asking for a new regex that works: I've tried a handful of different versions, and all of them fail on libstdc++. So if you want to use a regex to solve this, you'll need to compile against libc++.
But honestly, if you're using a regex to strip duplicate whitespace, "Now you have two problems".
A better solution could use adjacent_find, which runs fine with libstdc++ as well:
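// Cast to unsigned char first: calling isspace on a negative char is undefined behavior.
const auto func = [](const char a, const char b){ return isspace(static_cast<unsigned char>(a)) && isspace(static_cast<unsigned char>(b)); };

for(auto it = adjacent_find(begin(test), end(test), func); it != end(test); it = adjacent_find(it, end(test), func)) {
    *it = ' '; // the first character of the run becomes a single space
    // erase the rest of the run; erase returns an iterator to the character that followed it
    it = test.erase(next(it), find_if_not(next(it), end(test), [](const char c) { return isspace(static_cast<unsigned char>(c)); }));
}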
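For reuse, the loop above can be wrapped in a function. This is only a minimal sketch: the name collapse_whitespace is hypothetical, and it assumes the default "C" locale for isspace. Note that adjacent_find only reacts to runs of two or more whitespace characters, so a lone '\n' or '\t' is left unchanged, unlike with \s+; for the test string in the question that makes no difference, because every newline there is doubled.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper name; wraps the adjacent_find loop from the answer.
std::string collapse_whitespace(std::string text) {
    // Cast to unsigned char: passing a negative char to std::isspace is undefined.
    const auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    const auto both_space = [&](char a, char b) { return is_space(a) && is_space(b); };

    auto it = std::adjacent_find(text.begin(), text.end(), both_space);
    while (it != text.end()) {
        *it = ' ';  // the run collapses to a single space
        // Erase the rest of the run; erase returns an iterator to the
        // first character after the erased range (or end()).
        it = text.erase(std::next(it),
                        std::find_if_not(std::next(it), text.end(), is_space));
        it = std::adjacent_find(it, text.end(), both_space);
    }
    return text;
}
```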
This will return the same thing your regex would:
"Small text with several lines. "
But if you're going for simplicity, you could also use unique:
test.resize(distance(test.begin(), unique(test.begin(), test.end(), [](const char a, const char b) { return isspace(static_cast<unsigned char>(a)) && isspace(static_cast<unsigned char>(b)); })));
Which will return:
"Small text
with several
lines. "
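The newlines survive here because unique keeps the first character of each whitespace run, whatever it is. If a single space per run is wanted instead, one option is to normalize every whitespace character to ' ' before calling unique. The following is only a sketch of that idea: the name squeeze_whitespace is hypothetical, and it assumes the default "C" locale for isspace.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper name; not part of the original answer.
std::string squeeze_whitespace(std::string text) {
    // Cast to unsigned char: passing a negative char to std::isspace is undefined.
    const auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    // Step 1: turn every whitespace character into a plain space.
    std::replace_if(text.begin(), text.end(), is_space, ' ');
    // Step 2: keep only the first space of each run of spaces.
    text.erase(std::unique(text.begin(), text.end(),
                           [](char a, char b) { return a == ' ' && b == ' '; }),
               text.end());
    return text;
}
```

With the test string from the question this should give "Small text with several lines. ", and unlike the adjacent_find loop it also turns a lone '\n' or '\t' into a space.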