Acorn Acorn - 3 months ago 38
C++ Question

Using std::regex to filter input

I have an ugly mess of a string that is composed of several URIs.

:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/0_301_0.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02011.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02012.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02110000.svg


What I would like to do is strip out every occurrence of the characters :/., so that I have a single string that would be a valid filename.

I've written this simple regex in order to do just that:
[^(:/,.)]

It seems to be the correct regex, according to http://www.regexpal.com/.

However, when I run the following C++ code, I do not get back what I was expecting (just alphanumeric characters and underscores); I get back only the first alphanumeric character in the sequence: S.

What am I doing incorrectly with std::regex, or is my regex off?

#include <iostream>
#include <regex>
#include <string>

static const std::string filenames {R"(:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/0_301_0.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02011.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02012.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02110000.svg)"};
static const std::regex filename_extractor("[^(:/,.)]");

int main() {
    std::smatch filename_match;
    if (std::regex_search(filenames, filename_match, filename_extractor))
    {
        std::cout << "Number of filenames: " << filename_match.size() << std::endl;
        for (std::size_t i = 0; i < filename_match.size(); ++i)
        {
            std::cout << i << ": " << filename_match[i] << std::endl;
        }
    }

    return 0;
}

Answer

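std::smatch::size() returns the number of marked sub-expressions (capture groups, written with ( and )) plus one for the whole match. Your regex has no capture groups, because ( and ) inside a bracket expression [...] are literal characters, so size() is 1 and the loop only ever prints the single overall match.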

Solution

You need to call std::regex_search repeatedly, or use std::regex_iterator.

In addition, your regex matches only a single character. Append a + to match the longest possible run of such characters: [^(:/,.)]+.

Here is your code, incorporating the example from cppreference.com:

#include <iostream>
#include <iterator>
#include <regex>
#include <string>

static const std::string filenames {R"(:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/0_301_0.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02011.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02012.svg,:/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02110000.svg)"};
static const std::regex filename_extractor("[^(:/,.)]+");

int main() {
    auto files_begin = std::sregex_iterator(filenames.begin(), filenames.end(), filename_extractor);

    for (auto i = files_begin; i != std::sregex_iterator(); ++i) {
        std::string filename = i->str(); 
        std::cout << filename << '\n';
    }   

    return 0;
}

However, this also returns the intermediate "directories". If you use the regex [^(:,)]+ instead, you get the result I expect you wanted:

/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/0_301_0.svg
/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02011.svg
/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02012.svg
/SymbolStandards/JMSymbology/MIL_STD_2525D_Symbols/02110000.svg

Your problem explained

std::regex_search searches only for the first occurrence of the regular expression, and of any sub-expressions within it.

For example, the expression ab([cd])([ef]) will match the string xxabcfxxabdef. The first match is the part abcf, with c being the match for the first sub-expression [cd] and f being the match for the second sub-expression [ef].

The second match is the part abde (not abdef!), where d is the match for the first sub-expression and e is the match for the second.

With std::regex_search, you search for the first match, and the matcher returns the complete first match together with the matches for the sub-expressions. To find further matches, you have to restart the search on the rest of the string (std::smatch::suffix()).

In addition, the regex [ef] matches only a single character. [ef]+ would match the longest sequence of es and fs. Thus, with the expression ab([cd])([ef]+), the second sub-expression in the second match of the target string above would capture ef, and not just e.
