RadioTransmission - 1 month ago
C++ Question

C++ std::regex: smatch retains subexpressions only once for each appearance in the pattern string

I have the following code:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    regex reg_expr("(\\([A-Z],[A-Z]\\))(?:\\s(\\([A-Z],[A-Z]\\)))*");
    //regex reg_expr("(\\([A-Z],[A-Z]\\))(?:\\s(\\([A-Z],[A-Z]\\)))*\\s(\\([A-Z],[A-Z]\\))");
    smatch sm;
    string input("(A,B) (C,D) (F,W) (G,K) (R,M)");
    //string input("(A,B) (C,D) (F,W)");
    if (regex_match(input, sm, reg_expr)) {
        cout << "true\n";
        cout << sm.size() << "\n";
        for (size_t i = 0; i < sm.size(); i++) {
            //if (sm[i].length())
            cout << "submatch number " << i << ": " << sm[i].str() << '\n';
        }
    } else
        cout << "false";
    return 0;
}


Everything works fine except that "smatch sm" has only one substring for each subexpression specified in the regular expression string.
For example, for the following test string:

"(A,B) (C,D) (F,W) (G,K) (R,M)",

which is correctly matched against the

"(\([A-Z],[A-Z]\))(?:\s(\([A-Z],[A-Z]\)))*"

regular expression, "sm" has only three substrings: one for the whole string, and the other two are "(A,B)" and "(R,M)". The "(C,D)", "(F,W)" and "(G,K)" substrings are missing, even though they are matched.
It looks like regex correctly understands that "(?:\s(\([A-Z],[A-Z]\)))*" should match 0 or more of the subexpressions, but only one of them ends up stored in the "std::smatch sm".
Is it a library error (which is less likely), or am I doing something wrong? Your help and advice are welcome!

Answer

It is not a bug. In almost every regex engine, a repeated capturing group stores only the last item it matched in its buffer. The exceptions are the PyPI regex module for Python, .NET, and Boost (if compiled with the appropriate options).

For more details, read the article "Repeating a Capturing Group vs. Capturing a Repeated Group".
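To see the behavior in isolation, here is a minimal sketch that runs the pattern from the question and returns every submatch that ends up in the std::smatch. The helper name submatches is made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns all submatches of a full match,
// or an empty vector if the input does not match the pattern.
std::vector<std::string> submatches(const std::string& input) {
    // Same pattern as in the question: group 2 sits inside a repeated group.
    static const std::regex re(R"((\([A-Z],[A-Z]\))(?:\s(\([A-Z],[A-Z]\)))*)");
    std::smatch sm;
    std::vector<std::string> out;
    if (std::regex_match(input, sm, re)) {
        for (const auto& m : sm) {
            out.push_back(m.str());  // group 2 holds only its last iteration
        }
    }
    return out;
}
```

For the input "(A,B) (C,D) (F,W) (G,K) (R,M)" this returns three elements: the whole string, "(A,B)" and "(R,M)". The intermediate pairs are matched but overwritten on each repetition of group 2.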

In your case, you may use a regular std::sregex_iterator:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Match a single "(X,Y)" pair; the iterator visits each occurrence in turn.
    std::regex reg_expr(R"(\([A-Z],[A-Z]\))");
    std::string input("(A,B) (C,D) (F,W) (G,K) (R,M)");
    for(std::sregex_iterator i = std::sregex_iterator(input.begin(), input.end(), reg_expr);
        i != std::sregex_iterator();
        ++i)
    {
        std::cout << (*i).str() << std::endl;
    }
    return 0;
}

See the C++ demo
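One thing to keep in mind is that the iterator on its own no longer checks that the whole string has the expected format. If that check matters, a possible approach is to validate with std::regex_match first and only then iterate. The following is a sketch under that assumption; the helper name extract_pairs is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: validates the overall format, then collects every pair.
// Returns an empty vector if the input is not a list of "(X,Y)" pairs
// separated by single whitespace characters.
std::vector<std::string> extract_pairs(const std::string& input) {
    static const std::regex whole(R"(\([A-Z],[A-Z]\)(?:\s\([A-Z],[A-Z]\))*)");
    static const std::regex pair(R"(\([A-Z],[A-Z]\))");
    std::vector<std::string> pairs;
    if (!std::regex_match(input, whole)) {
        return pairs;  // format check failed, nothing to extract
    }
    for (std::sregex_iterator it(input.begin(), input.end(), pair), end;
         it != end; ++it) {
        pairs.push_back(it->str());
    }
    return pairs;
}
```

With the input from the question this yields all five pairs, while an input such as "(A,B) junk (C,D)" yields an empty vector because the format check fails.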

Note that I am using a raw string literal, R"(...)", in which a single backslash is enough to escape a regex metacharacter.
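As a small illustration, the two literals below should hold the same pattern text. The variable names are chosen only for this example.

```cpp
#include <string>

// The same regex pattern written two ways.
// Ordinary literal: every backslash meant for the regex must be doubled.
const std::string escaped_pattern = "\\([A-Z],[A-Z]\\)";
// Raw literal: backslashes are passed through to the regex as written.
const std::string raw_pattern = R"(\([A-Z],[A-Z]\))";
```

Both strings contain the characters \([A-Z],[A-Z]\) and compare equal, so either one can be passed to the std::regex constructor.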
