Thalhammer Thalhammer - 1 month ago 8
C++ Question

Validation for error

While developing my personal library, I stumbled upon what I think is an error inside libstdc++6.

Because I'm quite sure this library has been reviewed by many people far more skilled than I am, I came here to validate my finding and get advice on further steps.

Consider the following code:

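#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string uri = "http://example.com/test.html";
    std::regex reg(...); // the pattern is given further down
    std::smatch match;
    std::regex_match(uri, match, reg);
    // Print the full match followed by each capture group.
    for (auto& e : match)
    {
        std::cout << e.str() << std::endl;
    }
}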


I have written a regex to parse a URL into:


  • Protocol

  • User/Pass (optional)

  • Host

  • Port (optional)

  • Path (optional)

  • Query (optional)

  • Location (optional)



I used the following regex (in C++):

std::regex reg("^(.+):\\/\\/(.+@)?([a-zA-Z\\.\\-0-9]+)(:\\d{1,5})?([^?\\n\\#]*)(\\?[^#\\n]*)?(\\#.*)?$");


This worked fine in an online tester and with MSVC++ 2015 Update 3, but it fails on my build host, where the host group matches both the host and the path.

Build host:


g++ (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609

libstdc++6:amd64 5.4.0-6ubuntu1~16.04.2


I consider this an error because if I change the regex to this:

std::regex reg("^(.+):\\/\\/(.+@)?([a-zA-Z\\.0-9\\-]+)(:\\d{1,5})?([^?\\n\\#]*)(\\?[^#\\n]*)?(\\#.*)?$");


It works fine, although it should behave exactly the same.

Failing regex: https://ideone.com/7n2JdK

Working regex: https://ideone.com/6NMPUW

Am I missing something really important here, or is this an error within libstdc++6?

The only difference is in the character class:

[a-zA-Z\\.\\-0-9] // not working
[a-zA-Z\\.0-9\\-] // working

Answer

This is clearly a bug: "[.\\-0]" should be parsed as a character class that matches a ., a - (the hyphen is escaped with a literal \), or a 0. For an unknown reason, the escaped hyphen is parsed as a range operator, so the [a-zA-Z\\.\\-0-9]+ subexpression becomes equivalent to [a-zA-Z.-0-9]+. The range .-0 covers the ASCII characters ., / and 0, so the host group also accepts / and swallows the path. See this regex demo.

The second expression works because a - at the end of the character class (and at its start) is always parsed as a literal hyphen.
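As an illustration of that workaround, here is a minimal sketch that keeps the question's regex but moves the hyphen to the end of the host class, where the ECMAScript grammar treats it as a literal. The `UriParts` struct and the `split_uri` function are hypothetical names introduced for this example, not part of the original post. A raw string literal is used so the backslashes do not need doubling, and the escapes on /, # and . (inside the class) are dropped because they are not needed there.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type for this sketch: only three of the seven
// capture groups are exposed to keep the example short.
struct UriParts {
    bool ok = false;
    std::string protocol, host, path;
};

// Hypothetical helper: splits a URI with the question's regex, with the
// hyphen placed last in the host character class.
UriParts split_uri(const std::string& uri) {
    static const std::regex reg(
        R"(^(.+)://(.+@)?([a-zA-Z.0-9-]+)(:\d{1,5})?([^?\n#]*)(\?[^#\n]*)?(#.*)?$)");
    std::smatch m;
    UriParts parts;
    if (std::regex_match(uri, m, reg)) {
        assert(m.size() == 8);  // whole match plus seven capture groups
        parts.ok = true;
        parts.protocol = m[1].str();
        parts.host = m[3].str();
        parts.path = m[5].str();
    }
    return parts;
}
```

Calling `split_uri("http://example.com/test.html")` should give the host `example.com` and the path `/test.html`, because / is no longer accepted by the host class.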

Another example of the same bug:

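#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string uri = "%";
    std::regex reg(R"([$\-&])");
    std::smatch match;
    std::regex_match(uri, match, reg);
    // An affected libstdc++ prints "%"; a conforming library prints nothing.
    for (auto& e : match)
    {
        std::cout << e.str() << std::endl;
    }
}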

The [$\-&] regex should not match %; it should match only $, - or &. Yet the % (which lies between $ and & in the ASCII table) is still matched.
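The code-point arithmetic behind both symptoms can be checked at compile time. The sketch below assumes an ASCII-compatible execution character set, and `in_range` is a hypothetical helper written for this example. It only models the inclusive comparison that a bracket range performs; it does not call into the regex library.

```cpp
// Assumes an ASCII-compatible execution character set.

// Hypothetical helper: true if c lies in the inclusive range [lo, hi],
// which is the test a bracket range such as $-& performs on code points.
constexpr bool in_range(char lo, char c, char hi) {
    return lo <= c && c <= hi;
}

// '$' = 0x24, '%' = 0x25, '&' = 0x26: the range $-& contains '%'.
static_assert(in_range('$', '%', '&'), "% lies between $ and &");
// '-' = 0x2D is outside $-&, so a class misparsed as a range loses '-'.
static_assert(!in_range('$', '-', '&'), "- lies outside $-&");
// '.' = 0x2E, '/' = 0x2F, '0' = 0x30: the range .-0 contains '/'.
static_assert(in_range('.', '/', '0'), "/ lies between . and 0");
```

The last assertion is the question's symptom in miniature: once \- is read as a range operator, / becomes a valid host character.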