rjh rjh - 4 months ago 7
Perl Question

Why is `\s+` so much faster than `\s\s*` in this Perl regex?

Why does replacing

\s*
(or even
\s\s*
) with
\s+
result in such a speedup for this input?

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
'/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
'/\s+\n/' => sub { $x =~ /\s+\n/ },
});


Link to online version

I noticed a slow regex
s/\s*\n\s*/\n/g
in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.

I intuitively replaced the regex with
s/\s+\n/\n/g; s/\n\s+/\n/g;
and all was well.

But why is it so much faster? After using
re Debug => "EXECUTE"
I noticed the
\s+
version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against " _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against " _%n" (9 bytes)
0 <> < _%n> | 1:STAR(3)
SPACE can match 7 times out of 2147483647...
failed...
1 < > < _%n> | 1:STAR(3)
SPACE can match 6 times out of 2147483647...
failed...
2 < > < _%n> | 1:STAR(3)
SPACE can match 5 times out of 2147483647...
failed...
3 < > < _%n> | 1:STAR(3)
SPACE can match 4 times out of 2147483647...
failed...
4 < > < _%n> | 1:STAR(3)
SPACE can match 3 times out of 2147483647...
failed...
5 < > < _%n> | 1:STAR(3)
SPACE can match 2 times out of 2147483647...
failed...
6 < > < _%n> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
8 < _> <%n> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
8 < _> <%n> | 3: EXACT <\n>(5)
9 < _%n> <> | 5: END(0)
Match successful!




Matching REx "\s+\n" against " _%n"
Matching stclass SPACE against " _" (8 bytes)
0 <> < _%n> | 1:PLUS(3)
SPACE can match 7 times out of 2147483647...
failed...


I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually
/\s*\n/
against a string of spaces would take exponential-time). Can anyone offer insight into why the
\s+
version is so much faster?

Also note that
\s*?
does not offer any speedup.

Answer

The \s+\n requires that the character preceding the \n is a SPACE. So it greedily goes to find the last \n, the previous character is not a space, and we are done. It will then scan the string for earlier \n and find none. It's one shot. (If there were more \n each would be found to fail equally directly.)

According to use re qw(debug) compilation establishes that it needs a straight string of a known number of spaces, up to the fixed substring \n which is checked for in the input. Then it checks the fixed-length space-substring against the remaining part of input, failing right away for _. It's a single possibility, regardless of how many spaces the input has.

Looked at it this way, it's an optimization you'd almost expect, utilizing a rather specific search pattern and lucking out with this input. Except when taken in comparison with other engines of course, which clearly don't do this kind of analysis.

With \s*\n this is not the case. Once the \n is found and the previous character is not a space, the search hasn't failed. There are no fixed-length substrings. Thus it's in the backtracking game.

Comments