Jerry Jerry - 5 months ago 23
Javascript Question

Greediness behaving differently in JavaScript?

There was this question which made me realise that greediness of quantifiers is not always the same in certain regex engines. Taking the regex from that question and modifying it a bit:

!\[(.*?)*\]


(I know that
*
is redundant here, but I found what's following to be quite an interesting behaviour).


And if we try to match against:

![][][]


I expected to get the first capture group to be empty, because
(.*?)
is lazy and will stop at the first
]
it comes across. This is indeed what happens in:



I looked around with some other languages, for instance ruby, java, C# but all behave like I expected them to (i.e. return empty capture groups).

(regexplanet's golang flavour apparently also gets non-empty capture groups)

It seems that JavaScript's regex engine is interpreting the second
*
to convert
.*?
from lazy to greedy. Note that converting the second
*
to
*?
seems to make the regex work as I expected (as does removing the quantifier completely, because I know that it's being redundant in this situation, but that's not the point).

*
was used in the regex, but this behaviour is similar with
+
,
?
or
{m,n}
and converting them to their lazy version gives the same results as with
*?
.

Does anyone know what's really happening?

Answer

Short answer

The behavior is defined in ECMA-262 specs in section 15.10.2 Pattern Semantics, especially in 15.10.2.5 where it discusses the semantics of the production Term :: Atom Quantifier.

As a slight generalization: Let E be a pattern that can match empty string. If there exists an input string S where empty string is the first matchable choice in E, patterns which contain a greedy repetition of pattern E are affected. For example:

(a*?)*              against     aaaa
!\[(.*?)*\]         against     ![][][]
(.*?){1,4}          against     asdf
(|a)*               against     aaaa
(a|()|b){0,4}       against     aaabbb

Firefox and other Webkit-based browsers seems to follow the specs closely, while IE is off the specs in the case when there is no sequel to the affected pattern.

Long answer

The relevant part of the specs is quoted below. Some part of the specs is omitted ([...]) to keep the discussion relevant. I will explain by condensing the information from the specs while keeping it simple.

Vocabulary

  • A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is an internal array of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. [...]. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token failure that indicates that the match failed.

There should be no confusion here. State keeps track of the position that has been processed so far (and also the captures, which we are not interested in for the moment). MatchResult, well, is the match result (duh!).

  • A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
  • A Matcher procedure is an internal closure that takes two arguments -- a State and a Continuation -- and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.

A Matcher contains code to match a subpattern that it represents, and it will call Continuation to continue the match. And Continuation contains code to continue the match from where Matcher left off. They both accepts State as an argument to know where to start matching.

Production Term :: Atom Quantifier

The production Term :: Atom Quantifier evaluates as follows:

  1. Evaluate Atom to obtain a Matcher m.
  2. Evaluate Quantifier to obtain the three results: an integer min, an integer (or ∞) max, and Boolean greedy.
  3. If max is finite and less than min, then throw a SyntaxError exception.
  4. Let parenIndex be the number of left capturing parentheses in the entire regular expression that occur to the left of this production expansion's Term. [...]
  5. Let parenCount be the number of left capturing parentheses in the expansion of this production's Atom. [...]
  6. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following:
    1. Call RepeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) and return its result.

Take note that m is the Matcher for the Atom that is being repeated, and Continuation c is passed in from the code generated from higher level production rules.

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following:

  1. If max is zero, then call c(x) and return its result.
  2. Create an internal Continuation closure d that takes one State argument y and performs the following:
    1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.
    2. If min is zero then let min2 be zero; otherwise let min2 be min - 1.
    3. If max is ∞, then let max2 be ∞; otherwise let max2 be max - 1.
    4. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
  3. Let cap be a fresh copy of x's captures internal array.
  4. For every integer k that satisfies parenIndex < k and k ? parenIndex + parenCount, set cap[k] to undefined.
  5. Let e be x's endIndex.
  6. Let xr be the State (e, cap).
  7. If min is not zero, then call m(xr, d) and return its result.
  8. If greedy is false, then
    1. Call c(x) and let z be its result.
    2. If z is not failure, return z.
    3. Call m(xr, d) and return its result.
  9. Call m(xr, d) and let z be its result.
  10. If z is not failure, return z.
  11. Call c(x) and return its result.

Step 2 defines a Continuation d where we try to match another instance of the repeated Atom, which is later called in step 7 (min > 0 case), step 8.3 (lazy case) and step 9 (greedy case) via the Matcher m.

Step 5 and 6 creates a copy of the current State, for backtracking purpose, and to detect empty string match in step 2.

The steps from here on describes 3 separate cases:

  • Step 7 covers the case where the quantifier has non-zero min value. Greediness regardless, we need to match at least min instances of Atom, before we are allowed to call the Continuation c.

  • Due to the condition in step 7, min is 0 from this point on. Step 8 deals with the case where the quantifier is lazy. Step 9, 10, 11 deals with the case where the quantifier is greedy. Note the reversed order of calling.

Calling m(xr, d) means matching one instance of the Atom, then calling Continuation d to continue the repetition.

Calling Continuation c(x) means getting out of the repetition and matching whatever comes next. Note how Continuation c is passed to the RepeatMatcher inside Continuation d, so that it is available to all subsequent iteration of the repetition.

Explanation of the quirk in JavaScript RegExp

Step 2.1 is the culprit that causes the discrepancy in the result between PCRE and JavaScript.

  1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.

The condition min = 0 is reached when the quantifier originally has 0 as min (* or {0,n}) or via step 7, which must be called as long as min > 0 (original quantifier is + or {n,} or {n,m}).

When min = 0 and the quantifier is greedy, Matcher m will be called (in step 9), which attempts to match an instance of Atom and call Continuation d that is set up in step 2. Continuation d will recursively call RepeatMatcher, which in turn will call Matcher m (step 9) again.

Without step 2.1, if Matcher m has empty string as its first possible choice to advance ahead in the input, the iteration will (theoretically) continue forever without advance ahead. Given the limited features that JavaScript RegExp supports (no forward-declared back-reference or other fancy features), there is no need to go ahead with another iteration when the current iteration matches empty string - an empty string is going to be matched again anyway. Step 2.1 is JavaScript's method of dealing with repetition of empty string.

As set up in step 2.1, when min = 0, the JavaScript regex engine will prune when an empty string is matched by the Matcher m (return failure). This effectively rejects "empty string repeated finitely many times is an empty string".

Another side effect of step 2.1 is that from the point when min = 0, as long as there is one instance where Matcher m matches non-empty string, the last repetition of Atom is guaranteed to be non-empty.

In contrast, it seems PCRE (and other engines) accepts "empty string repeated finitely many times is an empty string", and continue on with the rest of the pattern. This explains the behavior of PCRE in the cases listed above. As for the algorithm, PCRE stops the repetition after matching empty string twice in a row.

Comments