Tim Angus - 2 months ago 13

Perl Question

Why does the following regex capture (via the capturing group) the string 'abc' in Javascript, but not in PCRE (although it will still match)?

`(.*)*`

Answer

Here's why the capture group is empty in PCRE:

Initial state

`(.*)* abc ^ ^`

First, the

`(.*)`

part is matched against`abc`

, and the input position is advanced to the end. The capture group contains`abc`

at this point.`(.*)* abc ^ ^`

Now, the input position is

*after*the`c`

character, the remaining input is the empty string. The Kleene star initiates a second attempt at matching the`(.*)`

group:`(.*)* abc ^ ^`

The

`(.*)`

group matches the*empty string*after`abc`

. Since it matched,**the previously captured string is overwritten**.Since the input position hasn't advanced, the

`*`

ends iterating there and the match succeeds.

The behavior difference between JS and PCRE is due to the way the regex engines are specified. PCRE's behavior is consistent with Perl:

PCRE:

```
$ pcretest
PCRE version 8.39 2016-06-14
re> /(.*)*/
data> abc
0: abc
1:
```

Perl:

```
$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>
```

Let's compare this with .NET, which has the same behavior, but *supports multiple captures*:

When a capture group is matched for a second time, .NET will add the captured value to a *capture stack*. Perl and PCRE will simply overwrite it.

As for JavaScript:

Here's ECMA-262 §21.2.2.5.1 Runtime Semantics: RepeatMatcher Abstract Operation:

The abstract operation

RepeatMatchertakes eight parameters, a Matcher`m`

, an integer`min`

, an integer (or ∞)`max`

, a Boolean`greedy`

, a State`x`

, a Continuation`c`

, an integer`parenIndex`

, and an integer`parenCount`

, and performs the following steps:

- If
`max`

is zero, return`c(x)`

.- Create an internal Continuation closure
`d`

that takes one State argument`y`

and performs the following steps when evaluated:

- a. If
`min`

is zero and`y`

's`endIndex`

is equal to`x`

's`endIndex`

, return`failure`

.- b. If
`min`

is zero, let`min2`

be zero; otherwise let`min2`

be`min‑1`

.- c. If
`max`

is ∞, let`max2`

be ∞; otherwise let`max2`

be`max‑1`

.- d. Call
`RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)`

and return its result.- Let
`cap`

be a fresh copy of`x`

's captures List.- For every integer
`k`

that satisfies`parenIndex < k`

and`k ≤ parenIndex+parenCount`

, set`cap[k]`

to`undefined`

.- Let
`e`

be`x`

's endIndex.- Let
`xr`

be the State`(e, cap)`

.- If
`min`

is not zero, return`m(xr, d)`

.- If
`greedy`

is`false`

, then

- a. Call
`c(x)`

and let`z`

be its result.- b. If
`z`

is not`failure`

, return`z`

.- c. Call
`m(xr, d)`

and return its result.- Call
`m(xr, d)`

and let`z`

be its result.- If
`z`

is not`failure`

, return`z`

.- Call
`c(x)`

and return its result.

This is basically the definition of what's supposed to be happening when a quantifier is evaluated. `RepeatMatcher`

is the operation handling the matching of an inner operation `m`

.

You'll also need to understand what a *State* is (§21.2.2.1, emphasis mine):

A State is an ordered pair (

`endIndex`

,`captures`

) where`endIndex`

is an integer and captures is a List of`NcapturingParens`

values. States are used to represent partial match states in the regular expression matching algorithms.The, while`endIndex`

is one plus the index of the last input character matched so far by the pattern`captures`

holds the results of capturing parentheses. The`n`

th element of`captures`

is either a List that represents the value obtained by the`n`

th set of capturing parentheses or undefined if the`n`

th set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.

For your example, the `RepeatMatcher`

parameters are:

`m`

: the*Matcher*responsible for handling the subpattern`(.*)`

`min`

: 0 (minimum number of matches for the Kleene star quantifier)`max`

: ∞ (maximum number of matches for the Kleene star quantifier)`greedy`

: true (the Kleene star quantifier used is greedy)`x`

:`(0, [undefined])`

(see the state definition above)`c`

: The continuation, at this point it'll be:*a Continuation that always returns its State argument as a successful*, after collapsing the parent rules`MatchResult`

`parenIndex`

: 0 (as per §21.2.2.5 this is the*number of left capturing parentheses in the entire regular expression that occur to the left of this production*)`parenCount`

: 1 (same spec paragraph, this is*the number of left capturing parentheses in the expansion of this production's Atom*- I won't paste the full spec here but this basically means that`m`

defines one capture group)

The regex matching algorithm in the spec is defined in terms of continuation-passing style. Basically, this means that the `c`

operation means *what should happen next*.

Let's unroll this algorithm.

On the first pass, the `x`

_{1} state is `(0, [undefined])`

.

`max`

is not zero- We create the continuation closure
`d`

_{1}at this point, it'll be used in the second pass so we'll come back to this one later. - Make a copy
`cap`

_{1}of the capture list - Reset the capture in
`cap`

_{1}to`undefined`

, this is a no-op in the fist pass - Let
`e`

_{1}= 0 - Let
`xr`

_{1}= (`e`

_{1},`cap`

_{1}) `min`

is zero, skip this step`greedy`

is true, skip this step- Let
`z`

_{1}=`m`

(`xr`

,`d`

_{1}) - this evaluates the subpattern`(.*)`

Now let's step back a bit - `m`

will match `(.*)`

against `abc`

, and then call our `d`

_{1} closure, so let's unroll that one.

`d`

_{1} is evaluated with the state `y`

_{1} =`(3, ["abc"])`

:

`min`

is 0, but`y`

_{1}'s`endIndex`

is 3 and`x`

_{1}'s`endIndex`

is 0, so don't return`failure`

- Let
`min2`

=`min`

= 0 since`min`

= 0 - Let
`max2`

=`max`

= ∞ since`max`

= ∞ - Call
`RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)`

and return its result. That is: RepeatMatcher(m, 0, ∞, false, y_{1}, c, 0, 1)

So, right now we're going for a second iteration with `x`

_{2} = `y`

_{1} = `(3, ["abc"])`

.

`max`

is not zero- We create the continuation closure
`d`

_{2}at this point - Make a copy
`cap`

_{2}of the capture list`["abc"]`

- Reset the capture in
`cap`

_{2}to`undefined`

, we get`cap`

_{2}=`[undefined]`

- Let
`e`

_{2}= 3 - Let
`xr`

_{2}= (`e`

_{2},`cap`

_{2}) `min`

is zero, skip this step`greedy`

is true, skip this stepLet

`z`

_{2}=`m`

(`xr`

_{2},`d`

_{2}) - this evaluates the subpattern`(.*)`

This time

`m`

will match the empty string after`abc`

, and call our`d`

_{2}closure with that one. Let's evaluate what`d`

_{2}does. It's parameter is`y`

_{2}=`(3, [""])`

`min`

is still 0, but`y`

_{2}'s`endIndex`

is 3 and`x`

_{2}'s`endIndex`

is also 3 (remember that this time`x`

is the`y`

state from the previous iteration), the closure simply returns`failure`

.`z`

_{2}is`failure`

, skip this step- return
`c`

(`x`

_{2}), which is`c((3, ["abc"]))`

in this iteration.

`c`

simply returns a valid MatchResult here as we're at the end of the pattern. Which means that `d`

_{1} returns this result, and the first iteration returns passes it along from step 10.

Phew. Basically, as you can see, the spec line which causes the JS behavior to be different than PCRE's is basically the following one:

a. If

`min`

is zero and`y`

's`endIndex`

is equal to`x`

's`endIndex`

, return`failure`

.