Adrian - 6 months ago
Perl Question

How can I access the value in a named capture group in a regex in perl?

I'm trying to access the data captured by a named capture group that is called as a subroutine:

use strict;
use warnings;
"this is a test" =~ /(?!)
(?<isa>is\s+a)
| (?&isa)\s
(?<test>test)/x;
print "isa: $+{isa}\ntest: $+{test}"


And here's another attempt:

use strict;
use warnings;
"this is a test" =~ /(?!)
(?<isa_>(?<isa>is\s+a))
| (?&isa_)\s
(?<test>test)/x;
print "isa: $+{isa}\ntest: $+{test}"


I can't seem to get $+{isa} to be populated. Why is that and how do I do so?

Answer

Since you force the first branch to fail with (?!), the named capture group (?<isa>...) defined after it never captures anything (it is only defined as a subpattern).

Only the second branch succeeds, but it doesn't capture anything for the group "isa"; it only calls the subpattern with (?&isa_).

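Note that if the first example called (?&isa_) rather than (?&isa), it would not even compile. Perl would stop with the fatal error:

Reference to nonexistent named group in regex

since "isa_" is defined nowhere in that pattern.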

Your second example will not populate "isa" either, because a capture group only captures where it is defined, not where it is called (even though isa_ contains the group isa).

The reason is that Perl doesn't keep captures made inside a recursion or subroutine call: when the call returns, they are reverted, and only captures made at the ground level are kept. You can test it with this example:

"this is a test" =~ /
  (?!)
  (?<isa_>
      (?<isa> is \s+ a)
      (?{print "isa in recursion: $+{isa}\n"})
  )
|
  (?&isa_) \s (?<test> test )
/x;

print "isa: $+{isa}\ntest: $+{test}"

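Some other regex engines are able to keep captures made inside a recursion, such as the Python regex module, but neither Perl nor PCRE does. (The .net regex engine has no subroutine calls at all, so the question doesn't arise there in the same form.)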
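The ground-level rule can be seen in isolation with a smaller pattern. This is only a minimal sketch: the group name "w", the input string, and the variable names are made up for illustration and are not part of the question.

```perl
use strict;
use warnings;

# Group "w" first captures at the ground level, then is called as a subroutine.
my ($whole, $w);
if ("foo bar" =~ / (?<w> \w+ ) \s (?&w) /x) {
    $whole = $&;      # the whole match, including what the call consumed
    $w     = $+{w};   # only the ground-level capture of "w"
}

print "whole: $whole\n";
print "w: $w\n";
```

The call (?&w) consumes "bar", yet $+{w} still holds "foo": what the group matched during the call is discarded when the call returns.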


However, you can write:

"this is a test" =~ /
  (?!) (?<isa_> is \s+ a )
|
  (?<isa> (?&isa_) ) \s (?<test> test )
/x;

print "isa: $+{isa}\ntest: $+{test}";

This works because the named capture "isa" is now at the ground level: it wraps the call instead of sitting inside the called group.


Note: instead of using (?!) and an alternation to make the definition branch fail, you can use the (?(DEFINE)...) syntax:

/(?(DEFINE)
     (?<isa_> (?<isa> is \s+ a) )
 )
 (?&isa_) \s (?<test> test )
/x

or this one:

/(?<isa_> (?<isa> is \s+ a) ){0}
 (?&isa_) \s (?<test> test )
/x

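This way you avoid the cost of an alternation. These two patterns only replace the (?!) trick: "isa" still sits inside the called group, so to populate it you still need to wrap the call in a ground-level capture as above.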
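Putting the two ideas together, here is a hedged sketch that defines the subpattern in a (?(DEFINE)...) block and captures "isa" at the ground level. The names $re and %got are hypothetical helpers introduced for this example only.

```perl
use strict;
use warnings;

my $re = qr/
    (?(DEFINE)
        (?<isa_> is \s+ a )     # definition only, never matched in place
    )
    (?<isa> (?&isa_) ) \s       # ground-level capture wrapping the call
    (?<test> test )
/x;

my %got;
if ("this is a test" =~ $re) {
    %got = %+;                  # copy the named captures before another match resets them
}

print "isa: $got{isa}\ntest: $got{test}\n";
```

Copying %+ into a plain hash right after the match is optional, but it keeps the values safe if another regex runs before they are used.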