TrueY TrueY -4 years ago 181
Perl Question

How can I detect how many capture groups are in a Perl Regexp?

I have a bunch of s in a script. I would like to know how many capture groups are in them. More precisely I'd like to know how many items would be added to the @- and @+ arrays if they matched before actually use them in a real match op.

An example:

'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/
and print "'@-', '@+'\n";

In this case the output is:

'1 2 11', '15 4 14'

So after matching I know that the 0th item is the matched part of the string, and there are two capture group expressions. Would it be possible to know right before the actual match?

I tried to concentrate onto the opening brackets. So I removed the '\\' patterns first to make easier to detect the escaped brackets. Then I removed '\(' strings. Then came '(?'. Now I can count the remaining opening brackets.

my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'; print "ORIG: '$re'\n";
'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
$re =~ s/\\\\//g; print "\\\\: '$re'\n";
$re =~ s/\\\(//g; print "\\(: '$re'\n";
$re =~ s/\(\?//g; print "\\?: '$re'\n";
my $n = ($re =~ s/\(//g); print "n=$n\n";


ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
RE: '1 2 11', '15 4 14'
\\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
\(: '(?i)x(ab)cd\)(?:de)(fg)x'
\?: 'i)x(ab)cd\):de)(fg)x'

So here I know that 2 capture groups are in this . But maybe there is an easier way and this is definitely not complete (e.g. this treats
as a non-caputre groups).

Another way would be to dump the internal data structures of
function. Maybe the package Regexp::Debugger could solve the issue, but I have no right to install packages in my environment.

Actually the s are keys to some ARRAY refs and I'd like to check if the referenced ARRAY contains the proper amount of values before actually applying the s. Of course this checking can be done right after the pattern matching, but it would be nicer if I could do it in the loading stage of the script.

Thank you for your help and comments in advance!

Answer Source

As Mr. Obama said: "Yes We Can!"

I found a solution which does not require additional module and handles all the possible capturing group events (as I know). As ikegami mentioned it needs reparsing of the regexp, but does it for us.

During digging in the haystack of Perl modules on CPAN I found a very interesting one called warnings::regex::recompile. It generates a warning message each time a regexp is re-compiled. Analyzing the source I found the solution of my problem.

Using use re qw/Debug DUMP/; Perl returns the parsed regex to STDERR. In the original module result is dumped to a real file and then reread to process. I modified the code to use in-memory file.

My solution is:

sub dumpre {
  use re qw(eval Debug DUMP);
  my $buf = ''; 

  open OLDERR, '>&', STDERR or die "$!";
  close STDERR or die "$!";
  open STDERR, '>', \$buf or die "$!";

  my $re = qr/$_[0]/;

  close STDERR or die "$!";
  open STDERR, '>&', OLDERR or die "$!";
  close OLDERR or die "$!";

  no re 'debug'; # Needed because of split

  return [ split '\n', $buf ];

This function turns on DUMP when compiling a regexp. Enables eval to be able to handle (?{...}) and (??{...}) expressions.

my $re = 'aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+';
my $r = dumpre $re;
print join "\n", @$r;

The result is:

Compiling REx "aa(?:(a\d)+x)?((b\d)*d)*c*(d\d)?(e*)((f)+)(g)+"
Final program:
   1: EXACT <aa> (3)
   3: CURLYX[0] {0,1} (19)
   5:   CURLYM[1] {1,32767} (16)
   9:     EXACT <a> (11)
  11:     POSIXU[\d] (14)
  14:     SUCCEED (0)
  15:   NOTHING (16)
  16:   EXACT <x> (18)
  18: WHILEM (0)
  19: NOTHING (20)
  20: CURLYX[1] {0,32767} (40)
  22:   OPEN2 (24)
  24:     CURLYM[3] {0,32767} (35)
  28:       EXACT <b> (30)
  30:       POSIXU[\d] (33)
  33:       SUCCEED (0)
  34:     NOTHING (35)
  35:     EXACT <d> (37)
  37:   CLOSE2 (39)
  39: WHILEM[1/7] (0)
  40: NOTHING (41)
  41: STAR (44)
  42:   EXACT <c> (0)
  44: CURLYM[4] {0,1} (55)
  48:   EXACT <d> (50)
  50:   POSIXU[\d] (53)
  53:   SUCCEED (0)
  54: NOTHING (55)
  55: OPEN5 (57)
  57:   STAR (60)
  58:     EXACT <e> (0)
  60: CLOSE5 (62)
  62: OPEN6 (64)
  64:   CURLYN[7] {1,32767} (74)
  66:     NOTHING (68)
  68:     EXACT <f> (0)
  72:   WHILEM (0)
  73:   NOTHING (74)
  74: CLOSE6 (76)
  76: CURLYN[8] {1,32767} (86)
  78:   NOTHING (80)
  80:   EXACT <g> (0)
  84: WHILEM (0)
  85: NOTHING (86)
  86: END (0)
anchored "aa" at 0 floating "fg" at 2..9223372036854775807 (checking floating) minlen 4 

So the lines with OPEN\d+, CURLYM[\d+], CURLYN[\d+] shows capturing bracket expressions (Line syntax: segment_no: regex command (next segment)). (Note: CURLYX is a non-capturing bracket expression like (?:...)+). The number after OPEN/CURLY[MN} shows the ordinal number of capturing group. The last has to be found. Which is 8 in this case.

Unfortunately it does not handle if the (??{...}) returns a bracket expression, but this is not really needed for me now. I suppose the format is not fixed, so it can differ from version to version. But it is ok for me.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download