cajwine cajwine - 2 months ago 10
Perl Question

Regex for matching indented continuation lines

I need to match key = value pairs in arbitrary text using the following rules.


  • The leading line has this structure:

    • it starts with indentation: "two spaces or a tab", at least once, e.g. (  |\t)+

    • followed by the + character and one space

    • then the word VAR or CONST

    • and then the key and the value, separated by the = character




Examples:

  + VAR somename = somevalue (indented with two spaces)
	+ VAR name3 = indented by one \t


The following regex matches such lines:

/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/
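As a quick check of the leading-line rule, here is a minimal sketch that runs the pattern against a few lines. The sample lines and the @parsed array are made up for illustration, and the alternation is assumed to contain two literal spaces, as the rules describe.

```perl
use strict;
use warnings;

# Leading-line pattern from above; the alternation holds TWO literal spaces.
my $leading = qr/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;

# Hypothetical sample lines, not taken from the question's data.
my @lines = (
    "  + VAR somename = somevalue",   # two spaces: matches
    "\t+ CONST name3 = by one tab",   # one tab: matches
    " + VAR bad = only one space",    # one space: no match
    "  + VAR name5",                  # no '=': no match
);

my @parsed;
for my $line (@lines) {
    next unless $line =~ $leading;
    push @parsed, { type => $2, key => $3, val => $4 };
    print "$2 $3 => [$4]\n";
}
```

Only the first two lines should be reported; the other two fail on the indentation and on the missing = respectively.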


Now the problem: the syntax allows continuation lines. When a leading line is followed by a line that starts with at least one indentation sequence (  |\t) (i.e. TWO spaces or one tab), that line is considered a continuation line, and its whole content (including the leading spaces) should be appended to the value of the key on the previous line.

Example:

  + VAR multi = 3 line value where the continuation lines
  are indented (starts with two spaces or one tab)
  and NOT followed by the '+'


i.e., the regex for a continuation line is

/^(  |\t)+([^\+](.*))$/


The solution is easy with a line-based approach, i.e. when I split the whole text into lines and process it line by line.

But I am looking for a (complex) regex (mainly for learning and benchmarking purposes) that could match the key = value pairs in either single-line or multi-line form. I tried this:

while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
...
}


but I got:

(?=( |\t)+[^\+](.*)$)* matches null string many times in regex; marked by <-- HERE in m/^( |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=( |\t)+[^\+](.*)$)* <-- HERE )/ at so line 36.
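The warning most likely comes from putting a * quantifier on a lookahead: a lookahead is zero-width, so repeating it can never move forward in the text. A minimal sketch of the difference between peeking and consuming, using a made-up two-line string ($s, $peeked and $consumed are illustrative names only):

```perl
use strict;
use warnings;

# Hypothetical two-line sample: a value plus one indented continuation line.
my $s = "key = value\n  more";

# A lookahead only peeks: the continuation line is asserted, not consumed.
my ($peeked)   = $s =~ /^(.*(?=\n[ ]{2}.*))/;

# A plain non-capturing group consumes it, so it becomes part of the match.
my ($consumed) = $s =~ /^(.*(?:\n[ ]{2}.*))/;

print "lookahead: [$peeked]\n";
print "consuming: [$consumed]\n";
```

With the lookahead, the captured text stops at the end of the first line; with the consuming group, it includes the continuation line.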


Side question: how do I write multi-line extended (/x) regexes, like:

/
^( |\t)+ # <- space ... :(
\+\s+
(VAR|CONST)
\s+
(\w+)
\s*=\s*
(.*)$
/x


when the regex must contain a literal SPACE character (i.e. I can't use the universal \s)?

If someone wants to help, here is code that produces the wanted output (using the line-based approach) and also the non-working regex-based solution.

#!/usr/bin/env perl
use 5.014;
use warnings;
use Data::Dumper;

my $txt = do { local $/; <DATA> };

my @matches1 = parse_by_lines($txt // '');
mydump('BY LINES', @matches1);

my @matches2 = parse_by_one_regex($txt // '');
mydump('REGEX', @matches2);

sub parse_by_lines {    # produces the wanted output
    my ($text) = @_;
    my @match;
    my $havekey;
    for my $line (split "\n", $text) {
        if( $line =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/ ) {
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
            $havekey++;
        }
        elsif( $havekey && $line =~ m/^(  |\t)+([^\+](.*))$/ ) {    # continuation line
            $match[-1]->{val} .= "\n$line";    # preserve the \n in the val
        }
        else {
            $havekey = 0;
        }
    }
    return @match;
}


sub parse_by_one_regex {    # not working
    my ($text) = @_;
    my @match;
    while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

sub mydump {
    my($label, @match) = @_;
    say "#### $label ####";
    for my $m ( @match ) {
        printf "%-6s: [%s]\n", $_, $m->{$_} for (qw(indent type key val));
        print "\n";
    }
}

__DATA__
some arbitrary text lines
or empty lines

could be indented
and could contain any character

+ VAR name1 = var indented by two spaces and the first nonspace character is '+'
line of arbitrary text
+ VAR name2 = var indented by 2x2 spaces

+ VAR name3 = var indented by one \t
+ VAR name4 = the next line with "name5" is not valid. missing the = character, should not be matched
+ VAR name5
+ CONST name6 = the type could be VAR or CONST

+ VAR multi1 = multiline value where the continuation lines
are indented (starts with two spaces or one tab) and NOT followed by the '+'

+ VAR multi1 = multiline value
indented

+ VAR multi1 = multiline value
indented ok too


+ VAR single = this is single line
+ because this line even if it is indented, the first nonspace character is '+'

+ VAR multi2 = multiline
could be
indented
any way
and any number of times
until the first non-indented line

the following should NOT match

+ VAR some = sould not be matched, because the line isn't indented
+ VAR some = sould not be matched, because the line isn't indented at least with TWO spaces or one tab
+ SOME name = value not matched because the SOME isn't VAR or CONST





EDIT: using the accepted answer and adding the wanted capture groups, I got the following:

while( $text =~ /
    (?m)
    ^
    ([ ]{2}|\t)+            # two spaces or one tab at least once (captured)
    \+\s+                   # the + character followed by whitespace
    (VAR|CONST)             # the type declaration (captured)
    \s+                     # separated by whitespace
    (\w+)                   # keyword name (captured)
    \s*=\s*                 # the = character surrounded by any number of spaces
    (                       # capture the whole value as it is
        .*                  # anything up to line end
        (?:                 # followed by continuation lines
            \R              # one line-break
            ^               # start of the line
            (?:[ ]{2,}|\t)+ # at least two spaces or one tab character
            [^+]            # not the +
            .*              # and anything up to end
        )*                  # any number of times (i.e. optionally)
    )/xg ) {
    push @match, { indent => $1, type => $2, key => $3, val => $4 };
}
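For anyone who wants to try the capturing variant in isolation, here is a self-contained sketch on made-up input. The sample text is hypothetical, the \+\s+ step is assumed from the rules at the top of the question, and the continuation part borrows the atomic indent and [^+\s] test from the accepted answer rather than the exact form shown above.

```perl
use strict;
use warnings;

# Hypothetical sample text, not the question's DATA section.
my $text = join "\n",
    "  + VAR multi = first",
    "    second",
    "",
    "\t+ CONST pi = 3.14",
    "plain line";

my @match;
while ($text =~ /
        ^
        ([ ]{2}|\t)+              # indentation unit, last repetition captured
        \+\s+                     # the + marker, assumed from the stated rules
        (VAR|CONST) \s+ (\w+)     # type and key, both captured
        \s*=\s*
        (                         # value including continuation lines
            .*
            (?: \R (?>(?:[ ]{2}|\t)+) [^+\s] .* )*
        )
    /xmg) {
    push @match, { indent => $1, type => $2, key => $3, val => $4 };
}

printf "%s %s = [%s]\n", $_->{type}, $_->{key}, $_->{val} for @match;
```

The /g flag is what lets the while loop advance from one pair to the next; without it the loop would keep re-matching the first pair.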

Answer

Regex:

(?m)^(?:  +|\t+)\+ *(?:VAR|CONST) *\w* *=.*(?:\R^(?>  +|\t+)[^+\s].*)*
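A short sketch of how this pattern might be applied in Perl. The sample text and the variable names are invented for illustration; the pattern itself is copied unchanged.

```perl
use strict;
use warnings;

# Hypothetical sample text; "\t" stands for a real tab character.
my $text = join "\n",
    "intro text",
    "  + VAR a = one",
    "    two",
    "  + CONST b = single",
    "\t+ VAR c = tabbed",
    "\tcontinued",
    " + VAR d = one space only",
    "";

# The answer's pattern, unchanged; "  +" is one space followed by
# one-or-more spaces, i.e. at least two spaces.
my $re = qr/(?m)^(?:  +|\t+)\+ *(?:VAR|CONST) *\w* *=.*(?:\R^(?>  +|\t+)[^+\s].*)*/;

# There are no capture groups, so a list-context /g returns whole matches.
my @found = $text =~ /$re/g;
print "[$_]\n" for @found;
```

On this input the sketch should report three pairs: a (two lines), b (one line) and c (two lines). The line for d is skipped because it is indented by a single space.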


The important part is the last cluster:

(?:                # Start of non-capturing group (a)
    \R             # One line-break
    ^              # Start of line
    (?>  +|\t+)    # At least two spaces or one tab character (possessively)
    [^+\s]         # Next character is neither `+` nor whitespace (including a newline)
    .*             # Up to end of line
)*                 # Repeat it as much as possible - end of non-capturing group (a)
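To illustrate why the indent is matched atomically and the next character is tested with [^+\s] rather than plain [^+], here is a small sketch on a made-up line ($line, $loose and $strict are illustrative names). With an ordinary group and [^+], the indent can give back one space, and that space then satisfies [^+].

```perl
use strict;
use warnings;

# Hypothetical continuation candidate: four spaces, then a '+'.
my $line = "    + VAR next = starts a new pair";

# Backtracking version: the indent gives back one space, which [^+] accepts.
my $loose  = $line =~ /^(?:[ ]{2,}|\t+)[^+].*/ ? 1 : 0;

# Atomic indent plus [^+\s]: the indent keeps all four spaces, so '+' is seen.
my $strict = $line =~ /^(?>[ ]{2,}|\t+)[^+\s].*/ ? 1 : 0;

print "loose=$loose strict=$strict\n";
```

The loose form accepts the line as a continuation, while the strict form rejects it, which is the behaviour the rules ask for.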

Answer to your second question:

When the x modifier is set, literal space characters in the pattern are ignored rather than treated as part of the regular expression, unless you enclose them in a character class, [ ]. Add a quantifier, as in [ ]{2,}, to say how many times the space should appear.

/
    (?m)
    ^
    (?:
        [ ]{2,}
        |
        \t+
    )\+
    [ ]*
    (?:
        VAR
        |
        CONST
    )
    [ ]*\w*[ ]*=.*
    (?:
        \R
        ^
        (?>
            [ ]{2,}
            |
            \t+
        )
        [^+\s].*
    )*
/x
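A minimal sketch of the three spellings under /x, using a made-up sample line. The backslash-escaped space and \x20 are alternatives to the character class that Perl also treats as a literal space under /x.

```perl
use strict;
use warnings;

my $line = "  + VAR x = 1";   # hypothetical sample, indented by two spaces

# Under /x the bare spaces vanish, leaving ^\+VAR, which cannot match.
my $bare    = $line =~ /^  \+ VAR/x       ? 1 : 0;

# A bracketed class keeps the space significant.
my $class   = $line =~ /^[ ]{2}\+[ ]VAR/x ? 1 : 0;

# So does a backslash-escaped space or the \x20 escape.
my $escaped = $line =~ /^\ \x20\+\ VAR/x  ? 1 : 0;

print "bare=$bare class=$class escaped=$escaped\n";
```

Only the first form fails, because its spaces are treated as layout and dropped from the pattern.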
