Steffen Roller - 5 months ago
Perl Question

How to iterate over a multiline string with perl's regex

I need to extract several sections from a multiline string with Perl, applying the same regex in a while loop.
My problem is matching the last section, which ends at the end of the file rather than at another marker. My workaround is to append a marker to the string, so the regex will always find an end.
Is there a better way to do it?

Example file:

Header

==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2


Perl script:

#!/usr/bin/env perl

my $desc = do { local $/ = undef; <> };

$desc .= "\n===="; # set the end marker

while ($desc =~ /^==== (?<filename>.*?)#.*?====$(?<content>.*?)(?=^====)/mgsp) {
    print "filename=", $+{filename}, "\n";
    print "content=", $+{content}, "\n";
}


This way the script finds both segments. How can I avoid adding the marker?

Answer

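Use of the non-greedy modifier ? is a giant red flag. You can usually get away with using it once in a pattern, but using it more than once is usually a bug, because a non-greedy quantifier only prefers the shortest match and can still run past the string you meant to stop at. If you want to match text that doesn't contain a string, use the following instead: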

(?:(?!STRING).)*
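
To see the difference in isolation, here is a minimal sketch on a made-up string (the sample text and the variable names $text, $lazy and $tempered are assumptions for illustration, not part of the question). "END" plays the role of STRING:

```perl
use strict;
use warnings;

# Hypothetical sample text; "END" plays the role of STRING.
my $text = "BEGIN one END Y BEGIN two END X";

# Lazy quantifier: backtracking lets .*? run across the first "END"
# in order to reach the only place where " END X" matches.
my ($lazy) = $text =~ /BEGIN (.*?) END X/s;

# Tempered pattern: a character is consumed only at positions where
# "END" does not start, so the capture can never contain "END".
my ($tempered) = $text =~ /BEGIN ((?:(?!END).)*) END X/s;

print "lazy=<<$lazy>>\n";          # lazy=<<one END Y BEGIN two>>
print "tempered=<<$tempered>>\n";  # tempered=<<two>>
```

The lazy version still matches, but it captures more than intended. The lookahead version fails at the first BEGIN and succeeds only at the second.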

Applied to this problem, with ^==== as the string to exclude, that gets you the following:

/
   ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
   (?<content> (?:(?! ^==== ).)* )
/xsmg

Code:

my $desc = do { local $/; <DATA> };

while (
   $desc =~ /
      ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
      (?<content> (?:(?! ^==== ).)* )
   /xsmg
) {
   print "filename=<<$+{filename}>>\n";
   print "content=<<$+{content}>>\n";
}

__DATA__
Header

==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2

Output:

filename=<</home/src/file1.c#1>>
content=<<content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

>>
filename=<</home/src/file2.c#1>>
content=<<content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2
>>
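
If you would rather keep a pattern closer to the one in the question, another option is to let the lookahead accept either the next marker or the end of the string (\z). This is only a sketch of that idea, not the method shown above. The helper name parse_sections and the shortened sample text are assumptions for illustration. It uses a single non-greedy quantifier.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a list of
# [filename, content] pairs found in $text.
sub parse_sections {
    my ($text) = @_;
    my @sections;
    while (
        $text =~ /
            ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
            (?<content> .*? )     # the only lazy quantifier in the pattern
            (?= ^==== | \z )      # stop at the next header or at end of string
        /xmsg
    ) {
        push @sections, [ $+{filename}, $+{content} ];
    }
    return @sections;
}

# Shortened, made-up sample in the same format as the example file.
my $sample = "Header\n\n"
           . "==== /home/src/file1.c#1 ====\ncontent file1\n\n"
           . "==== /home/src/file2.c#1 ====\ncontent file2\n";

for my $section (parse_sections($sample)) {
    my ($filename, $content) = @$section;
    print "filename=<<$filename>>\n";
    print "content=<<$content>>\n";
}
```

Because \z matches only at the very end of the string, even under /m, the last section is found without appending a marker.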