user1754738 - 7 months ago
Perl Question

Perl: Detect Newline in HTML

I don't have access to modules such as Mojo, but I need to capture all of the content between two H3 headings. On some pages the heading text is followed by a newline or carriage return before the closing tag (I'm not sure how to tell which one), while on others it isn't. I need a regexp that captures both cases. Here is the source for the two scenarios:

1st Scenario

<h3>Summary</h3>
<h3>Solution</h3>


2nd Scenario

<h3>Summary
</h3>
<h3>Solution
</h3>


My current code looks something like this:

if ($doc =~ m{<h3>Summary(?s:.)</h3>(.+?)<h3>Solution(?s:.)</h3>}si)
{
    my $summaryp = $1;
    $summaryp =~ s{<.+?>}{}gsi;
    ...
}


I've tried a number of variations using \n, \r, (.+?), \S\s, etc., but none of them captures scenario #2.

I'm also not sure whether there's a space or two before the newline, so to be thorough I need something that accounts for any characters, spaces, or line breaks.

Answer

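Per the comments on the question, m{<h3>Summary.*?</h3>(.*?)<h3>Solution.*?</h3>}si should do what is needed. With the /s modifier, .*? matches any run of characters, including none at all and including newlines, whereas the (?s:.) in the original pattern requires exactly one character before each </h3>.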
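A quick way to see the difference is to run both patterns against each kind of gap that might sit before the closing tag. This is only a sketch: the sample documents and the compare_patterns helper are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

# Pattern from the question: (?s:.) demands exactly one character
# between the heading text and </h3>.
my $single = qr{<h3>Summary(?s:.)</h3>(.+?)<h3>Solution(?s:.)</h3>}si;

# Suggested pattern: .*? allows zero or more characters, newlines included.
my $lazy = qr{<h3>Summary.*?</h3>(.*?)<h3>Solution.*?</h3>}si;

# Hypothetical samples: each value is what sits between the heading
# text and the closing </h3>.
my %gap = (
    'none'     => "",
    'LF'       => "\n",
    'CRLF'     => "\r\n",
    'space+LF' => " \n",
);

# Returns a hash ref: label => [ original matched?, suggested matched? ]
sub compare_patterns {
    my %result;
    for my $label (keys %gap) {
        my $g   = $gap{$label};
        my $doc = "<h3>Summary$g</h3>\nbody\n<h3>Solution$g</h3>\n";
        $result{$label} = [ ($doc =~ $single ? 1 : 0), ($doc =~ $lazy ? 1 : 0) ];
    }
    return \%result;
}

my $r = compare_patterns();
printf "%-9s original=%d suggested=%d\n", $_, @{ $r->{$_} } for sort keys %$r;
```

With these samples the original pattern matches only the single-LF case, while the suggested pattern matches all four.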

Here's the full example I tested with:

use warnings;
use strict;

my $doc1 = <<EOF;
<h3>Summary
</h3>
blah 1
this is some stuff
<h3>Solution
</h3>
EOF

my $doc2 = <<EOF2;
<h3>Summary</h3>
blah 2
this is more stuff
<h3>Solution</h3>
EOF2

for my $doc ($doc1, $doc2){
    if ($doc =~ m{<h3>Summary.*?</h3>(.*?)<h3>Solution.*?</h3>}si){
        print "$1\n";
    }
}

Output:

blah 1
this is some stuff


blah 2
this is more stuff
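
The question also asks how to tell which kind of line ending a page uses, and the captured text above still carries its surrounding newlines. One possible sketch for both follows. The helper names line_ending_kind and summary_text are hypothetical, and summary_text uses \s* in place of .*? as a stricter variant that only tolerates whitespace around the heading text, which assumes nothing but whitespace ever appears there.

```perl
use strict;
use warnings;

# Report which line-ending style a string uses (hypothetical helper).
sub line_ending_kind {
    my ($text) = @_;
    my $crlf = () = $text =~ /\r\n/g;         # carriage return + line feed pairs
    my $lf   = () = $text =~ /(?<!\r)\n/g;    # line feeds not preceded by \r
    my $cr   = () = $text =~ /\r(?!\n)/g;     # carriage returns not followed by \n
    my $kinds = grep { $_ } ($crlf, $lf, $cr);
    return 'none'  if $kinds == 0;
    return 'mixed' if $kinds > 1;
    return $crlf ? 'CRLF' : $lf ? 'LF' : 'CR';
}

# Capture the text between the Summary and Solution headings, tolerating
# any whitespace (spaces, \r, \n) around the heading text.
sub summary_text {
    my ($doc) = @_;
    return unless $doc =~ m{<h3>\s*Summary\s*</h3>(.*?)<h3>\s*Solution\s*</h3>}si;
    my $text = $1;
    $text =~ s{<[^>]+>}{}g;       # drop any remaining tags, as in the question
    $text =~ s/\r\n?/\n/g;        # normalise CRLF and bare CR to LF
    $text =~ s/\A\s+|\s+\z//g;    # trim leading and trailing whitespace
    return $text;
}

# Made-up sample with a trailing space and CRLF line endings.
my $doc = "<h3>Summary \r\n</h3>\r\nblah 1\r\n<h3>Solution\r\n</h3>\r\n";
print line_ending_kind($doc), "\n";    # CRLF
print summary_text($doc), "\n";        # blah 1
```

If the headings can contain anything other than whitespace after the text, the .*? form from the answer is the safer choice.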