Andrew Newby - 6 months ago
Perl Question

Perl regex seems to get into infinite loop

I'm trying to figure out why this code won't run on some sites. Here is a working version:

my $url = "http://www.bbc.co.uk/news/uk-36263685";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);

# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
    print qq|FOO: got header...\n|;
}


...and then this broken version, which just seems to lock up (exactly the same code, just a different URL):

my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
$html = join "\n", <READPAGE>;
close(READPAGE);

# Locks up with this regex, but only on some pages
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
    print qq|FOO: got header...\n|;
}


I can't work out what's going on with it. Any ideas?

Thanks!

Answer

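It isn't an infinite loop, it is just slow. The <head.*?> part of the pattern also matches <header> tags, and for each one the engine has to go through the rest of the file looking for a closing </head> tag (which isn't there). Adding a word boundary (\b) after head stops it from matching <header>. Change it to: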

`m/<head\b.*?>(.*?)<\/head>/gis`

The problem seems to be exacerbated by reading a file that is not UTF-8 as if it were UTF-8.
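
To see the effect of the word boundary in isolation, here is a minimal sketch. The tiny $html string is made up for illustration (it is not the sport.pl page), and it only compares which opening tags each pattern treats as a candidate start:

```perl
use strict;
use warnings;

# Hypothetical miniature page with both a <head> and a <header> element.
my $html = '<html><head><title>t</title></head>'
         . '<body><header class="top">menu</header></body></html>';

# Original opening-tag pattern: "<head" is also a prefix of "<header",
# so both tags are picked up as possible starting points.
my @loose = $html =~ m/(<head.*?>)/gis;

# With \b, "head" must be followed by a non-word character,
# so "<header" is no longer accepted as an opening tag.
my @strict = $html =~ m/(<head\b.*?>)/gis;

print "loose:  @loose\n";
print "strict: @strict\n";
```

The first pattern finds two candidates (<head> and <header class="top">), the second finds only <head>. On a large page, every extra candidate is a place where the full regex begins a match and then has to scan ahead for a </head> that may never come, which is where the time goes.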
