BrianB - 1 month ago
Perl Question

How to Parse a webpage

I am trying to extract hourly data from the EnviroCanada weather page.

For each hour, I want the following fields:

Time | Thigh | Tlow | Humidity

7:00 | 23 | 22.9 | 30

Extracted HTML Page:

<td headers="header1" class="text-center vertical-center"> 7:00 </td>
<td headers="header2" class="media vertical-center"><span class="pull-left"><img class="media-object" height="35" width="35" src="/weathericons/small/02.png" /></span><div class="visible-xs visible-sm">
<br />
<br />
<div class="media-body">
<p>Partly Cloudy</p>
<td headers="header3m" class=" metricData text-center vertical-center">23
<td headers="header3i" class=" imperialData hidden text-center vertical-center">73
<td headers="header4m" class="metricData text-center vertical-center">
<abbr title="West-Northwest">WNW</abbr> 8</td>
<td headers="header4i" class="imperialData hidden text-center vertical-center">
<abbr title="West-Northwest">WNW</abbr> 5</td>
<td headers="header6" class="metricData text-center vertical-center">30</td>
<td headers="header6" class="imperialData hidden text-center vertical-center">87</td>
<td headers="header7" class="text-center vertical-center">83</td>
<td headers="header8" class="metricData text-center vertical-center">20</td>
<td headers="header8" class="imperialData hidden text-center vertical-center">68</td>
<td headers="header9m" class="metricData text-center vertical-center">100.7</td>
<td headers="header9i" class="imperialData hidden text-center vertical-center">29.7</td>
<td headers="header10" class="metricData text-center vertical-center">24</td>
<td headers="header10" class="imperialData hidden text-center vertical-center">15</td>

Code so far:

use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $url = "";
my $page = get($url) ||
die "Could not load URL\n";

my $parser = HTML::TokeParser->new(\$page) ||
die "Parse error\n";

$parser->get_tag("td") foreach ();
my $time = $parser->get_text();

my $thigh = $parser->get_text();

my $tlow = $parser->get_text();

my $humid = $parser->get_text();

I'm completely lost here.


Once you fetch the page with LWP::Simple, you can pick a specific tool depending on what needs to be done with it, instead of using a general parser.

In this case you have a table on your hands, so I'd recommend HTML::TableExtract. It lets you retrieve table elements cleanly in a number of ways and then process them. It can work with multiple tables, make use of headers, set up parsing preferences, and more. Most of the time you don't even have to look at the actual HTML. It is a subclass of HTML::Parser, and in my experience it's been a very good tool.

Here is some basic code, for this particular page and task.

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $url = "";
# LWP::Simple's get() returns undef on failure and does not set $!
my $page = get($url) or die "Can't load $url\n";

my $headers = [ 'Time', 'Temperature', 'Humidex' ];

my $tec = HTML::TableExtract->new(headers => $headers);
# Parse the fetched HTML; without this call there are no rows to iterate over
$tec->parse($page);

my $fmt = "%6s | %6s | %6s | %8s\n";    
printf($fmt, 'Time', 'T-high', 'T-low', 'Humidex');    

my ($time, $temp_hi, $temp_low, $hum);

foreach my $row ($tec->rows) {
    # Skip rows without expected data (the first cell must look like a time).
    next if !defined $row->[0] or $row->[0] !~ /^\s*\d?\d:\d\d/;
    # Trim leading/trailing whitespace on copies of the cells.
    my @row = map { my $c = $_ // ''; $c =~ s/^\s+|\s+$//g; $c } @$row;
    # Process as needed
    ($time, $hum) = @row[0,2];
    ($temp_hi, $temp_low) = $row[1] =~ /(\d+) .* \( (\d+\.\d+) \)/xs;
    printf($fmt, $time, $temp_hi, $temp_low, $hum);
}

The first few rows of output

  Time | T-high |  T-low |  Humidex
 16:00 |     29 |   29.2 |       37
 15:00 |     27 |   27.2 |       37
 14:00 |     26 |   25.6 |       33
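
To see what the whitespace trimming and the temperature regex in the loop do, here is a small standalone sketch that needs no network access or extra modules. The cell strings in @cells are made up for illustration; I am assuming the Temperature cell comes back as a whole-degree value followed by a decimal value in parentheses, which is what the output above suggests. The helper name split_temp is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical cell text (an assumption, not copied from the live page):
# time, temperature as "whole (decimal)" with stray whitespace, and humidex.
my @cells = ("  7:00 ", "\n 23\n (22.9)\n ", " 30 ");

# Trim copies of the cells so the original strings are left untouched.
my @row = map { my $c = $_; $c =~ s/^\s+//; $c =~ s/\s+$//; $c } @cells;

# Split a temperature cell into its two numbers.
# In list context this returns an empty list when the cell does not match.
sub split_temp {
    my ($cell) = @_;
    return $cell =~ /(\d+) .* \( (\d+\.\d+) \)/xs;
}

my ($temp_hi, $temp_low) = split_temp($row[1]);
printf "%s | %s | %s | %s\n", $row[0], $temp_hi, $temp_low, $row[2];
```

Note that the pattern as written captures digits only, so a leading minus sign on a below-zero reading would be dropped. If that matters, something like -?\d+ in both groups may be a reasonable adjustment.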