atefth - 29 days ago
Perl Question

Parse Html Audio Tag Using HTML::TokeParser


I am trying to write a spider in Perl that will parse all audio tags in a domain and attempt to download the respective audio/mpeg content from each audio tag found.


Below is a snippet from my code that uses HTML::TokeParser to parse HTML and extract links from a tags:

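use LWP::UserAgent;
use HTML::TokeParser;
use URI;

my($response, $base, $stream, $pageURL, $tag, $url);

# Fetch the page first: base(), content_ref() and request() are methods of
# an HTTP::Response object, not of a URL string.
$response = LWP::UserAgent->new->get('http://example.com/page-with-some-audio-content');
$base = URI->new( $response->base )->canonical;

$stream = HTML::TokeParser->new( $response->content_ref );
$pageURL = URI->new( $response->request->uri );

# Print the href attribute of every <a> start tag
while($tag = $stream->get_tag('a')) {
    next unless defined($url = $tag->[1]{'href'});
    print $url."\n";
}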



The above snippet extracts all links from a given HTML page. It is used in a loop, along with a hash of URLs, to crawl all pages in a given domain.


Below is another snippet, almost identical to the first, except that I'm trying to extract audio tags instead of a tags:

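use LWP::UserAgent;
use HTML::TokeParser;
use URI;

my($response, $base, $stream, $pageURL, $tag, $url);

# Fetch the page first: base(), content_ref() and request() are methods of
# an HTTP::Response object, not of a URL string.
$response = LWP::UserAgent->new->get('http://example.com/page-with-some-audio-content');
$base = URI->new( $response->base )->canonical;

$stream = HTML::TokeParser->new( $response->content_ref );
$pageURL = URI->new( $response->request->uri );

# Print the onplaying attribute of every <audio> start tag
while($tag = $stream->get_tag('audio')) {
    next unless defined($url = $tag->[1]{'onplaying'});
    print $url."\n";
}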


For some reason, no audio tags are being detected. Is there something that I'm missing?





From reading the HTML::TokeParser documentation, I gather that I cannot extract attributes of nested HTML elements.


Consider this markup below:

<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>


I want to parse the entire HTML to extract only the src attributes of all audio tags found. Hence, if the HTML looked something like this:

<body>

<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>

<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 2.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%202.mp3">
</audio>

<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 3.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%203.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 4.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%204.mp3">
</audio>

</body>


The expected output should be like this:

http://www.example.com/mp3/Some%20Mp3%20File.mp3

http://www.example.com/mp3/Some%20Mp3%20File%202.mp3

http://www.example.com/mp3/Some%20Mp3%20File%203.mp3

http://www.example.com/mp3/Some%20Mp3%20File%204.mp3



So I need to parse HTML files to extract only the src attribute of each audio tag present.

Answer

I'm not familiar with HTML::TokeParser, but Mojo::DOM from Mojolicious can find and extract the links easily using familiar CSS selector syntax:

use Mojo::DOM;
my $html = '<body> ... ';
my $dom = Mojo::DOM->new($html);
my @src = map { $_->{src} }
    $dom->find('audio[onplaying] source[src]')->each;

You can also combine this with Mojo::UserAgent if you need to fetch the HTML pages or the audio files over the network.
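
For anyone who would rather stay with HTML::TokeParser from the question, something along these lines may work. It is only a sketch: in the sample markup the src attribute sits on the nested source element rather than on audio itself, so the loop asks get_tag for audio, /audio and source, and keeps a flag for whether it is currently inside an audio element. The inline $html string, the extra video element (added here only to show that sources outside audio are skipped), and the names $in_audio and @src are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup modelled on the question; the <video> element is hypothetical.
my $html = <<'HTML';
<body>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>
<video>
<source src="http://www.example.com/video/clip.mp4">
</video>
<audio>
<source src="http://www.example.com/mp3/Some%20Mp3%20File%202.mp3">
</audio>
</body>
HTML

# HTML::TokeParser->new accepts a reference to a string holding the document.
my $stream = HTML::TokeParser->new(\$html);

my $in_audio = 0;   # true while we are between <audio> and </audio>
my @src;            # collected src values

# get_tag takes a list of tag names; end tags are requested as "/name".
while (my $tag = $stream->get_tag('audio', '/audio', 'source')) {
    my ($name, $attr) = @$tag;
    if ($name eq 'audio') {
        $in_audio = 1;
    }
    elsif ($name eq '/audio') {
        $in_audio = 0;
    }
    elsif ($in_audio && defined $attr->{src}) {
        # A <source> start tag inside an <audio> element
        push @src, $attr->{src};
    }
}

print "$_\n" for @src;
```

In the crawler, the same loop could presumably be fed $response->content_ref instead of \$html, with relative src values resolved against $base before downloading.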
