initWithStyle - 1 month ago
Python Question

How to use Python's HTMLParser to extract specific links

I've been working on a basic web crawler in Python using the HTMLParser class. I collect links with an overridden handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]


This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links.

How would I go about fetching only the links that sit between the <td class="title"> and </td> tags, like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>

Answer

HTMLParser is a SAX-style, or streaming, parser, which means that you get pieces of the document as they are parsed, not the whole document at once. The parser calls methods you provide to handle tags and other types of data. Any context you are interested in, such as which tags are inside other tags, you must track yourself from the tags you see passing by.

For example, when you see a <td> tag, you know you are in a table cell and can set a flag to that effect. When you see </td>, you know you have left the cell and can clear the flag. Then, whenever you see an <a> tag while the flag is set, you grab the value of its href attribute, if it has one.

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class LinkExtractor(HTMLParser):

    def reset(self):
        # reset() is called by HTMLParser.__init__, so state is set up here
        HTMLParser.reset(self)
        self.extracting = False   # True while inside <td class="title">
        self.links      = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)   # save us from iterating over the attrs
        if tag == "td" and attrs.get("class", "") == "title":
            self.extracting = True
        elif tag == "a" and "href" in attrs and self.extracting:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False
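
As a usage sketch, the same flag technique can be combined with the urljoin call from the question so that relative links are resolved against a base URL. The class name TitleLinkExtractor, the base_url parameter, and the sample HTML and URLs below are illustrative assumptions, and the imports target Python 3.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags that appear inside <td class="title">."""

    def __init__(self, base_url):
        # base_url is a hypothetical parameter used to resolve relative links
        self.base_url = base_url
        HTMLParser.__init__(self)   # this calls reset()

    def reset(self):
        HTMLParser.reset(self)
        self.extracting = False     # True while inside <td class="title">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "title":
            self.extracting = True
        elif tag == "a" and self.extracting and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False


# Made-up sample: one title cell (wanted) and one other cell (ignored)
sample = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="/user?id=1">someone</a></td>'
    '</tr></table>'
)

parser = TitleLinkExtractor("http://example.com/")
parser.feed(sample)
parser.close()
print(parser.links)
```

Only the link inside the title cell is collected; the link in the second cell is skipped because the flag was cleared by the first </td>.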

This quickly gets to be a pain as you need more and more context to get what you want from the document, which is why people are recommending lxml and BeautifulSoup. These are DOM-style parsers that keep track of the document hierarchy for you and provide various friendly ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
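
To make that bookkeeping cost concrete with HTMLParser alone, here is a sketch that assumes a title cell might contain a nested table (something the question does not state). A single flag would be cleared by the inner </td>, so the sketch keeps a stack with one entry per open <td> instead. The class name NestedAwareExtractor and the sample markup are made up for illustration, and the import targets Python 3.

```python
from html.parser import HTMLParser


class NestedAwareExtractor(HTMLParser):
    """Like the flag version, but survives <td> cells nested in a title cell."""

    def reset(self):
        HTMLParser.reset(self)
        self.td_stack = []   # one bool per open <td>: is it a title cell?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.td_stack.append(attrs.get("class") == "title")
        elif tag == "a" and any(self.td_stack) and attrs.get("href"):
            # Some enclosing cell is a title cell, so keep this link
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()


# Made-up markup: a title cell that contains a nested table
html = (
    '<table><tr><td class="title">'
    '<table><tr><td><a href="/inner">inner</a></td></tr></table>'
    '<a href="/after">after nested table</a>'
    '</td></tr></table>'
    '<a href="/outside">outside</a>'
)

p = NestedAwareExtractor()
p.feed(html)
p.close()
print(p.links)
```

The link after the nested table is still collected here, whereas the single-flag version would miss it. Each extra rule of this kind adds more state to maintain by hand, which is the work a DOM-style parser does for you.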

BTW, I answered a similar question recently here.
