I've been working on a basic web crawler in Python using the HTMLParser class. I extract links with a modified handle_starttag method that looks like this:
def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]
The HTML I want to extract links from looks like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>
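For reference, here is a self-contained sketch of how that handler might be wired up and fed the snippet above. Hedged assumptions: this targets Python 3, where the module is html.parser rather than HTMLParser, and the constructor (which the question does not show) is made up here just to set baseUrl and links.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    # Assumed constructor: the question never shows how baseUrl and
    # links are initialized, so this is a guess for illustration.
    def __init__(self, baseUrl):
        HTMLParser.__init__(self)
        self.baseUrl = baseUrl
        self.links = []

    # The handler from the question, unchanged.
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    newUrl = urljoin(self.baseUrl, value)
                    self.links = self.links + [newUrl]

parser = LinkParser('https://news.ycombinator.com/')
parser.feed('<td class="title"><a href="http://www.stackoverflow.com">'
            'StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>')
print(parser.links)  # ['http://www.stackoverflow.com']
```

Note that this grabs every anchor on the page, not just the ones in title cells, which is exactly the limitation the answer below addresses.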
HTMLParser is a SAX-style or streaming parser, which means that you get pieces of the document as they are parsed, but not the whole document at once. The parser calls methods you provide to handle tags and other kinds of data. Any context you are interested in, such as which tags are nested inside other tags, you must glean yourself from the tags you see passing by.
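To make the streaming model concrete, here is a small sketch (Python 3, where the class lives in html.parser) that simply records the callbacks in the order the parser fires them; the EventLogger name is made up for illustration:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    # Record each callback as it fires, to show the streaming order.
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

p = EventLogger()
p.feed('<td><a href="x">link</a></td>')
print(p.events)
# [('start', 'td'), ('start', 'a'), ('data', 'link'), ('end', 'a'), ('end', 'td')]
```

Each event arrives on its own, with no memory of what came before, which is why you have to carry any context (such as "I am inside a td") in your own state.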
For example, if you see a <td> tag, then you know you are in a table cell, and can set a flag to that effect. When you see </td>, you know you have left a table cell and can clear that flag. To get the links inside a table cell, then, if you see <a> and you know that you are in a table cell (because of that flag you set), you grab the value of the tag's href attribute if it has one.
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.extracting = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)  # save us from iterating over the attrs
            if tag == "td" and attrs.get("class", "") == "title":
                self.extracting = True
            elif tag == "a" and "href" in attrs and self.extracting:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False
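For completeness, here is the same class running on Python 3 against the sample row from the question; the only change assumed is the import, since the module was renamed to html.parser (the class body is otherwise a straight transcription):

```python
from html.parser import HTMLParser  # Python 3 name of the module

class LinkExtractor(HTMLParser):
    # HTMLParser.__init__ calls reset(), so this also runs at construction.
    def reset(self):
        HTMLParser.reset(self)
        self.extracting = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)  # save us from iterating over the attrs
            if tag == "td" and attrs.get("class", "") == "title":
                self.extracting = True
            elif tag == "a" and "href" in attrs and self.extracting:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False

parser = LinkExtractor()
parser.feed('<td class="title"><a href="http://www.stackoverflow.com">'
            'StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>')
print(parser.links)  # ['http://www.stackoverflow.com']
```

Only the anchor inside the td with class "title" is collected; anchors elsewhere on the page are ignored because the extracting flag is False for them.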
This quickly gets to be a pain as you need more and more context to get what you want from the document, which is why people are recommending BeautifulSoup. Libraries like that are DOM-style parsers: they keep track of the document hierarchy for you and provide various friendly ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
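As a rough sketch of the difference (assuming the third-party beautifulsoup4 package is installed), the same extraction with BeautifulSoup collapses to a single CSS selector, with no flag bookkeeping at all:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = ('<td class="title"><a href="http://www.stackoverflow.com">'
        'StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>')

soup = BeautifulSoup(html, 'html.parser')
# One selector replaces the manual state tracking: anchors that have an
# href, nested inside a td whose class is "title".
links = [a['href'] for a in soup.select('td.title a[href]')]
print(links)  # ['http://www.stackoverflow.com']
```

Because the whole tree is in memory, "is this anchor inside a title cell?" becomes a query over the hierarchy rather than state you maintain by hand.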
BTW, I answered a similar question recently here.