Artturi Björk - 1 month ago
Python Question

Getting scrapy to follow specific links on a page

I'm trying to scrape lyrics from The Original Hip Hop Lyrics Archive.

I've managed to write a spider that scrapes an artist's lyrics when I start it on an artist page such as this one: http://www.ohhla.com/anonymous/aesoprck/.

But when I start it on http://www.ohhla.com/all.html, a page that links to the different artist pages, I get nothing.

This is the rule that I'm trying to use to follow the links to artist pages:

Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow= True)


And this is the rule I'm trying to use to follow the links to the other index pages, which in turn link to artist pages:

Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow= True)


I modified the Scrapy tutorial project to get this to work, since for some reason it didn't work when I started a new project.

Here is my complete working example of the spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ohhlaSpider(CrawlSpider):
    name = "ohhla"
    download_delay = 0.5
    allowed_domains = ["ohhla.com"]
    start_urls = ["http://www.ohhla.com/anonymous/aesoprck/"]
    rules = (
        # trying to follow links to pages with more links to artist pages
        Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow=True),
        # trying to follow links to artist pages
        Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow=True),
        # succeeding in following links to album pages
        Rule(LinkExtractor(deny_extensions=("txt"), restrict_xpaths=('//ul/li',)), follow=True),
        # succeeding in extracting lyrics from the songs on album pages
        Rule(LinkExtractor(restrict_xpaths=('//ul/li',)), callback="extract_text", follow=False),
    )

    def extract_text(self, response):
        """ extract text from webpage"""
        string = response.xpath('//pre/text()').extract()[0]
        with open("lyrics.txt", 'wb') as f:
            f.write(string)

Answer

restrict_xpaths should not point to the @href attribute. It should point to the element (or elements) inside which the link extractor should search for links:

Rule(LinkExtractor(restrict_xpaths='//h3'), follow=True)

Note that you can specify it as a string instead of a tuple.


You can also allow every link whose URL matches all*.html:

Rule(LinkExtractor(allow=r'all.*?\.html'), follow=True)
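The allow pattern is a regular expression that is searched for anywhere in the absolute URL, so it can be checked in isolation with the re module. The sketch below does that against a few sample URLs. Apart from all.html and the artist page from the question, the URLs are hypothetical, and the stricter pattern is only a suggestion for the case where the loose one matches too much.

```python
import re

# The pattern from the rule above, applied with search() so it may match
# anywhere in the URL.
ALLOW = re.compile(r"all.*?\.html")

# A tighter variant (an assumption, not part of the original answer):
# the file name itself must start with "all" and end with ".html".
STRICT = re.compile(r"/all[^/]*\.html$")

candidates = [
    "http://www.ohhla.com/all.html",
    "http://www.ohhla.com/all_two.html",        # hypothetical index page
    "http://www.ohhla.com/anonymous/aesoprck/",
    "http://www.ohhla.com/hall_of_fame.html",   # hypothetical, contains "all"
]

followed = [url for url in candidates if ALLOW.search(url)]
followed_strict = [url for url in candidates if STRICT.search(url)]

print(followed)
print(followed_strict)
```

Because the original pattern is unanchored, it also accepts the hall_of_fame.html sample. That is harmless if the site has no such pages, but anchoring the pattern avoids surprises.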

You should also make sure your spider actually visits that "Parent Directory" page. Starting the crawl there makes sense, since it is the index page for the catalog:

start_urls = ["http://www.ohhla.com/all.html"]