Tupak Goliam - 10 months ago
Python Question

Scrapy - Follow RSS links

I was wondering whether anyone has ever tried to extract/follow RSS item links using
SgmlLinkExtractor/CrawlSpider. I can't get it to work...

I am using the following rule:

rules = (
    Rule(SgmlLinkExtractor(tags=('link',), attrs=False)),
)

(keeping in mind that RSS links are located in the text of the link tag).

I am not sure how to tell SgmlLinkExtractor to extract the text() of
the link tag instead of searching its attributes...

Any help is welcome,
Thanks in advance


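CrawlSpider rules don't work that way: link extractors such as SgmlLinkExtractor pull URLs from tag attributes (href by default), whereas an RSS feed puts the URL in the text of the link element. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example: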

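from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

# Note: BaseSpider and XmlXPathSelector are names from older Scrapy releases;
# newer releases use scrapy.Spider and response.xpath() instead.
class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://blog.scrapy.org/rss.xml']  # example feed URL

    def parse(self, response):
        # In RSS the URL is the text of each <link> element, not an attribute.
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        return [Request(x, callback=self.parse_link) for x in links]

    def parse_link(self, response):
        # Placeholder callback: handle each linked page here.
        pass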

You can also try the XPath in the Scrapy shell, for example by running:

scrapy shell http://blog.scrapy.org/rss.xml

Then type in the shell:

>>> xxs.select("//link/text()").extract()
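One thing to watch for: //link/text() matches every link element, so in a typical RSS 2.0 feed it also returns the channel's own link, not just the item links. Here is a minimal sketch of the difference. It uses the standard library's xml.etree.ElementTree rather than Scrapy, and the sample feed and its example.com URLs are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, made up for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Rough equivalent of //link/text(): every <link>, including the channel's own.
all_links = [el.text for el in root.iter("link")]

# Rough equivalent of //item/link/text(): only the per-item links.
item_links = [el.text for el in root.findall(".//item/link")]

print(all_links)   # three URLs, the first being the channel link
print(item_links)  # only the two item URLs
```

If you only want to follow the items, the narrower XPath to use in the spider would be //item/link/text().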