So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
This spider would start crawling example.com's home page, collecting category links and item links, and parsing the latter with the parse_item method. For each item response, some data is extracted from the HTML using XPath, and an Item is filled with it.
CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.
The idea is that "somehow" you have to go into each category, searching for links that correspond to the product/item information you want to extract. Those product links are the ones specified in the second rule of that example (the ones that have item.php in the URL).
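Under the hood, the `allow` and `deny` values are regular expressions that LinkExtractor searches for in each discovered URL. A rough stdlib-only illustration of how the two rules classify URLs (the example URLs are made up):

```python
import re

# The same patterns as in the two rules.
allow_category = re.compile(r'category\.php')
deny_subsection = re.compile(r'subsection\.php')
allow_item = re.compile(r'item\.php')

def first_rule_matches(url):
    """True if the navigation rule (rule 1) would pick up this URL."""
    return bool(allow_category.search(url)) and not deny_subsection.search(url)

def second_rule_matches(url):
    """True if the item rule (rule 2) would pick up this URL."""
    return bool(allow_item.search(url))
```

So `http://www.example.com/category.php?id=3` matches the first rule, the same URL with `subsection.php` in it matches neither, and `http://www.example.com/item.php?id=42` matches only the second rule.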
Now, how should the spider keep visiting links until it finds those containing item.php? That's what the first rule is for. It says to visit every link containing category.php but not subsection.php, which means it won't exactly extract any "item" from those links, but it defines the path for the spider to find the real items.
That's why the rule doesn't contain a callback method: it won't return the link's response for you to process, because the link will be followed directly.
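Scrapy's Rule makes that default concrete: `follow` defaults to True when no callback is given and to False when one is set. A tiny stdlib-only sketch of that behaviour (an illustration of the documented default, not Scrapy's actual source):

```python
def effective_follow(callback=None, follow=None):
    """Mimic Rule's default: when follow isn't given, it is True for a
    pure navigation rule (no callback) and False once a callback is set."""
    if follow is None:
        return callback is None
    return follow
```

So the category.php rule keeps following links silently, while the item.php rule stops at parse_item, unless you also pass `follow=True` explicitly to process item pages *and* keep crawling from them.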