AlexBrand - 10 months ago
Python Question

scrapy - parsing items that are paginated

I have a url of the form:

There are a total of 53 pages, each one of them has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method, that parses a single page, and also goes one page deeper per item, to get more info about the item:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

        for rest in restaurants:
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            # some items don't have a category associated with them
            try:
                item['category'] = rest.select('td[3]/a/text()').extract()[0]
            except IndexError:
                item['category'] = ''
            item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

            # get profile url
            rel_url = rest.select('td[2]/a/@href').extract()[0]
            # join with base url since the profile url is relative
            base_url = get_base_url(response)
            follow = urljoin_rfc(base_url, rel_url)

            request = Request(follow, callback=self.parse_profile)
            request.meta['item'] = item
            return request

    def parse_profile(self, response):
        item = response.meta['item']
        # item['address'] = figure out xpath
        return item

The question is, how do I crawl each page?

Answer Source

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback; the spider examples in the Scrapy documentation show this pattern.
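To see why this matters for your spider: a return inside the for loop exits after the first row, while yield turns the callback into a generator that emits one request per row. A minimal sketch of that difference, with a stand-in Request class instead of Scrapy's (so it runs on its own):

```python
class Request:
    """Stand-in for scrapy.http.Request; only stores the URL."""
    def __init__(self, url):
        self.url = url

def parse_with_return(rows):
    for row in rows:
        return Request(row)  # loop exits on the very first row

def parse_with_yield(rows):
    for row in rows:
        yield Request(row)   # generator: one request per row

rows = ['/restaurant/1', '/restaurant/2', '/restaurant/3']
print(parse_with_return(rows).url)          # only '/restaurant/1'
print(len(list(parse_with_yield(rows))))    # all 3 rows become requests
```

In your parse method you would therefore replace `return request` with `yield request` so every restaurant row (and every pagination request you add) gets scheduled.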

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['' % page for page in xrange(1,54)]