
scrapy - parsing items that are paginated

I have a URL of the form:

example.com/foo/bar/page_1.html


There are 53 pages in total, each of which has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also follows each item's link one level deeper to get more info about it:

# imports used by these spider methods (old-style Scrapy API, matching the code below)
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request


def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item


The question is: how do I crawl each of the 53 pages?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

Answer

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return; that way you can issue more than one request from a single callback. See the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example. A sketch of that change follows.
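For illustration, here is a minimal sketch of your parse method with that change applied. Everything here is taken from your own code; the only substantive difference is that each request is yielded instead of returned, so the loop runs to completion and covers every row:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # ... fill in category and urbanization as before ...

        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        # yield instead of return: the loop keeps going, and one
        # profile request per restaurant gets scheduled
        yield request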

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern, like this:

class MySpider(BaseSpider):
    # pages are numbered 1..53, so build every start URL up front
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]
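Scrapy calls your parse method on the response for each of these start URLs, so combined with the yield change above you end up with all ~53*20 items. Building the list up front only works because the page count (53) is known in advance; if it weren't, a variant of the first option would follow the site's "next page" link instead. A sketch of that, assuming a hypothetical next link (adjust the XPath to the actual markup):

    for href in hxs.select('//a[@class="next"]/@href').extract():
        yield Request(urljoin_rfc(get_base_url(response), href), callback=self.parse)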