plotplot - 1 month ago
Python Question

Scrapy XPath selector

I am scraping this site with Scrapy, but I am having trouble with the XPath and I'm not entirely sure what is going on.

Why does this work:

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item


and the following code not?

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item


I also want to extract the content matched by this XPath:

//div[@id="description"]/p


But I can't, because it is outside the h1 node. How can I achieve this? My full code is:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bot.items import BotItem


class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'),
            #callback='parse_start_url',
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()

        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item

Answer

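The for title in response.xpath('//body'): version does not work because the XPath expressions inside the loop are relative to body: h1/strong/text() only matches an h1 element that is a direct child of the body element. If the h1 is nested deeper in the page, nothing matches. To search at any depth below the current node, the expression would need to start with .// (for example .//h1/strong/text()).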
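The difference between a child step and a descendant step can be sketched outside Scrapy. The example below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (it has no text() step, so .text is read instead), and the PAGE markup is a hypothetical, simplified guess at the page structure, not the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page (an assumption, not the real markup):
# the h1 sits inside a wrapper div, so it is a descendant of body
# but not a direct child of it.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1><strong>Job title</strong><span class="price">1000 EUR</span></h1>
      <div id="description"><p>Job description</p></div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)
body = root.find("body")

# Child step: only considers h1 elements that are direct children of body.
direct_titles = [el.text for el in body.findall("h1/strong")]

# Descendant step: considers h1 elements at any depth below body.
nested_titles = [el.text for el in body.findall(".//h1/strong")]

# The same descendant search reaches nodes outside the h1 as well.
wages = [el.text for el in body.findall(".//h1/span[@class='price']")]
descriptions = [el.text for el in body.findall(".//div[@id='description']/p")]

print(direct_titles)
print(nested_titles)
print(wages)
print(descriptions)
```

With this markup the child-step query finds nothing, while the three descendant-step queries each find one element. In Scrapy the analogous relative expression inside the loop would be title.xpath('.//h1/strong/text()').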

Moreover, since there is only one entity to extract per page, you don't need a loop here at all:

def parse_item(self, response):
    item = BotItem()

    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()

    return item

(This should also answer your second question about the description.)