hjschmid - 2 months ago
Python Question

Scrapy SgmlLinkExtractor how to define XPath

I want to retrieve the city name and city code and store them in one string variable. The image shows the precise location:

[screenshot not preserved]

Google Chrome gave me the following XPath:


So I defined the following statement in Scrapy to get the desired information:

plz = response.xpath('//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span/text()').extract()

However, I was not successful: the result remains empty. What XPath definition should I use instead?

Sam

Most of the time, this happens because browsers correct invalid HTML, so the DOM you see in the inspector is not the markup that Scrapy receives. How do you fix this? Inspect the raw HTML source and write your own XPath that navigates the document with the shortest, simplest query.
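To make the browser-correction point concrete, here is a minimal sketch using lxml (the parser library Scrapy's selectors build on). The markup and the `cities` id are made up for illustration and are not taken from the site in the question. Browsers typically insert a `<tbody>` into tables that lack one, so an XPath copied from the inspector can contain a step that does not exist in the raw source:

```python
from lxml import html

# Hypothetical raw HTML as a server might send it: note there is no <tbody>.
RAW = (
    "<html><body>"
    "<table id='cities'><tr><td>Sampletown</td></tr></table>"
    "</body></html>"
)

doc = html.fromstring(RAW)

# Inspector-style path: assumes the <tbody> that the browser added to its DOM.
browser_style = doc.xpath('//*[@id="cities"]/tbody/tr/td/text()')

# Path written against the raw source: no assumption about inserted elements.
source_based = doc.xpath('//*[@id="cities"]//td/text()')

print(browser_style)  # empty, because the raw markup has no <tbody>
print(source_based)
```

The same kind of mismatch can happen with any element the browser adds or moves while repairing markup, which is why a long positional path like the one in the question is a likely suspect.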

I scrape a lot of data off of the web and I've never used an XPath as specific as the one you got from the browser. This is for a few reasons:

  1. It will fail quickly on invalid HTML or the most basic of hierarchy changes.
  2. It contains no identifying data for debugging an issue when the website changes.
  3. It's way longer than it should be.

Here's an example query for grabbing that element. There are many different XPath queries you could write to find this data, so I'd suggest learning XPath and rewriting this query so that the queries throughout your project follow common patterns:

//div[contains(@class, "detail-address")]//h2/following-sibling::span
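As a rough sketch of how that query could be used to build the single string the question asks for, the snippet below runs it with lxml against a made-up fragment. The sample markup, the `extract_city` helper, and the values in it are assumptions for illustration; the real page structure may differ, so check it against the raw source first.

```python
from lxml import html

# Hypothetical markup, loosely modelled on what the query expects.
SAMPLE = """
<html><body>
  <div class="box detail-address">
    <div>
      <h2>Address</h2>
      <span>12345</span>
      <span>Sampletown</span>
    </div>
  </div>
</body></html>
"""

QUERY = '//div[contains(@class, "detail-address")]//h2/following-sibling::span'


def extract_city(raw_html):
    """Join the text of every matching <span> into one string."""
    doc = html.fromstring(raw_html)
    spans = doc.xpath(QUERY)
    parts = [span.text_content().strip() for span in spans]
    return " ".join(part for part in parts if part)


plz = extract_city(SAMPLE)
print(plz)
```

Inside a Scrapy callback the same expression should work with `response.xpath(QUERY + '/text()').extract()`, followed by the same join.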

The other main source of this problem is sites that rely extensively on JS to modify what is shown on the screen. Conveniently, though, you would debug this the same way as above. As soon as you glance at the HTML returned on page load, you would notice that the data you are querying doesn't exist until JS executes. At that point, you would need to do some sort of headless browsing.
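One way to automate that "glance" is to check whether the text you expect shows up in the raw HTML outside of `<script>` and `<style>` blocks. This is only a sketch using the standard library; the helper names and the two sample pages are hypothetical, and in practice you would pass in the body of the response you actually downloaded.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def present_in_raw_html(raw_html, expected_text):
    """Return True if expected_text appears in the page's non-script text."""
    collector = VisibleTextCollector()
    collector.feed(raw_html)
    collector.close()
    return expected_text in " ".join(collector.chunks)


# Hypothetical pages: one rendered server-side, one filled in by JS.
STATIC_PAGE = '<div class="detail-address"><span>Sampletown</span></div>'
JS_PAGE = '<div id="app"></div><script>render({"city": "Sampletown"});</script>'

print(present_in_raw_html(STATIC_PAGE, "Sampletown"))
print(present_in_raw_html(JS_PAGE, "Sampletown"))
```

If the check fails for text you can see in the browser, no XPath will find it in the raw response, and rendering the page (or finding the underlying data request) is the next step.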

Since my answer was essentially "write your own XPath" (rather than relying on the browser), I'll leave some sources: