dtgee - 9 months ago
Python Question

XPath select image links - parent href link of img src only if it exists, else select img src link

I ran into a somewhat complicated XPath problem. Consider this HTML of part of a web page (I used Imgur and replaced some text):

<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</a>

I first want to find all img tags in the document and their corresponding src values. Next, I want to check whether the img src link contains an image file extension (.jpeg, .jpg, .gif, .png). If it doesn't contain an image extension, don't grab it. In this case it has an image extension, so now we need to decide which link to grab. Since the parent href exists, we should grab the parent's link.

Desired Result: //i.imgur.com/ahreflink.jpg

But now let's say the parent href doesn't exist:

<a name="missing! oh no!">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</a>

Desired Result: //i.imgur.com/imgsrclink.jpg

How do I go about constructing this XPath? If it helps, I am also using Python (Scrapy) with XPath. So if the problem needs to be separated out, Python can be used as well.


You don't have to do it in a single XPath expression. Here is a Scrapy-specific implementation that omits the image extension check (judging by the comments, you've already figured that out):

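# Select every img element that is a direct child of an a element.
images = response.xpath("//a/img")
for image in images:
    # href of the parent a element; None if the attribute is missing.
    a_link = image.xpath("../@href").extract_first()
    image_link = image.xpath("@src").extract_first()

    # Prefer the parent's href and fall back to the img's own src.
    print(a_link or image_link)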
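
For completeness, the extension check left out above could be handled in plain Python after extraction. The following is only a sketch: the helper names `has_image_extension` and `pick_link` are hypothetical, and it assumes that "contains an image extension" means the URL path ends with one of the four extensions from the question, ignoring case and any query string.

```python
from urllib.parse import urlparse

# Extensions listed in the question.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".gif", ".png")


def has_image_extension(link):
    """Return True if the link's path ends with a known image extension."""
    if not link:
        return False
    # urlparse also handles scheme-relative links such as //i.imgur.com/x.jpg
    path = urlparse(link).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)


def pick_link(a_href, img_src):
    """Hypothetical helper: choose which link to keep for one img element.

    Returns None when img_src is not an image link; otherwise prefers the
    parent's href and falls back to img_src.
    """
    if not has_image_extension(img_src):
        return None
    return a_href or img_src
```

Inside the loop, `pick_link(a_link, image_link)` would take the place of `a_link or image_link`, and a `None` result means the image should be skipped.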