KMH KMH - 3 months ago 18
HTML Question

Need an XPath expression for extracting a particular node along with its two siblings (if they are there)

There is an anchor tag, sometimes followed by one or two span tags. I have to select the anchor's href based upon an equality comparison with the text found in one of the following:

  1. All three tags (anchor, sibling span 1, and sibling span 2)

  2. Two tags (anchor and sibling span 1)

  3. Only the anchor tag

At any time, exactly one of the above will be true for a particular arrangement of anchor, sibling span 1 and sibling span 2. If the text is found in whichever of these arrangements applies, I want that anchor tag's href for further processing.

Example: Consider the following HTML snippet

<table class="table table-striped" width="95%">
<td ><span class="badge">P</span>
<a href="/abc" title="Title of anchor">some text</a>
<span style="font-weight:600;color:#666">ABC</span>
<span style="font-weight:600;color:#666">DEF</span>

Here I would like to get all the text from this arrangement of anchor, span and span, i.e. "some text ABC DEF". I will then check whether it contains my string, which happens to be ABC DEF (the full string must be present in the text). Since it does, I want the href of that anchor.


I would recommend checking them individually, as a single XPath expression could become very complicated and could even make your program slower.

Another tip would be to create a selector from just the part you know contains the necessary information (if the whole document is big enough, this would help a lot):

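from scrapy import Selector

# Build a small selector from just the table that holds the data.
# `response` is the Scrapy response passed to your spider callback.
sel = Selector(text=response.css('table.table').extract_first())
anchor_selector = sel.css('a')
anchor_text = anchor_selector.css('::text').extract_first()
# Text of every span that follows the anchor as a sibling.
span_siblings = anchor_selector.xpath('./following-sibling::span/text()').extract()
# now play with anchor_text and the list of span_siblings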