qwert-e - 2 months ago
Python Question

Union of node and function on node in XPath

I am using Scrapy to crawl some webpages. I want to write an XPath query that will, within a parent <div>, append a couple of characters of text to any child <a> nodes, while extracting the text of the div's self node normally. Essentially it is like a normal descendant-or-self or // query, just written with | and calling the concat function on the descendants (which, if they exist, will be <a> tags).

These all return a value:


  1. my_div.xpath('div[@class="my_class"]/text()').extract()

  2. my_div.xpath('concat(\'@\', div[@class="my_class"]/a/text())').extract()

  3. my_div.xpath('div[@class="my_class"]/text() | div[@class="my_class"]/a/text()').extract()



However, attempting to combine (1) and (2) above in the format of (3):

my_div.xpath('div[@class="my_class"]/text() | concat(\'@\', div[@class="my_class"]/a/text())').extract()


results in the following error:

ValueError: XPath error: Invalid type in div[@class="my_class"]/text() | concat('@', div[@class="my_class"]/a/text())


How do I get XPath to recognize the union of a node with a function called on a node?

Answer

I think it doesn't work because concat doesn't return a node-set; it returns a string, and | can only combine node-sets (paths):

By using the | operator in an XPath expression you can select several paths.

as per http://www.w3schools.com/xsl/xpath_syntax.asp

Why not just split it into two? Generally you use ItemLoaders with your spider. So you can simply add as many paths and/or values as you like.

mil = MyItemLoader(response=response)
mil.add_xpath('name', 'xpath1')
mil.add_xpath('name', 'xpath2')
mil.load_item()
# {'name': ['values_of_xpath1', 'values_of_xpath2']}

If you want to preserve tree order you can try:

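nodes = my_div.xpath('div[@class="my_class"]')
text = []
for node in nodes:
    # default='' avoids a None entry (and a TypeError in join) when nothing matches
    text.append(node.xpath("text()").extract_first(default=''))
    text.append(node.xpath("a/text()").extract_first(default=''))
text = '@'.join(text)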

You can probably simplify it with a list comprehension, but you get the idea: extract the nodes, then iterate through them for both values.
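
If the div mixes several text runs and several <a> children, one way to keep everything in document order is to walk the element's own text, each child, and each child's tail. The sketch below is only an illustration. It uses the standard library's xml.etree.ElementTree so that it runs without Scrapy, and the function name text_in_document_order and the sample markup are made up for the example. In Scrapy the same walk should apply to the lxml element behind a selector (selector.root), since lxml elements expose the same .tag, .text and .tail attributes, but treat that as an assumption to check against your own pages.

```python
import xml.etree.ElementTree as ET


def text_in_document_order(div, prefix="@"):
    """Return the div's text pieces in document order, prefixing <a> text.

    Hypothetical helper for illustration; `div` is any ElementTree-style element.
    """
    parts = []
    # Text that appears before the first child element.
    if div.text:
        parts.append(div.text)
    for child in div:
        # Only direct <a> children get the prefix; other children are skipped.
        if child.tag == "a" and child.text:
            parts.append(prefix + child.text)
        # The tail is the div's own text that follows this child.
        if child.tail:
            parts.append(child.tail)
    return parts


# Sample markup invented for the example.
html = '<div class="my_class">Hello <a href="#">alice</a> and <a href="#">bob</a>!</div>'
div = ET.fromstring(html)
parts = text_in_document_order(div)
print(parts)           # ['Hello ', '@alice', ' and ', '@bob', '!']
print("".join(parts))  # Hello @alice and @bob!
```

Doing the prefixing in Python rather than in XPath sidesteps the type problem entirely: the XPath side only has to hand back nodes, and the string work happens afterwards.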