K.K. K.K. - 3 months ago 26
Python Question

lxml XPath - extracting text from multi p nodes

Please have a look at "lxml XPath position() does not work" first.



Since XPath does not support extracting text from multiple nodes at once, I decided to write a for loop to get the text of 30 paragraphs.

for i in range(1,31):
    content = "string(//div[@id='article']/p[" + (print(i)) + "]/.)"
    print(content)


I expected it to print something like:

"string(//div[@id='article']/p[1]/.)"
"string(//div[@id='article']/p[2]/.)"
"string(//div[@id='article']/p[3]/.)"
....
"string(//div[@id='article']/p[30]/.)"


However, it does not work as I expected. I got the following error message:

TypeError: Can't convert 'NoneType' object to str implicitly


What should I do? Is there a more elegant way to solve this problem?

Answer

In Python 3, print is a function that writes to the screen and returns None, so concatenating its result with a string raises the TypeError you saw. (In Python 2, print is a statement, and the code would have raised a SyntaxError, since a statement cannot appear in the middle of an expression.) To build the string, use the format method instead:

content = "string(//div[@id='article']/p[{}]/.)".format(i)
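
As a minimal sketch of that fix, the loop from the question could be rewritten as below. The names `template` and `expressions` are illustrative and do not appear in the original code.

```python
# print() returns None, which is why "..." + print(i) + "..." fails.
result = print("side effect only")
assert result is None

# Build the 30 XPath expressions with str.format instead.
# `template` and `expressions` are hypothetical names for this sketch.
template = "string(//div[@id='article']/p[{}]/.)"
expressions = [template.format(i) for i in range(1, 31)]

print(expressions[0])    # string(//div[@id='article']/p[1]/.)
print(expressions[-1])   # string(//div[@id='article']/p[30]/.)
```

Each entry in `expressions` can then be passed to `xpath()` on a parsed document.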

And by the way, you should be able to use position() just fine with lxml. For instance,

import lxml.html as LH
content = '''\
    <bookstore>
      <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
      </book>
      <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
      </book>
      <book>
        <title lang="eng">Things Fall Apart</title>
        <price>19.99</price>
      </book>
      <book>
        <title lang="eng">Blood Meridian</title>
        <price>9.99</price>
      </book>
    </bookstore>'''
root = LH.fromstring(content)

# Compare with http://stackoverflow.com/a/39242701/190597
print(root.xpath('//book[position()>=1 and position()<=last()]/title/text()'))
# ['Harry Potter', 'Learning XML', 'Things Fall Apart', 'Blood Meridian']

# But note that it is equivalent to 
print(root.xpath('//book/title/text()'))
# ['Harry Potter', 'Learning XML', 'Things Fall Apart', 'Blood Meridian']

print(root.xpath('//book[position()<3]/title/text()'))

prints

['Harry Potter', 'Learning XML']

which shows that you can select the first N books without having to loop.


As Tomalak mentions, the XPath string function only returns the string representation of the first node. For example,

print(root.xpath('string(//book[position()<3]/title/text())'))

only prints

Harry Potter

If you want a list of strings, then don't wrap the expression in string().

If, as Daniel Haley points out, the desired text is spread across text nodes and nested child elements, e.g. <title lang="eng">Harry <b>Potter</b></title>, then you can extract the desired text using the text_content method:

[title.text_content() for title in root.xpath('//book[position()<3]/title')]
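
Applied to the layout from the question, a sketch might look like the following. The sample HTML and the names `paragraphs`, `texts`, `via_string` and `first_two` are made up for illustration. Only the `div[@id='article']/p` structure is taken from the question.

```python
import lxml.html as LH

# Hypothetical sample document; the real page is not shown in the question.
html = """
<div id="article">
  <p>First <b>bold</b> paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third <i>and</i> last.</p>
</div>
"""
root = LH.fromstring(html)

# Select all <p> elements once, then extract the text of each one.
paragraphs = root.xpath("//div[@id='article']/p")
texts = [p.text_content() for p in paragraphs]

# Evaluating string(.) relative to each <p> gives the same text,
# because string() is applied to one node at a time.
via_string = [p.xpath("string(.)") for p in paragraphs]

# position() limits the selection without an explicit index loop.
first_two = [p.text_content()
             for p in root.xpath("//div[@id='article']/p[position()<3]")]

print(texts)
```

This selects the nodes with a single XPath query and does the text extraction in Python, so no per-index expression has to be built.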