eaglefreeman eaglefreeman - 2 months ago 20
HTML Question

extracting text xpath scrapy

Hi all, I would like to extract all the text from an HTML block using XPath in Scrapy.

Let's say we have a block like this:

<div>
<p>Blahblah</p>
<p><a>Bluhbluh</a></p>
<p><a><span>Bliblih</span></a></p>
</div>


I want to extract the text as ["Blahblah","Bluhbluh","Bliblih"]. I want XPath to recursively look for text in the div node.
I have tried:
//div/p[descendant-or-self::*]/text()
but it does not extract the text of nested elements.

Cheers!
Seb

Answer

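You can use XPath's string() function on each p element. It returns the element's string-value, which is the concatenated text of the element and all of its descendants: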

>>> import scrapy
>>> selector = scrapy.Selector(text="""<div>
...    <p>Blahblah</p>
...    <p><a>Bluhbluh</a></p>
...    <p><a><span>Bliblih</span></a></p> 
... </div>""")
>>> [p.xpath("string()").extract() for p in selector.xpath('//div/p')]
[[u'Blahblah'], [u'Bluhbluh'], [u'Bliblih']]
>>> import operator
>>> map(operator.itemgetter(0), [p.xpath("string()").extract() for p in selector.xpath('//div/p')])
[u'Blahblah', u'Bluhbluh', u'Bliblih']
>>> 
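
To see why the original /text() attempt misses nested text while string() does not, here is a minimal sketch using only Python's standard library as a stand-in for Scrapy (an assumption made so it runs without Scrapy installed). ElementTree is not a full XPath engine: p.text only approximates /text() by returning the text directly inside the element before its first child, and "".join(p.itertext()) mirrors what string() computes. The names HTML_BLOCK, direct_text and full_text are illustrative only.

```python
import xml.etree.ElementTree as ET

# Same block as in the question; it is well-formed, so an XML parser accepts it.
HTML_BLOCK = """<div>
<p>Blahblah</p>
<p><a>Bluhbluh</a></p>
<p><a><span>Bliblih</span></a></p>
</div>"""

root = ET.fromstring(HTML_BLOCK)   # root is the <div> element
paragraphs = root.findall("./p")   # the three <p> children

# Text sitting directly inside each <p>, similar to what /text() selects.
# Nested text is missed, so the last two entries are None.
direct_text = [p.text for p in paragraphs]

# Text of each <p> plus all of its descendants, joined together.
# This is the string-value that XPath's string() returns.
full_text = ["".join(p.itertext()) for p in paragraphs]

print(direct_text)
print(full_text)
```

In Scrapy itself on Python 3, the session above should translate to [p.xpath("string()").get() for p in selector.xpath('//div/p')], which avoids the operator.itemgetter step because get() returns the first match as a plain string.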