Ed de Almeida - 6 months ago
Ruby Question

How to get only the text of an element which contains other elements with XPath?

I am parsing a document with Nokogiri, using XPath. More specifically, I am interested in the contents of a list whose structure is

<ul>
  <li>
    <div>
      <!-- Some data I'm not interested in -->
    </div>
    <span>
      <a href="some_url">A name I already got easily</a>
      <br>
      Some text I need to get but just can't
    </span>
  </li>
  <li>
    <div>
      <!-- Some data I'm not interested in again -->
    </div>
    <span>
      <a href="some_other_url">Another name I already got easily</a>
      <br>
      Some other text I need to get but just can't
    </span>
  </li>
  .
  .
  .
</ul>


I'm doing this:

require 'nokogiri'
require 'ostruct'

# doc is the already-parsed Nokogiri document
politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
  politician = OpenStruct.new
  politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
  politician.url = row.at_xpath('span/a/@href').to_s.strip
  # Problem line: this serializes the whole <span>, markup included
  politician.party = row.at_xpath('span').to_s.strip
  politicians.push(politician)
end


This works fine for politician.name and politician.url, but when it comes to politician.party, which is the text after the <br> tag, I just can't isolate the text. Using row.at_xpath('span').to_s.strip gives me all the contents of the <span> tag, including the other HTML elements.

Any suggestions about how to get this text?

Answer

span/text() returns empty because the first text node within the <span> is whitespace (a newline and spaces) between the span's opening tag and the <a> element. at_xpath picks that first node, and strip reduces it to an empty string. Try the following XPath instead:

span/text()[normalize-space()]

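The [normalize-space()] predicate keeps only text nodes that still have content after whitespace is trimmed, so this XPath should return the non-empty text nodes that are direct children of the <span>.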