Evgeny Evgeny - 1 year ago 188
Ruby Question

Getting only the text with Nokogiri for a certain HTML structure

I've been struggling with the Nokogiri library while trying to scrape content from the web, and I can't work out how to get only an element's own text, without the text of its nested tags.
Here is what I parse:

<div class="line1">text I need
<br><div class="podp_k">group:</div><a class="GR" title="go to this group" href="#" rel="?sectID=2">group 1</a>
<div class="podp_k">brand:</div><a class="BR" title="go to brand" href="#" rel="?sectID=0&amp;brand=16">China&nbsp;&nbsp;CHINA</a>

Here is how I scrape it:

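tagcloud_elements = nokogiri_object.css("div#products_tbody > table > tbody > tr > td > div.line1 > text()")
# Write each matched text node with its leading whitespace removed
tagcloud_elements.each { |tagcloud_element| f.puts tagcloud_element.text.gsub(/^\s+/,'') }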

The gsub at the end does almost exactly what I need, but it leaves a number of whitespace characters at the end. Can anybody suggest the best way to get only "text I need" from the above example?

Answer Source

I would delete the other nodes that are in this section if you're not using the document any further.

nokogiri_object.css("div.line1 *").each(&:remove)
nokogiri_object.at_css("div.line1").text.strip # => "text I need"