Rubyx Rubyx - 1 month ago 9
Ruby Question

How to extract text after <br> using Mechanize

I want to extract text after the first <br> (State).

The HTML code is:

<div class="location">
Country
<br>
State
<br>
City
</div>


Currently I can extract all the <div> text with:

a = Mechanize.new
page = a.get(url)
state = page.at('.location').text
puts state


Any ideas?

Answer

It's easy, but you have to understand how Nokogiri represents a document in its DOM.

There are tags, which are Element nodes, and the intervening text, which is held in Text nodes:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>
EOT

doc.at('.location br').next_sibling.text.strip # => "State"

Here's what Nokogiri says <br> is:

doc.at('.location br').class # => Nokogiri::XML::Element

And the following Text node:

doc.at('.location br').next_sibling.class # => Nokogiri::XML::Text

And how we access the content of the text node:

doc.at('.location br').next_sibling.text # => "\n    State\n    "

And again, looking at the <div> tag and its next sibling node:

doc.at('.location').class # => Nokogiri::XML::Element
doc.at('.location').next_sibling.class # => Nokogiri::XML::Text
doc.at('.location').next_sibling # => #<Nokogiri::XML::Text:0x3fcf58489c7c "\n">

By the way, you can access Mechanize's Nokogiri parser to play with the DOM using something like:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com')
doc = page.parser

doc.class # => Nokogiri::HTML::Document
doc.title # => "Example Domain"