Awatatah - 9 days ago
Ruby Question

How to access multiple <p> tags one at a time

I have the following HTML:

<div id="test_id">
<p>Some words.</p>
<p>Some more words.</p>
<p>Even more words.</p>
</div>


If I parse the HTML using:

doc = Nokogiri::HTML(open("http://my_url"))


and run

doc.css('#test_id').text


in the console I get:

=> "Some words.\nSome more words.\nEven more words."


How do I get the first <p> element only?

I think I figured it out with .children:


doc.css('#test_id').children[0].text


Is this the correct way to do this?

Answer

The problem is that you're not using text on the right type of object.

If you're looking at a NodeSet, the documentation for text says:

Get the inner text of all contained Node objects

If you're looking at a single Node (for example, an Element), it says:

Returns the content for this Node

Here's the difference:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="test_id">
    <p>Some words.</p>
    <p>Some more words.</p>
    <p>Even more words.</p>
</div>
EOT

doc.search('p').class  # => Nokogiri::XML::NodeSet
doc.search('p').text  # => "Some words.Some more words.Even more words."

doc.at('p').class  # => Nokogiri::XML::Element
doc.at('p').text  # => "Some words."

at returns only the first matching node; it is like search(...).first.

Typically, if we want the text of a NodeSet we'd use:

doc.search('p').map(&:text)  # => ["Some words.", "Some more words.", "Even more words."]

which makes it easy to pick the text of a specific node.

doc.css('#test_id').children[0].text

Well, yeah, you can do that, but children isn't going to do the same thing:

doc.search('#test_id').children
# => [#<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">, #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Text:0x3fc315107f44 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Text:0x3fc315107b20 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>, #<Nokogiri::XML::Text:0x3fc3151076fc "\n">]
doc.search('#test_id').children[0] # => #<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">
doc.search('#test_id').children[1] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>

versus:

doc.search('#test_id p')
# => [#<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>]
doc.search('#test_id p')[0] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
doc.search('#test_id p')[1] # => #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>

Notice how children also returns the whitespace text nodes that sit between the tags, that is, the newlines and indentation used to format the HTML. Be aware that children returns every direct child node of the selected tag, text nodes included, not just elements. This is sometimes useful, but for general text retrieval it's probably not what you want.

Instead, use the more selective '#test_id p' selector and iterate over the returned NodeSet. That skips the formatting text nodes, so you don't have to account for them when slicing or indexing into the NodeSet.