marcamillion - 4 months ago
Ruby Question

How do I parse Nokogiri elements stored in an array element?

I crawled a page and stored elements from the page into an array.

If I inspect the first element:

puts "The inspection of the first my_listing: "
puts my_listing.first.first.inspect

The output is:

The inspection of the first my_listing:
#<Nokogiri::XML::Element:0x80c58764 name="p" children=[#<Nokogiri::XML::Text:0x80c584e4 " May 4 - ">, #<Nokogiri::XML::Element:0x80c58494 name="a" attributes=[#<Nokogiri::XML::Attr:0x80c58340 name="href" value="">] children=[#<Nokogiri::XML::Text:0x80c57f08 "residual income No experience is needed!!!">]>, #<Nokogiri::XML::Text:0x80c57da0 " - ">, #<Nokogiri::XML::Element:0x80c57d50 name="font" attributes=[#<Nokogiri::XML::Attr:0x80c57bfc name="size" value="-1">] children=[#<Nokogiri::XML::Text:0x80c577c4 " (online)">]>, #<Nokogiri::XML::Text:0x80c5765c " ">, #<Nokogiri::XML::Element:0x80c5760c name="span" attributes=[#<Nokogiri::XML::Attr:0x80c574b8 name="class" value="p">] children=[#<Nokogiri::XML::Text:0x80c57080 " img">]>]>

How do I access each element? For instance, how do I access the first
element in this object which would be 'May 4 - '?

If I do:

puts my_listing.first.first.text

I get this output:

May 4 - residual income No experience is needed!!! - (online) img

Also, how do I access the


which does not work.


Please note that Nokogiri treats everything as a node - be it text, an attribute, or an element. Your document has one child:

irb(main):014:0> my_listing.children.size
=> 1
irb(main):015:0> puts my_listing.children
<p> May 4 - <a href="">residual income No
experience is needed</a> - <font size="-1"> (online)</font> <span class="p">
=> nil

By the way, puts calls to_s, which serializes the element together with all of its children, and text likewise concatenates the text of every descendant - this is why you see more text than you want.

If you go deeper to see the children of that single element, you have:

irb(main):017:0> my_listing.children.first.children.size
=> 6
irb(main):018:0> puts my_listing.children.first.children
 May 4 - 
<a href="">residual income No
experience is needed</a>
<font size="-1"> (online)</font>

<span class="p"> img</span>
=> nil

To get what you are asking about, keep going down the hierarchy:

irb(main):022:0> my_listing.children.first.children[0]
=> #<Nokogiri::XML::Text:0x..fd9d1210e " May 4 - ">
irb(main):023:0> my_listing.children.first.children[0].text
=> " May 4 - "
irb(main):024:0> my_listing.children.first.children[1]['href']
=> ""