PiperWarrior - 2 months ago
Ruby Question

Cannot access Nokogiri element within block

I run the following successfully:

require 'nokogiri'
require 'open-uri'

own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')

p own_table.css('tr').css('td')[4].css('a').attr('href').value


=> "/Archives/edgar/data/0001513362/000162828016019444/0001628280-16-019444-index.htm"

However, when I try to use the element above in a block (as shown in code below), I get a NoMethodError for nil:NilClass.

I am confused, because I thought that the local variable link in the block would be the same object as in the code above.

Furthermore, if I change the definition of link below to:

link = row.css('td')[4].class

I get a hash without error, saying the value of link is Nokogiri::XML::Element.

Can anyone explain why I have a Nokogiri::XML::Element object but cannot call the css method on it, especially when the same call works in the first snippet?

require 'nokogiri'
require 'open-uri'

own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')


own_table.css('tr').each do |row|
names = [:acq, :transaction_date, :execution_date, :issuer, :form, :transaction_type, :direct_or_indirect_ownership, :number_of_securities_transacted, :number_of_securities_owned, :line_number, :issuer_cik, :security_name, :url]
values = row.css('td').map(&:text)
link = row.css('td')[4].css('a').attr('href').value
values << link
hash = Hash[names.zip values]
puts hash
end

secown.rb:11:in `block in <main>': undefined method `css' for nil:NilClass (NoMethodError)
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
from secown.rb:8:in `<main>'

Answer

The crucial insight is that in the first case, own_table.css('tr') returns a NodeSet, so the following .css('td') collects every td that descends from any node in that set, and [4] then picks the one at index 4 across the whole table (the fourth speaking as a programmer, the fifth for normal people :P ).

The second snippet treats each row individually as a Node: row.css('td') finds only the td descendants of that one row, and [4] picks the one at index 4 among those.

So if you have this structure:

tr id=1
  td id=2
  td id=3
tr id=4
  td id=5
  td id=6
  td id=7
  td id=8
  td id=9

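then the first snippet will give you the td with id 7 (it sits at index 4 among the td of all tr combined). The second snippet looks for index 4 among the td of the tr with id 1, then among the td of the tr with id 4. It errors out on the first of these, because the tr with id 1 has only two td, so [4] returns nil and calling css on nil raises NoMethodError.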

Edit: Specifically, having checked your URL, the first tr has no td; all the others have 12. So own_table.css('tr')[0].css('td')[4].class is NilClass, not Nokogiri::XML::Element as you report.