cashman04 cashman04 - 4 years ago 328
Ruby Question

Get content after header tag with nokogiri

I am playing with nokogiri just to learn it and am trying to write a little CL scraper. Right now I am trying to match up each State on the main page with the cities underneath. Below is a snippet of the HTML:

<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li><a href="//auburn.craigslist.org/">auburn</a></li>
<li><a href="//bham.craigslist.org/">birmingham</a></li>
<li><a href="//dothan.craigslist.org/">dothan</a></li>
<li><a href="//shoals.craigslist.org/">florence / muscle shoals</a></li>
<li><a href="//gadsden.craigslist.org/">gadsden-anniston</a></li>
<li><a href="//huntsville.craigslist.org/">huntsville / decatur</a></li>
<li><a href="//mobile.craigslist.org/">mobile</a></li>
<li><a href="//montgomery.craigslist.org/">montgomery</a></li>
<li><a href="//tuscaloosa.craigslist.org/">tuscaloosa</a></li>
</ul>
<h4>Alaska</h4>
<ul>
<li><a href="//anchorage.craigslist.org/">anchorage / mat-su</a></li>
<li><a href="//fairbanks.craigslist.org/">fairbanks</a></li>
<li><a href="//kenai.craigslist.org/">kenai peninsula</a></li>
<li><a href="//juneau.craigslist.org/">southeast alaska</a></li>
</ul>


I can already pull out just this div class of "colmask" easy enough. But now I am just trying to get the UL directly after each h4, but can't find a way to do it so far. Suggestions?

Answer Source

You can get ul elements after h4 using following-sibling:

require 'nokogiri'

html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li><a href="//auburn.craigslist.org/">auburn</a></li>
<li><a href="//bham.craigslist.org/">birmingham</a></li>
<li><a href="//dothan.craigslist.org/">dothan</a></li>
<li><a href="//shoals.craigslist.org/">florence / muscle shoals</a></li>
<li><a href="//gadsden.craigslist.org/">gadsden-anniston</a></li>
<li><a href="//huntsville.craigslist.org/">huntsville / decatur</a></li>
<li><a href="//mobile.craigslist.org/">mobile</a></li>
<li><a href="//montgomery.craigslist.org/">montgomery</a></li>
<li><a href="//tuscaloosa.craigslist.org/">tuscaloosa</a></li>
</ul>
<h4>Alaska</h4>
<ul>
<li><a href="//anchorage.craigslist.org/">anchorage / mat-su</a></li>
<li><a href="//fairbanks.craigslist.org/">fairbanks</a></li>
<li><a href="//kenai.craigslist.org/">kenai peninsula</a></li>
<li><a href="//juneau.craigslist.org/">southeast alaska</a></li>
</ul>
EOF

doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
  puts node.to_html
end

To select ul after an h4 with exact text:

puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download