Shinon Chan - 26 days ago
Ruby Question

Nokogiri results different from browser inspect

I am trying to scrape a site, but the links Nokogiri returns are different from what I see when I inspect the page in the browser.

In my browser I get normal links, but from Nokogiri the href of every a tag becomes

javascript:void(0);

Here is the site:

https://www.ctgoodjobs.hk/jobs/part-time


Here is my code:

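require 'nokogiri'
require 'open-uri'

url = "https://www.ctgoodjobs.hk/jobs/part-time"
# Fetch the page once; response is nil if the request fails
response = URI.open(url) rescue nil
if response
  doc = Nokogiri::HTML(response)
  # Collect the href of each job-title link
  links = doc.search('.job-title > a').map { |a| a['href'] }
end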

Answer

It's not that easy: the URLs are "obscured" by a JS function, which is why you're getting javascript:void(0) when asking for the hrefs. Looking at the HTML, there are some hidden inputs for each link, and there is a preview URL that you can use to build the job preview URL (if that's what you're looking for). So you have this:

<div class="result-list-job current-view">
  <input type="hidden" name="job_id" value="04375145">
  <input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
  <h2 class="job-title"><a href="javascript:void(0);">Barista/ Senior Barista 咖 啡 調 配 員</a></h2>
  <h3 class="job-company"><a href="/company-jobs/pacific-coffee-company/00028652" target="_blank">PACIFIC COFFEE CO. LTD.</a></h3>
  <div class="job-description">
    <ul class="job-desc-list clearfix">
      <li class="job-desc-loc job-desc-small-icon">-</li>
      <li class="job-desc-work-exp">0-1 yr(s)</li>
      <li class="job-desc-salary job-desc-small-icon">-</li>
      <li class="job-desc-post-date">09/11/16</li>
    </ul>
  </div>
  <a class="job-save-btn" title="save this job" style="display: inline;"> </a>
  <div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
  <div class="job-cat job-cat-de"></div>
</div>

Then you can retrieve each job_id from those inputs, like:

 inputs = doc.search('//input[@name="job_id"]')

and then build the URLs (I found the base URL in joblist_preview.js):

 urls = inputs.map do |input|
   "https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
 end
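
If you build these URLs in more than one place, the string interpolation could be wrapped in a small helper. This is only a sketch: the names DETAILS_BASE and job_preview_url are made up for illustration, and the base URL and query parameters are simply the ones from the snippet above, so it assumes the site still accepts them. It uses only the standard library, and the sample ID is the one from the HTML shown earlier.

```ruby
require 'uri'

# Base URL taken from the answer above (hypothetical constant name)
DETAILS_BASE = "https://www.ctgoodjobs.hk/english/jobdetails/details.asp"

# Hypothetical helper: builds the preview URL for one job_id value.
# URI.encode_www_form escapes the values and joins the pairs with "&".
def job_preview_url(job_id)
  query = URI.encode_www_form(
    m_jobid: job_id,
    joblistmode: "previewlist",
    ga_channel: "ct"
  )
  "#{DETAILS_BASE}?#{query}"
end

# Keep the IDs as strings so the leading zero is preserved
job_ids = ["04375145"]
urls = job_ids.map { |id| job_preview_url(id) }
puts urls
```

With the Nokogiri document from before, the job_ids array would come from the value attribute of each input returned by the job_id search.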