Christopher Allen - 3 months ago
Node.js Question

Cheerio Node.JS External Title link issue

I'm building a sample test scraper to learn about Cheerio and jQuery.

I'm scratching my head over a secondary request: after I have received a group of URLs and stored them, I want to make another request to load each of those URLs and pull the title from the header of that page.

My code looks like this.

var request = require('request'),
    cheerio = require('cheerio'),
    urls = [],
    titles = [];
request('http://reddit.com', function(err, resp, body){
    if(!err && resp.statusCode == 200){
        var $ = cheerio.load(body);

        $('a.title', '#siteTable').each(function(){
            var url = $(this).attr('href');
            urls.push(url);
        });
        //issue is here
        for(var i = 0; i < urls.length; i++){
            request(urls[i], function(err, resp, body){
                var $ = cheerio.load(body);

                var title = $("title").text();

                console.log(title);
            });
        }
    }
});


It seems that I'm reading a property of undefined somewhere when extracting the title from the page.

I must mention I am new to jQuery, so this code probably looks ridiculous (I'm assuming).

The error I receive from the console is:

TypeError: Cannot read property 'parent' of undefined
at Function.exports.update (/home/pi/node_modules/cheerio/lib/parse.js:55:25)
at module.exports (/home/pi/node_modules/cheerio/lib/parse.js:17:11)
at Function.exports.load (/home/pi/node_modules/cheerio/lib/static.js:19:14)
at Request._callback (/home/pi/scraper.js:16:22)
at self.callback (/home/pi/node_modules/request/request.js:187:22)
at Request.emit (events.js:95:17)
at Request.init (/home/pi/node_modules/request/request.js:275:17)
at new Request (/home/pi/node_modules/request/request.js:129:8)
at request (/home/pi/node_modules/request/index.js:55:10)
at Request._callback (/home/pi/scraper.js:15:6)


I understand that this error means some variable is undefined and I'm trying to read a property from it, like .someThing, but the error points to the callback function in the second request.

Any advice on how I could fix this?

Answer

One of the URLs returned looks like this:

/r/Jokes/comments/4yp0ex/mom_dont_freak_out_but_im_in_the_hospital/

There could be others, but looking at Reddit one can clearly see the anchor and its href:

<a class="title may-blank " href="/r/Jokes/comments/4yp0ex/mom_dont_freak_out_but_im_in_the_hospital/" tabindex="1" rel="">"Mom? Don't freak out, but I'm in the hospital..."</a>

Of course, passing request a URL with no protocol or domain fails: request hands an error to the callback, body is undefined, and cheerio.load(body) throws, so everything crashes.

You have to handle internal links by adding the domain to create absolute URLs. A simple way to do that would be something like:

for (var i = 0; i < urls.length; i++) {
  // Prefix relative links with the Reddit origin; leave absolute URLs untouched
  var uri = (/^(f|ht)tps?:\/\//i.test(urls[i]) ? "" : "https://www.reddit.com") + urls[i];

  request(uri, function(err, resp, body) {
    if (err) {
      // handle errors
    } else {
      var $ = cheerio.load(body);
      var title = $("title").text();

      console.log(title);
    }
  });
}
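
As an alternative to the regex test, Node's built-in WHATWG URL class can resolve an href against a base URL. The following is only a minimal sketch, assuming a Node version where URL is available as a global (10 or later); toAbsolute and BASE are hypothetical names introduced for illustration, not part of request or cheerio.

```javascript
// Hypothetical helper: resolve an href scraped from the page against a base URL.
// Assumes Node 10+ where the WHATWG URL class is available as a global.
var BASE = 'https://www.reddit.com';

function toAbsolute(href, base) {
  // attr('href') can yield undefined when an anchor has no href attribute.
  if (typeof href !== 'string') {
    return null;
  }
  try {
    // Absolute hrefs keep their own origin; relative ones are resolved against base.
    return new URL(href, base).href;
  } catch (e) {
    // The URL constructor throws on input it cannot parse (for example an invalid base).
    return null;
  }
}

console.log(toAbsolute('/r/Jokes/comments/4yp0ex/mom_dont_freak_out_but_im_in_the_hospital/', BASE));
console.log(toAbsolute('http://example.com/page', BASE));
```

Inside the loop, the result could replace the uri expression, skipping any entry for which the helper returns null.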

Running that, you'll see that after a few URLs you hit a "502 Bad Gateway". Now you have to handle that too, and probably many other things, as there's no guarantee that all the crappy links posted on Reddit actually work.
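
One way to handle those cases is to check the callback arguments before handing the body to cheerio. This is a rough sketch, assuming request's default behaviour of passing the body as a string; isUsableResponse is a hypothetical name, and the sample arguments below are made up to show the three outcomes.

```javascript
// Hypothetical guard: decide whether a response is worth handing to cheerio.
// err, resp and body are the three arguments request passes to its callback.
function isUsableResponse(err, resp, body) {
  if (err) {
    return false; // network failure or invalid URI
  }
  if (!resp || resp.statusCode !== 200) {
    return false; // for example a 502 Bad Gateway
  }
  // Assumes the body arrives as a string (request's default encoding).
  return typeof body === 'string' && body.length > 0;
}

// Made-up arguments standing in for what request might pass to the callback.
console.log(isUsableResponse(new Error('Invalid URI'), undefined, undefined));
console.log(isUsableResponse(null, { statusCode: 502 }, '<html></html>'));
console.log(isUsableResponse(null, { statusCode: 200 }, '<title>ok</title>'));
```

In the request callback, cheerio.load(body) would then run only when the guard returns true, and everything else could be logged or skipped.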