MakoBuk - 1 month ago
JavaScript Question

JavaScript - regex order doesn't matter but existence required

I want to get the content of the canonical link from a page. The code runs in Node.js on the server (without a DOM). I have the complete response body (the downloaded page) and the following code:

var metaRegex = new RegExp(/<link.*?href=['"](.*?)['"].*?rel=['"]canonical['"].*?>/i);
// returns the correct URL: https://support.google.com/recaptcha/?hl=en
// var metaRegex = new RegExp(/<link(?=.*rel=['"]canonical['"])(?=.*href=['"](.*?)['"]).*?>/i);
// returns the wrong URL: https://www.google.com/accounts/TOS
var metaTag = metaRegex.exec(body);
console.log(metaTag[1]);



The problem with the first expression is that it depends on the order of the rel and href attributes. It matches only:

<link href="https://support.google.com/recaptcha/?hl=en" rel="canonical">


and NOT

<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">


The second expression accepts both orderings, but it captures the last occurrence of href instead of the one inside the canonical link.

It looks as if I should require both attributes to be present and perhaps group them?

What is the correct way?

Answer

Just use two regular expressions in sequence, like this:

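var body = '<link rel="stylesheet" href="my.css"/> <link href="https://support.google.com/recaptcha/?hl=en" rel="canonical"/> <a href="https://www.google.com/accounts/TOS"/>';
// Step 1: match the whole <link> tag with rel="canonical"; [^>]* keeps the match inside one tag.
var linkRegexp = /(<link[^>]*rel=['"]canonical['"][^>]*>)/i;
// Step 2: capture the href value from that tag only.
var hrefRegexp = /href=['"](.*?)['"]/i;

// exec() returns null when nothing matches, so check before indexing.
var linkMatch = linkRegexp.exec(body);
var hrefMatch = linkMatch && hrefRegexp.exec(linkMatch[1]);
console.log(hrefMatch ? hrefMatch[1] : null);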
  • linkRegexp - matches the <link> tag that has rel="canonical"
  • hrefRegexp - extracts the href value from that tag
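
The two steps above can be wrapped in a small function that works for either attribute order and returns null when the page has no canonical link. This is only a sketch: getCanonicalHref is a hypothetical name, not part of the original answer, and it assumes quoted attribute values, a rel value of exactly "canonical", and no ">" character inside the tag's attributes.

```javascript
// Hypothetical helper (not from the original answer): returns the canonical
// URL found in an HTML string, or null when no canonical link is present.
function getCanonicalHref(html) {
  // Step 1: isolate the <link> tag carrying rel="canonical".
  // [^>]* cannot cross a ">", so the match stays inside a single tag.
  var linkMatch = /<link\b[^>]*\brel=['"]canonical['"][^>]*>/i.exec(html);
  if (!linkMatch) {
    return null;
  }
  // Step 2: read href from that tag only, so attribute order is irrelevant.
  var hrefMatch = /\bhref=['"]([^'"]*)['"]/i.exec(linkMatch[0]);
  return hrefMatch ? hrefMatch[1] : null;
}

// The two orderings from the question, each followed by an unrelated href.
var hrefFirst = '<link href="https://support.google.com/recaptcha/?hl=en" rel="canonical">';
var relFirst = '<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">';
var tail = ' <a href="https://www.google.com/accounts/TOS">Terms</a>';

console.log(getCanonicalHref(hrefFirst + tail));
console.log(getCanonicalHref(relFirst + tail));
console.log(getCanonicalHref(tail));
```

Using [^>]* instead of .* is what stops the search from running past the end of the link tag and picking up an href from a later element. For pages with unusual markup, an HTML parser is likely to be more robust than any regular expression.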