marcamillion - 1 month ago
Ruby Question

What is the difference between these two regexes to catch internal links on a page?

I would like to be able to match the anchor part of an internal link on a page, i.e.:

"#Welcome"
"#aboutus"
"#services"
"#contactus"


So to do that, I have tried both of these regexes:


  1. /#\w*\W*/
    - http://www.rubular.com/r/I3G9X7zkvS

  2. /#(\w*)(\W*)/
    - http://www.rubular.com/r/b4Eaar1Tn7



But if you visit each of those pages, you will notice that for some reason both skip the 2nd test string -- which I find odd.

So my question has three parts:


  1. Is there a difference between the two? If so, what is the difference?

  2. Why is neither of those catching the 2nd string in my test sample from those links?

  3. Are there any other rules I might need to properly capture any internal link of a document? Are internal links allowed to include symbols and other weird characters that these regexes are not capturing?


Answer

TL;DR: \W* is greedy and matches the newline character, so the match runs past the end of the line and consumes the # at the start of the next line. With that # already consumed, the second string can no longer match.

You can fix this by replacing \W* in your regex with [^\w\n]*, as in this regex:

/#(\w*)([^\w\n]*)/

Demo
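To see the difference outside of Rubular, here is a minimal sketch in plain Ruby. The sample text is an assumption: it mirrors the four anchors from the question, one per line, without the surrounding quotes.

```ruby
# Assumed sample input: one anchor per line, as in the question.
text = "#Welcome\n#aboutus\n#services\n#contactus"

original = /#\w*\W*/
fixed    = /#(\w*)([^\w\n]*)/

# With \W*, each match swallows the newline and the next "#",
# so every other line is skipped.
original_matches = text.scan(original)
# => ["#Welcome\n#", "#services\n#"]

# With [^\w\n]*, a match cannot cross a newline, so every line is found.
# scan returns one [name, trailing] pair per match because of the groups.
fixed_anchors = text.scan(fixed).map { |name, _trailing| name }
# => ["Welcome", "aboutus", "services", "contactus"]
```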

Your questions:

  1. Is there a difference between the two? If so, what is the difference?

The only difference is that the second regex uses capturing groups. Otherwise, they are the same.
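A short sketch of what the capturing groups change in practice. The input line is a made-up example; the two patterns are the ones from the question.

```ruby
# Hypothetical single line of input, with one trailing non-word character.
line = "#aboutus!"

plain   = /#\w*\W*/
grouped = /#(\w*)(\W*)/

plain_match   = line.match(plain)
grouped_match = line.match(grouped)

# Both patterns match exactly the same text overall.
plain_match[0]    # => "#aboutus!"
grouped_match[0]  # => "#aboutus!"

# Only the grouped version exposes the pieces separately.
plain_match.captures    # => []
grouped_match.captures  # => ["aboutus", "!"]

# String#scan reflects this too: strings without groups,
# arrays of captures with groups.
plain_scan   = line.scan(plain)    # => ["#aboutus!"]
grouped_scan = line.scan(grouped)  # => [["aboutus", "!"]]
```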

  2. Why is neither of those catching the 2nd string in my test sample from those links?

\W matches any single non-word character, that is, [^a-zA-Z0-9_], and \W* matches zero or more of them. This means it matches the newline character \n and then the # at the start of the next line. In other words, the match "wraps" onto the second line and consumes its #, so the regex has nothing left to match there. See these demos for your regexes: /#\w*\W*/ and /#(\w*)(\W*)/.
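You can inspect exactly what \W* swallowed by looking at the second capture group. This sketch assumes the test strings are quoted, as they are written in the question; the same wrapping happens without the quotes.

```ruby
# Assumed input: the first two test strings, quoted, one per line.
text = %("#Welcome"\n"#aboutus")

m = text.match(/#(\w*)(\W*)/)

anchor   = m[1]          # => "Welcome"
consumed = m[2]          # => "\"\n\"#"  (quote, newline, quote, and the next "#")
leftover = m.post_match  # => "aboutus\""

# The newline is a non-word character, which is why \W* does not stop at it.
newline_is_non_word = "\n".match?(/\W/)  # => true

# Nothing left to match: the remaining text no longer contains a "#".
second_match = leftover.match(/#(\w*)(\W*)/)  # => nil
```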

  3. Are there any other rules I might need to properly capture any internal link of a document? Are internal links allowed to include symbols and other weird characters that these regexes are not capturing?

Yes. Although the hash (#) is the only way to indicate an internal link (a.k.a. anchor link or hash link), there are lots of ways to create the link, and it may not appear in the HTML itself. Possibilities include a fully qualified URL (http://example.com/foo/bar#baz), links built by JavaScript, and many other quirks. And, of course, you might have text that matches your regex (#2 pencil) but isn't a link. Covering all of these issues would make this answer far too long (and would make your question much too broad).
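For the fully qualified URL case, one option (not something your regexes need to do, just a sketch) is to let Ruby's standard uri library pull out the fragment once you already have the link's href value. The hrefs below are illustrative only; the first is the example URL from above.

```ruby
require "uri"

# Illustrative href values, assumed to have been extracted from <a> tags already.
full_href  = "http://example.com/foo/bar#baz"
local_href = "#services"

# URI#fragment returns the part after "#", or nil when there is none.
full_fragment  = URI.parse(full_href).fragment   # => "baz"
local_fragment = URI.parse(local_href).fragment  # => "services"
no_fragment    = URI.parse("http://example.com/foo/bar").fragment  # => nil
```

This only helps with values that really are URLs; free text such as "#2 pencil" still has to be excluded by looking at where the string came from.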
