I would like to be able to match the anchor part of an internal link on a page, i.e.:
`\W*` is greedy and matches the new-line character, causing the regex to wrap across lines and match the `#` at the start of the next line. This breaks the second potential match.
You can fix this by replacing `\W*` in your regex with `[^\w\n]*`, which matches non-word characters but stops at the end of the line.
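To see the effect of the replacement, here is a minimal Python sketch. The pattern `#(\w+)\W*` and the two-line sample are assumptions made up for illustration, not the regexes from the question; only the swap of `\W*` for `[^\w\n]*` comes from the answer.

```python
import re

# Hypothetical two-line sample: one anchor per line.
sample = "#intro\n#details\n"

# Hypothetical pattern ending in \W*, which also consumes "\n" and the next "#".
greedy = re.compile(r"#(\w+)\W*")

# Same pattern with the fix: non-word characters, but never a new-line.
fixed = re.compile(r"#(\w+)[^\w\n]*")

greedy_matches = greedy.findall(sample)
fixed_matches = fixed.findall(sample)

print(greedy_matches)  # the second anchor is lost
print(fixed_matches)   # both anchors are found
```

With the first pattern, the `#` of the second line has already been consumed by the first match, so the search never finds a second anchor.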
- Is there a difference between the two? If so, what is the difference?
The only difference is that the second regex uses capturing groups. Otherwise, they are the same.
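As a rough illustration, assuming two hypothetical patterns that differ only in a pair of parentheses (they are not the regexes from the question), the overall match is identical and the group only adds a way to pull out the part after the `#`:

```python
import re

sample = "see #intro and #details"

plain = re.compile(r"#\w+")      # hypothetical: no capturing group
grouped = re.compile(r"#(\w+)")  # hypothetical: same pattern, one capturing group

# group(0) is the whole match, so both patterns report the same text.
plain_text = [m.group(0) for m in plain.finditer(sample)]
grouped_text = [m.group(0) for m in grouped.finditer(sample)]

# group(1) exists only for the pattern with the capturing group.
captured = [m.group(1) for m in grouped.finditer(sample)]

print(plain_text, grouped_text, captured)
```

One practical caveat in Python: `re.findall` returns the captured groups instead of the whole match when the pattern contains groups, so the two patterns can look different through that function even though they match the same text.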
- Why is neither of those catching the 2nd string in my test sample from those links?
`\W*` matches any run of non-word characters, that is, characters in `[^a-zA-Z0-9_]`. This means it matches the new-line character `\n` and the `#` at the start of the next line. In other words, it "wraps" and prevents the regex from matching the second line.
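A quick way to check this yourself is to test individual characters against `\W` and then look at what a single match actually covers. This is only a sketch; the pattern `#\w+\W*` and the sample text are assumed for illustration.

```python
import re

# Test each character on its own against \W (one non-word character).
candidates = ["\n", "#", " ", "a", "_", "7"]
non_word = [ch for ch in candidates if re.fullmatch(r"\W", ch)]

# Hypothetical pattern and sample: the match runs past the end of line 1.
first = re.match(r"#\w+\W*", "#intro\n#details")
covered = first.group(0)

print(non_word)
print(repr(covered))
```

The new-line and the `#` are both non-word characters, so the first match ends only when it reaches the `d` of the second line.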
- Are there any other rules I might need to properly capture any internal link of a document? Are internal links allowed to include symbols and other weird characters that these regexes are not capturing?
Yes. Although the hash (`#`) is the only way to indicate an internal link (a/k/a anchor link or hash link), there are lots of ways to create the link, and it may not be in the HTML itself. There are lots of possibilities here, such as a fully qualified URL that ends in a hash fragment, or a hash in ordinary text (#2 pencil) that isn't a link. But trying to talk about all of these issues would make this answer far too long (and would make your question much too broad).
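If the input is HTML, one way to sidestep some of these cases is to read `href` attributes with a parser instead of running a regex over the raw text. The sketch below swaps the regex for Python's standard `html.parser` and `urllib.parse`; the class name `FragmentCollector`, the sample markup, and the `example.com` URL are all made up for illustration. It does not check whether a fully qualified URL points back to the same page, and it will not see links created by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class FragmentCollector(HTMLParser):
    """Collect the fragment (the part after '#') of every <a href> value."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                fragment = urlsplit(value).fragment
                if fragment:
                    self.fragments.append(fragment)


# Hypothetical page: a bare anchor link, a fully qualified URL with a
# fragment, and a "#" in ordinary text that is not a link at all.
html = (
    '<p><a href="#intro">Intro</a> '
    '<a href="https://example.com/page#details">Details</a> '
    "Bring a #2 pencil.</p>"
)

collector = FragmentCollector()
collector.feed(html)
fragments = collector.fragments

print(fragments)
```

Because only `href` values are inspected, the "#2 pencil" in the paragraph text is ignored, while both link styles yield their fragment.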