Mikusch - 2 months ago
HTML Question

Regex to match anything with href="" but between two other tags

I already have this Regex pattern that checks for every href="" in my document:

(href\s*=\s*(?:"|')(.*?)(?:"|'))


Now I want it to match all hrefs ONLY in between <a and </a> tags, with other parameters still allowed in between.

Do not match:

<base href="http://www.w3schools.com/images/" target="_blank">

<link rel="apple-touch-icon" sizes="57x57" href="/apple-icon-57x57.png">


Match:


<a href="http://www.w3schools.com/"></a>

<a class="re" href="http://www.w3schools.com/"></a>

<a href="http://www.w3schools.com/" class="re">This is a link</a>


Thanks in advance, I've not been able to solve this problem as of yet.

Answer

Note: HTML is not a regular language (parsing it properly needs a stack), so a regular expression can't do this 100% reliably. But a close approximation is:

<a\b[^>]*\shref="([^"]*)"
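
As a quick check, here is a minimal Python sketch that runs this pattern against the five sample lines from the question. The names `A_HREF`, `samples` and `find_links` are made up for the illustration, and Python is only an assumption, since the question does not say which regex engine is in use.

```python
import re

# First approximation from the answer: a double-quoted href inside an <a ...> tag.
A_HREF = re.compile(r'<a\b[^>]*\shref="([^"]*)"')

# The "do not match" and "match" examples from the question.
samples = [
    '<base href="http://www.w3schools.com/images/" target="_blank">',
    '<link rel="apple-touch-icon" sizes="57x57" href="/apple-icon-57x57.png">',
    '<a href="http://www.w3schools.com/"></a>',
    '<a class="re" href="http://www.w3schools.com/"></a>',
    '<a href="http://www.w3schools.com/" class="re">This is a link</a>',
]


def find_links(html):
    """Return every href value captured by the pattern (hypothetical helper)."""
    return A_HREF.findall(html)


results = [find_links(s) for s in samples]

for sample, found in zip(samples, results):
    print(found, "<-", sample)
```

The `<base>` and `<link>` lines yield an empty list because `<a\b` requires the tag name to be exactly `a`, while the three anchor lines each yield the URL.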

Or, if you use named subexpressions:

<a\b[^>]*\shref=(?P<QUOTE>["'])(?P<URL>.*?)(?P=QUOTE)

Which will also handle apostrophe-delimited attributes.
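
The `(?P<...>)` syntax is the one Python's `re` module uses, so a small Python sketch of the named version might look like this. `A_HREF_NAMED` and the sample markup are invented for the example.

```python
import re

# Named-group version: QUOTE remembers which quote opened the value,
# and (?P=QUOTE) requires the same character to close it.
A_HREF_NAMED = re.compile(
    r'''<a\b[^>]*\shref=(?P<QUOTE>["'])(?P<URL>.*?)(?P=QUOTE)'''
)

# Hypothetical input mixing both quoting styles.
html = """<a href='/single'>one</a> <a class="re" href="/double">two</a> <a href="/it's">three</a>"""

urls = [m.group("URL") for m in A_HREF_NAMED.finditer(html)]
print(urls)
```

Because the closing quote must equal the opening one, a value such as `/it's` inside double quotes is captured whole instead of being cut at the apostrophe.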

The last example can also be rewritten as:

<a\b[^>]*\shref=(["'])(.*?)(\1)

but remember that the URL is then in the second subexpression, not the first one (the first one only captures the quote character).
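
To illustrate the group numbering, here is a short Python sketch (the variable names and the sample string are assumptions for the example). Note that with several groups `findall` returns one tuple per match, so the URL is at index 1 of each tuple.

```python
import re

# Numbered-group version: group 1 = quote character, group 2 = URL,
# group 3 = the closing quote matched through the backreference \1.
A_HREF_NUMBERED = re.compile(r'''<a\b[^>]*\shref=(["'])(.*?)(\1)''')

html = '<a class="re" href="http://www.w3schools.com/">This is a link</a>'

match = A_HREF_NUMBERED.search(html)
quote_char = match.group(1)  # the delimiter, not the URL
url = match.group(2)         # the value you actually want

# findall yields tuples (quote, url, quote); pick the middle element.
all_urls = [groups[1] for groups in A_HREF_NUMBERED.findall(html)]
print(quote_char, url, all_urls)
```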

It wasn't clear whether you also want to grab the name of the link (the text between <a ...> and </a>), but if you do, whichever regex you choose, you can append a simple suffix to capture it. For example, for the named subexpressions:

<a\b[^>]*\shref=(?P<QUOTE>["'])(?P<URL>.*?)(?P=QUOTE)[^>]*>(?P<NAME>.*?)</a>
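
A hedged Python sketch of this last pattern follows. The `re.DOTALL` flag is an addition that is not part of the answer; it is assumed here so that link text spanning several lines is still captured. `A_LINK` and the sample markup are invented for the example.

```python
import re

# URL plus link text. re.DOTALL is an assumption added for this sketch:
# it lets NAME span line breaks inside the link.
A_LINK = re.compile(
    r'''<a\b[^>]*\shref=(?P<QUOTE>["'])(?P<URL>.*?)(?P=QUOTE)[^>]*>(?P<NAME>.*?)</a>''',
    re.DOTALL,
)

# Hypothetical input: one link with text, one empty link.
html = (
    '<p><a href="http://www.w3schools.com/" class="re">This is a link</a> '
    "and <a href='/empty'></a></p>"
)

links = [(m.group("URL"), m.group("NAME")) for m in A_LINK.finditer(html)]
print(links)
```

Any markup nested inside the link (for example an `<img>` or `<b>` tag) comes back verbatim in `NAME`, which is one of the places where the regex approach stops being reliable and an HTML parser is the safer choice.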