splattne - 4 months ago
HTML Question

Regular expression for extracting tag attributes

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+


which works for strings like

<a href="test.html" class="xyz">


and (single quotes)

<a href='test.html' class="xyz">


but not for strings without quotes:

<a href=test.html class=xyz>


How can I modify my regex to make it work with unquoted attributes? Or is there a better way to do that?

Thanks!

Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: sadly, I have to patch/modify code that I didn't write myself, and there is no time or money to rewrite it from the ground up.

Answer

If you have an element like

<name attribute=value attribute="value" attribute='value'>

this regex could be used to find each attribute name and value in turn:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Applied on:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'
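
To check the result above, here is a minimal sketch in Python. It assumes Python's re module treats this pattern the same way as the engine used in the question, which is never named (the (?<name>...) syntax suggests .NET). The helper name extract_attributes is hypothetical and only there for illustration.

```python
import re

# Pattern copied from the answer. The triple-quoted raw string lets it
# contain both quote characters without escaping.
ATTRIBUTE_PATTERN = re.compile(
    r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?'''
)


def extract_attributes(tag):
    """Return a list of (name, value) pairs found in one tag string."""
    # findall returns one tuple per match, one item per capture group:
    # group 1 is the attribute name, group 2 is the value without quotes.
    return ATTRIBUTE_PATTERN.findall(tag)


if __name__ == "__main__":
    samples = [
        '<a href=test.html class=xyz>',
        '<a href="test.html" class="xyz">',
        """<a href='test.html' class="xyz">""",
    ]
    for tag in samples:
        for name, value in extract_attributes(tag):
            print(f"{name!r} => {value!r}")
```

One thing to watch for: as written, the value part needs at least two characters (one or more inside the repeated group, plus the trailing dot), so an unquoted one-character value such as x=1 is not matched. If the markup can contain such values, the pattern would need adjusting.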