splattne splattne - 1 year ago 50
HTML Question

Regular expression for extracting tag attributes

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+

which works for strings like

<a href="test.html" class="xyz">

and (single quotes)

<a href='test.html' class="xyz">

but not for strings without quotes:

<a href=test.html class=xyz>

How can I modify my regex to make it work with unquoted attributes? Or is there a better way to do this?


Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: sadly, I have to patch code that I didn't write myself, and there is no time or money to rewrite it from the ground up.

Answer Source

If you have an element like

<name attribute=value attribute="value" attribute='value'>

this regex could be used to find each attribute name and value in turn
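
A minimal sketch of one such pattern, written in Python, is given here. It is not necessarily the exact expression the answer had in mind. The names `ATTR_RE` and `extract_attributes` are made up for illustration, and because Python's `re` module does not allow a group name to be reused, the three value forms get separate group names (`dq`, `sq`, `bare`) instead of a single `value` group.

```python
import re

# Hypothetical pattern: one attribute per match, value in one of three forms.
ATTR_RE = re.compile(
    r"""
    (?P<name>[^\s"'<>/=]+)          # attribute name
    \s*=\s*                         # equals sign, optional whitespace
    (?:
        "(?P<dq>[^"]*)"             # double-quoted value
      | '(?P<sq>[^']*)'             # single-quoted value
      | (?P<bare>[^\s"'=<>`]+)      # unquoted value
    )
    """,
    re.VERBOSE,
)


def extract_attributes(tag):
    """Return a dict mapping attribute names to values for one start tag."""
    attrs = {}
    for m in ATTR_RE.finditer(tag):
        # Exactly one of the three value groups took part in the match.
        value = m.group("dq")
        if value is None:
            value = m.group("sq")
        if value is None:
            value = m.group("bare")
        attrs[m.group("name")] = value
    return attrs


if __name__ == "__main__":
    for tag in (
        "<a href=test.html class=xyz>",
        '<a href="test.html" class="xyz">',
        "<a href='test.html' class=\"xyz\">",
    ):
        print(extract_attributes(tag))
```

The unquoted branch stops at whitespace or `>`, which is why `class=xyz>` yields `xyz` and not `xyz>`. This sketch assumes it is handed a single start tag; it makes no attempt to cope with comments, scripts, or malformed markup.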


Applied on:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'
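
On the "is there a better way" part of the question: where an HTML parser is available, it usually copes with quoting variations more reliably than a regex. As a rough illustration only, here is how this could look with `html.parser` from Python's standard library. The class name `AttrCollector` and the function `attributes_of` are hypothetical, and the code being patched in the question may well be in another language with its own parser.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag name, attribute dict) pairs for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples, quotes removed.
        self.tags.append((tag, dict(attrs)))


def attributes_of(html):
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(attributes_of("<a href=test.html class=xyz>"))
```

The parser handles quoted and unquoted values alike, so no special casing is needed for the three forms shown above.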