Tom Tom - 8 months ago 14
HTML Question

Seeking regex for HTML attributes meeting specific criteria

I'm trying to remove single quotes and double quotes around HTML attributes with the following restrictions:

1) The quoted material MUST exist within a tag (e.g., <mytag b="yes"> becomes <mytag b=yes>, but <script>var b="yes"</script> stays intact).

2) The quoted material must contain neither a space character nor an equals sign (e.g., <mytag b="no no" c="no=no"> stays intact).

3) The quoted material may not be in an href or src attribute

4) The regex should be good for UTF-8 (duh!)

Someone posted a virtually identical question here that received an answer that works within the confines of the question:

Removing single and double quote from html attributes with no white spaces on all attributes except href and src


That answer uses ((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\'), except it fails to confine itself to attributes inside tags (i.e., text in between opening and closing tags is erroneously edited, e.g., <mytag>"The quotes are stripped out here!"</mytag>), and it doesn't check for equals signs (=) within the quoted text (e.g., <mytag b="OhNo=TheRoutineRemovedTheQuotesBecauseItDidNotCheckForAnEqualSignInTheQuotedText!">).
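The failures are easy to reproduce. A minimal sketch, assuming the pattern is wrapped in ~ delimiters and replaced with '$1$5' (the replacement string is not quoted above, so that part is an assumption):

```php
<?php
// Pattern from the linked answer, wrapped in ~ delimiters.
// Assumption: the replacement keeps group 1 (the "name=" part) and group 5 (the bare value).
$old = '~((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')~';

// Text between <script> tags is not an attribute, yet it gets edited.
$inScript = preg_replace($old, '$1$5', '<script>var b="yes"</script>');

// A value containing "=" loses its quotes as well.
$withEquals = preg_replace($old, '$1$5', '<mytag b="no=no">');

echo $inScript, "\n";   // <script>var b=yes</script>
echo $withEquals, "\n"; // <mytag b=no=no>
```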

Bonus points: I wish to integrate this into this PHP HTML minification routine, which works well except for the edits described above:

His solution pairs the patterns and replacement params in two arrays, as you'll see, so I need to conform to his syntax, which uses
, etc.

Your solution gets my upvote!


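Here is a pure regex way of getting rid of the quotes:

(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)')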


See the regex demo, replace with '$1'.

IDEONE demo:

// \K discards everything matched before the quoted value, so only the
// quoted value itself is replaced by the contents of Group 1.
$re = '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';
$str = "<mytag src=\"src_here\" b=\"yes\" href=\"href_here\"> becomes <mytag src=\"src_here\" b=yes href=\"href_here\">\n<mytag b='yes'> becomes <mytag b=yes>\nbut <script>var b=\"yes\"</script> stays intact\n<mytag b=\"no no\" c=\"no=no\"> stays intact\n<tag href=\"something\"> text <tag src=\"dddd\"> intact";
$subst = '$1'; // Group 1 holds the attribute value without its quotes
$result = preg_replace($re, $subst, $str);
echo $result;

Pattern details:

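  • (?:<\w+|(?!^)\G) - match the start of an opening tag (<\w+) or (|) the end of the previous successful match ((?!^)\G), so that several attributes in one tag can be processed in turn
  • (?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))* - match zero or more href and src attributes, which must stay untouched, so that \K can skip them later (the \' is just PHP single-quoted-string escaping for ')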
  • \s+ - match 1+ whitespace(s)
  • (?!(?:href|src)=)\w+= - 1+ alphanumeric or underscore characters (\w+) followed by =, as long as they are not href= or src= (see the (?!(?:href|src)=) negative lookahead)
  • \K - omit the whole text matched so far
  • (?|"([^\s"=]*)"|\'([^\s\'=]*)\') - a branch reset group capturing into Group 1 either:
    • "([^\s"=]*)" - a double-quoted attribute value containing no =, " or whitespace
    • | - or
    • \'([^\s\'=]*)\' - a single-quoted attribute value containing no =, ' or whitespace
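As for the bonus part, preg_replace() accepts paired arrays of patterns and replacements and applies them in order, so the pattern above can be appended as one more pair. The sketch below is only an illustration: the minify_sketch() function and the first (whitespace) rule are hypothetical stand-ins for the minification routine, which is not shown here.

```php
<?php
// Hypothetical stand-in for a minifier that passes paired arrays to preg_replace().
// Only the last pattern/replacement pair is the regex from this answer.
$patterns = [
    '~>\s+<~', // placeholder rule: drop whitespace between adjacent tags
    '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u',
];
$replacements = [
    '><',
    '$1', // Group 1 is the attribute value without its quotes
];

function minify_sketch(string $html, array $patterns, array $replacements): string
{
    // Patterns run one after another, each on the output of the previous one.
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace($patterns, $replacements, $html) ?? $html;
}

$in = '<div id="main">  <a href="x.html" rel=\'nofollow\'>go</a></div>';
echo minify_sketch($in, $patterns, $replacements), "\n";
// <div id=main><a href="x.html" rel=nofollow>go</a></div>
```

The href value keeps its quotes, while id and rel lose theirs. Because each pair is applied to the result of the previous one, the position of the new pair in the arrays matters if other rules also touch attribute values.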