robsch robsch - 20 days ago 5
HTML Question

Find the inner-most text in HTML

What would be a regular expression in PHP to find the inner-most text of an HTML string? The tree of HTML elements has exactly one leaf, so it is only ever a single chain of nested branches.

Examples where the result is XXX (this is not a single string with newlines; the regex would be executed per line):

<a>XXX</a>
<a some-attr="bla" some-attr2="bla2"><b>XXX</b></a>
<a> bla <b>XXX</b></a>


These cases do not need to be handled:

<a>XXX</a><a>XXX</a>
<a><</a>
<a>></a>


I would think that it should be something like
>(.*?)<
but all characters before and after it would have to be ignored.




Updated to allow an enhanced answer of Wiktor Stribiżew:

An additional task is to replace the found string with another one in PHP. This might require a different pattern than just finding and extracting the inner-most string; I am not sure.

Answer

You seem to know about the issues that you may run into when using regex with HTML, so please take the regex answer as a learning exercise and use DOM parsing in production if you have to handle arbitrary HTML code.
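For reference, the DOM route could look roughly like the sketch below. It is not part of the original answer: the helper name innermostText is hypothetical, and it assumes the input is a single chain of elements with exactly one leaf, as described in the question. It relies on PHP's bundled DOMDocument and DOMXPath classes, and on the parser adding an implied body element around the fragment.

```php
<?php
// Hypothetical helper, not from the original answer.
// Assumes the markup is a single chain of elements with exactly one leaf.
function innermostText(string $html): ?string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Inside the implied <body>, select elements that have no element
    // children, i.e. the leaves of the tree.
    $xpath = new DOMXPath($doc);
    $leaves = $xpath->query('//body//*[not(*)]');

    return $leaves->length > 0 ? $leaves->item(0)->textContent : null;
}

$lines = [
    '<a>XXX</a>',
    '<a some-attr="bla" some-attr2="bla2"><b>XXX</b></a>',
    '<a> bla <b>XXX</b></a>',
];
foreach ($lines as $line) {
    echo innermostText($line), PHP_EOL;
}
```

Each of the three sample lines yields XXX. Unlike the regex, this does not depend on how the tags or attributes are written, only on the shape of the parsed tree.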

IMHO, if you know what you are doing, that is, you are in full control of the generated HTML and you know all < are serialized as HTML entities and all tags consist of alphanumeric/underscore chars, you may use a regex for this:

$html = <<<DATA
<a>XXX</a>
<a some-attr="bla" some-attr2="bla2"><b>XXX</b></a>
<a>   bla   <b>XXX</b></a>
DATA;
echo preg_replace('~(<(\w+)[^<]*?>)[^<]*(<\/\2>)~', '$1YYY$3', $html);

See the PHP demo and a regex demo.

As a result, all text inside tags that contain no nested tags is replaced with YYY:

<a>YYY</a>
<a some-attr="bla" some-attr2="bla2"><b>YYY</b></a>
<a>   bla   <b>YYY</b></a>

Details:

  • (<(\w+)[^<]*?>) - Group 1, capturing the whole opening tag: <, then 1 or more word chars captured into Group 2 (a technical group that lets us match the same tag name in the closing tag), then any 0+ chars other than <, as few as possible (with a negated character class [^<] and the lazy quantifier *?), and a >
  • [^<]* - the text contents: zero or more characters other than <, as many as possible
  • (<\/\2>) - Group 3: <, /, the same text as in Group 2 (the tag name) and a >.

In the replacement, we just use $1 and $3 backreferences to Group 1 and 3 to reinsert the text captured into those groups, and add the YYY replacement text.
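If you only need to find the inner-most text rather than replace it, a variant of the same pattern with the text in its own capturing group should work under the same assumptions. This is a sketch, not part of the original answer; the function name innermostTextByRegex is hypothetical, and the group numbering differs from the pattern above (Group 1 is the tag name, Group 2 is the text).

```php
<?php
// Hypothetical helper, not from the original answer.
// Same idea as the replacement pattern, but the text is captured instead:
//   Group 1 = tag name (reused by the \1 backreference in the closing tag)
//   Group 2 = text that contains no further tags
function innermostTextByRegex(string $line): ?string
{
    if (preg_match('~<(\w+)[^<]*?>([^<]*)</\1>~', $line, $m)) {
        return $m[2];
    }
    return null; // no element without nested tags was found
}

$lines = [
    '<a>XXX</a>',
    '<a some-attr="bla" some-attr2="bla2"><b>XXX</b></a>',
    '<a> bla <b>XXX</b></a>',
];
foreach ($lines as $line) {
    echo innermostTextByRegex($line), PHP_EOL;
}
```

For the three sample lines this prints XXX each time. The same caveats apply as for the replacement: it assumes no literal < inside text and no > inside attribute values.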
