Jeroen de Lau - 2 months ago
PHP Question

Find Chinese text in HTML using preg_match

I'm attempting to get the text string from a string of HTML.
I would like to capture only the text between tags and skip over any empty tags.

My current attempt can be found here:

https://regex101.com/r/3Ujmw6/2


  • I can't use \w since I need to capture Chinese characters

  • I would like only text and not a lot of empty results



I have tried:

/>(\X+?)</g

//It fails on nested tags: it captures the first nested tag as well
<p><strong>blablab</strong></p>


And this:

/>(\X*?)</g

//Finds all the strings, but also includes lots of empty strings
//for adjacent tags ><


Is there any way to exclude < from \X? Or is there a better way to write this so it returns only the text parts?

Answer

Try a regex like

>(\s*[^\s<][^<]*)

This simply matches all text between > and < that isn't all whitespace. See https://regex101.com/r/3Ujmw6/4.
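
For reference, here is a minimal sketch of how that pattern might be used from PHP with preg_match_all. The sample markup in $html is made up for illustration (it is not the input from the regex101 link), and the /u modifier is my own addition, on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical sample input, including nested tags, an empty tag,
// and a whitespace-only tag.
$html = '<div><p><strong>你好世界</strong></p><p></p><span> </span>'
      . '<p>Hello <em>中文</em> text</p></div>';

// Same pattern as above; /u (an assumption) treats pattern and subject as UTF-8.
$pattern = '/>(\s*[^\s<][^<]*)/u';

// Collect every match; $matches[1] holds the contents of capture group 1.
preg_match_all($pattern, $html, $matches);

// Trim leading/trailing whitespace from each captured text run.
$texts = array_map('trim', $matches[1]);

print_r($texts);
```

With this sample input, $texts should contain the four strings 你好世界, Hello, 中文 and text, with nothing captured for the empty and whitespace-only tags. Note that the pattern assumes no literal > appears inside attribute values or text.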
