user3257966 - 20 days ago 5
PHP Question

Get all words from text containing html tags with php regex

I am trying to use PHP to extract all the words from a text that contains HTML tags.

My regex has a problem: if a word ends with an accented character ("é", for example), the word is not fully matched.

My regex is

// The "/" inside the character class is escaped because "/" is also the delimiter.
$re = '/([^\r\n\t\f>< \/]+(?!>))\b/';
$str = 'Non ! Non ! Je ne veux pas d\'un éléphant dans un boa.<br>
<p> Un boa c\'est très dangereux, et un éléphant élévé c\'est très encombrant. Chez moi c\'est tout petit. J\'ai besoin d\'un mouton. Dessine-moi un mouton.
</p>
-Laisse-moi dire mouton... For saints have hands that pilgrims\' hands do touch


';

preg_match_all($re, $str, $matches);

// but the word "élévé" is not completely matched
print_r($matches);


But in my example, the word "élévé" is not fully matched.

Please find an example here :
regex live example

Why does this regular expression not match the last character with accents?

Answer
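
As for why the final "é" is dropped, here is a minimal sketch rather than a full diagnosis. It uses a simplified stand-in for the question's pattern (a run of non-space characters followed by \b) and assumes UTF-8 input and the default C locale. Without the u flag, PCRE works on bytes, and the two bytes of "é" are not word characters, so \b cannot match right after a trailing "é" and the engine backtracks to the previous ASCII letter.

```php
<?php
// Assumption: this file and the sample text are UTF-8 encoded.
$text = 'un éléphant élévé ici';

// Byte mode (no u flag): "é" is two bytes, neither of which is a word
// character, so \b fails after a trailing "é" and the match backtracks.
preg_match_all('/[^ ]+\b/', $text, $byteMode);

// UTF-8 mode (u flag): "é" is a single letter, so \b matches after it.
preg_match_all('/[^ ]+\b/u', $text, $utf8Mode);

print_r($byteMode[0]); // the third entry typically stops before the last "é"
print_r($utf8Mode[0]); // the third entry is the complete word
```

The variable names and the sample sentence are illustrative only.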

If you want to use a regex, you could use:

<[^>]+>(*SKIP)(*FAIL)|([A-zÀ-ÿ]+)

Working demo

Note that a range in a character class follows character-code order. I used the simplest ranges, but bear in mind that they contain symbols you might not want: A-z also covers [, \, ], ^, _ and `, and À-ÿ also covers × and ÷. If you want to support only specific characters, check a character table and use the ranges you need.
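
To make the caveat about ranges concrete, the following sketch checks a few of the extra characters that fall inside A-z and À-ÿ. It assumes a UTF-8 source file and uses the u flag so that À-ÿ is read as code points. The \p{L} class at the end is one possible stricter alternative and is not part of the patterns above.

```php
<?php
// Ranges follow code-point order, so A-z also spans the six symbols
// that sit between "Z" and "a" in ASCII.
$inAZ = [];
foreach (['[', '\\', ']', '^', '_', '`'] as $ch) {
    $inAZ[$ch] = preg_match('/^[A-z]$/', $ch);
}

// With the u flag, À-ÿ spans U+00C0..U+00FF, which includes × and ÷.
$times  = preg_match('/^[À-ÿ]$/u', '×');
$divide = preg_match('/^[À-ÿ]$/u', '÷');

// Possible stricter alternative (an assumption, not from the answer):
// \p{L} matches letters only, so the multiplication sign is rejected.
$letterTimes = preg_match('/^\p{L}$/u', '×');
$letterE     = preg_match('/^\p{L}$/u', 'é');

var_dump($inAZ, $times, $divide, $letterTimes, $letterE);
```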

Additionally, if you want to capture c'est as a single word, just add the single quote to the character class, like this:

<[^>]+>(*SKIP)(*FAIL)|([A-zÀ-ÿ']+)

Edit: bobble bubble's comment shows a very useful application of the unicode flag. Quoting his comment, you can use a much simpler regex by leveraging the u (unicode) flag, like this:

<[^>]+>(*SKIP)(*FAIL)|([\w']+)

Working demo
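
For reference, here is a short sketch of how this pattern might be used from PHP. The ~ delimiter, the variable names and the sample string are choices made for this example, and the input is assumed to be UTF-8. The same pattern is also run without the u flag to show that the flag is what lets \w match accented letters.

```php
<?php
// Assumption: this file and the input string are UTF-8 encoded.
$html = "un éléphant élévé<br><p>c'est très encombrant</p>";

// Tags are matched first and discarded via (*SKIP)(*FAIL);
// everything else is scanned for runs of word characters and quotes.
$pattern = "~<[^>]+>(*SKIP)(*FAIL)|([\w']+)~u";
preg_match_all($pattern, $html, $withFlag);

// Same pattern without the u flag, for comparison only.
preg_match_all("~<[^>]+>(*SKIP)(*FAIL)|([\w']+)~", $html, $withoutFlag);

print_r($withFlag[1]);    // whole words, tags excluded
print_r($withoutFlag[1]); // accented words typically come out in pieces
```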

If you want hyphenated words like Dessine-moi to be matched as a single word instead of two, just add the hyphen to the character class, like this:

<[^>]+>(*SKIP)(*FAIL)|([\w'-]+)

Edit 2: since you edited your question a second time and also commented that you don't want the leading hyphen, you can use this regex:

<[^>]+>(*SKIP)(*FAIL)|([\w']+(?:[\w'-]*))

Working demo
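
As a quick check of the hyphen handling, the sketch below runs this last pattern with the u flag on a shortened sample. The sample text and variable names are made up for the example, and UTF-8 input is assumed. A match has to start with a word character or a quote, so the leading hyphen of -Laisse-moi is left out while the inner hyphen is kept.

```php
<?php
// Assumption: UTF-8 input; the sample text is shortened from the question.
$text = "-Laisse-moi dire mouton...<br>Dessine-moi un mouton.";

// The first class has no hyphen, so a word cannot start with "-";
// the second class allows hyphens inside the word.
$pattern = "~<[^>]+>(*SKIP)(*FAIL)|([\w']+(?:[\w'-]*))~u";
preg_match_all($pattern, $text, $m);

print_r($m[1]);
```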