Ruslan Bes - 5 months ago
PHP Question

Converting from HTML entities to UTF-8

I have a problem converting some encoded strings to UTF-8.

I have a list of strings which, according to the documentation, are Unicode strings encoded using numeric HTML entities. Some of them are:

$str = 'W&#xc3;&#x96;GER'; // seems to be WÖGER
$str = 'J&#xc3;&#xbc;rgen'; // seems to be Jürgen
$str = 'PO&#xc3;&#x9f;NITZ'; // seems to be POßNITZ
$str = 'SCHL&#xc3;&#x84;GER'; // seems to be SCHLÄGER


I want to decode them and convert them to UTF-8.

I tried both mb_convert_encoding() with the HTML-ENTITIES parameter and html_entity_decode(). Unexpectedly, my best result was with:

html_entity_decode($str, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');


and that decoded Jürgen successfully. However, I have had no luck decoding the other strings in this list. I looked at the ISO-8859-1 encoding table, and the HTML codes for umlauts there differ from the ones in my list.

My question is: am I missing some obvious decoding step or is there something wrong with the source strings?

Update (2016-06-27): The original strings were indeed incorrectly encoded. They are the result of reading UTF-8 values in a Latin-1 context and then encoding the individual 1-byte characters as hex entities, so the German umlaut ü became Ã¼ and was encoded as two separate characters (&#xc3;&#xbc;). The accepted answer decodes them straight into UTF-8 successfully.
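
To illustrate how such strings can arise, here is a minimal sketch that reproduces the faulty encoding described in the update. The function name encodeBytesAsHexEntities is hypothetical and not part of the original system; the sketch assumes PHP 7 or later and simply writes every non-ASCII byte of a UTF-8 string as its own hex entity.

```php
<?php
// Hypothetical helper (not from the original system): reproduces the broken
// encoding by treating every byte of a UTF-8 string as a separate character.
function encodeBytesAsHexEntities(string $utf8): string
{
    $out = '';
    foreach (str_split($utf8) as $byte) {
        $code = ord($byte);
        // ASCII bytes stay as they are; every other byte becomes a hex entity.
        $out .= $code < 0x80 ? $byte : sprintf('&#x%02x;', $code);
    }
    return $out;
}

// "ü" is the two bytes C3 BC in UTF-8, so it turns into two entities.
echo encodeBytesAsHexEntities("J\xC3\xBCrgen"), "\n"; // prints J&#xc3;&#xbc;rgen
```

Decoding therefore has to reverse this byte by byte instead of treating each entity as a Unicode code point.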

nj_
Answer

My understanding, though I might be wrong, is that Unicode characters should be represented by their code point rather than by encoding the individual UTF-8 bytes, which is what you have. So Ö would be better represented as &#xD6; or, in the named form, &Ouml;.
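
As a small check of the code point form, the sketch below decodes both the numeric and the named entity for Ö into UTF-8. The variable names are illustrative only; the flags and the 'UTF-8' charset argument are one reasonable choice, not the only one.

```php
<?php
// Both entities refer to the same code point, U+00D6 (Ö).
$fromNumeric = html_entity_decode('&#xD6;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
$fromNamed   = html_entity_decode('&Ouml;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

var_dump($fromNumeric === $fromNamed); // bool(true)
echo bin2hex($fromNumeric), "\n";      // prints c396, the UTF-8 bytes of Ö
```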

The ENT_XML1 flag to html_entity_decode does seem to make this work, though I'm not entirely sure what it does under the hood. If you want something more explicit:

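// Turn each two-digit hex entity back into the raw byte it stands for;
// the resulting byte sequence is already valid UTF-8.
$utf8 = preg_replace_callback('/&#x([A-Fa-f0-9]{2});/', function ($m) {
    return chr(hexdec($m[1]));
}, $str);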
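
For completeness, here is a self-contained sketch that wraps the callback approach in a function and runs it over the sample names. The function name decodeByteEntities and the UTF-8 validity check are my additions, not part of the original answer, and the sample strings assume the byte-wise entity form described in the question's update (for example J&#xc3;&#xbc;rgen). It assumes PHP 7 or later.

```php
<?php
// Hypothetical wrapper around the callback approach; the name is illustrative.
function decodeByteEntities(string $str): string
{
    // Each two-digit hex entity stands for one raw byte of the UTF-8 text.
    $bytes = preg_replace_callback('/&#x([A-Fa-f0-9]{2});/', function ($m) {
        return chr(hexdec($m[1]));
    }, $str);

    // Added guard: an empty pattern with the /u modifier only matches
    // when the subject is valid UTF-8.
    if (preg_match('//u', $bytes) !== 1) {
        throw new InvalidArgumentException('Decoded bytes are not valid UTF-8');
    }
    return $bytes;
}

$samples = [
    'W&#xc3;&#x96;GER',
    'J&#xc3;&#xbc;rgen',
    'PO&#xc3;&#x9f;NITZ',
    'SCHL&#xc3;&#x84;GER',
];
foreach ($samples as $sample) {
    echo decodeByteEntities($sample), "\n";
}
```

On a UTF-8 terminal this should print the four names with their umlauts and the ß intact. The guard is there because a string that mixes real code point entities (such as &#xD6;) with byte-wise ones would produce invalid UTF-8, and failing loudly seems safer than passing that on.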