karacsi_maci - 8 months ago
PHP Question

PHP function mb_detect_encoding strict mode

In the function mb_detect_encoding there is a parameter for strict mode.

The first, most upvoted comment on its manual page gives this example:

$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

This is indeed what happens. But can anybody explain why?


Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.

It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
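To make the truncated-sequence case concrete, here is a minimal sketch. The variable names are my own, and I only state the strict-mode results, which follow from the description above; the non-strict result may depend on the PHP version, so it is dumped without a claimed value.

```php
<?php
// "café" in UTF-8 is "caf\xc3\xa9". Dropping the final byte leaves a lone
// lead byte (0xc3) at the end of the string.
$complete  = "caf\xc3\xa9";
$truncated = "caf\xc3";

// Strict mode: the string as a whole must be valid UTF-8.
$strictComplete  = mb_detect_encoding($complete, 'UTF-8', true);  // "UTF-8"
$strictTruncated = mb_detect_encoding($truncated, 'UTF-8', true); // false

// Non-strict mode: an incomplete trailing sequence is meant to be tolerated.
// No expected value is given here because behaviour may vary between versions.
$looseTruncated = mb_detect_encoding($truncated, 'UTF-8');

var_dump($strictComplete, $strictTruncated, $looseTruncated);
```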

However, there seems to be a bug* where, in some circumstances, non-strict mode checks only the first byte of the string.


The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string mb_detect_encoding() properly returns false for it regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that can occur in a valid UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte appears later in the string.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
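If the goal is simply to know whether a string is valid UTF-8, one way to sidestep the non-strict behaviour is to validate the whole string with mb_check_encoding(). This is only a sketch: is_valid_utf8() is a hypothetical helper name of mine, not something PHP provides.

```php
<?php
// Hypothetical helper (not part of PHP): returns true only if the whole
// string is valid UTF-8, by delegating to mb_check_encoding().
function is_valid_utf8(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

// 'áéóú' as raw ISO-8859-1 bytes, written with escapes so the result does
// not depend on how this source file itself is encoded.
$latin1 = "\xe1\xe9\xf3\xfa";

var_dump(is_valid_utf8($latin1));   // bool(false)
var_dump(is_valid_utf8("foo\xf8")); // bool(false)

// After converting from ISO-8859-1, the same text is valid UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(is_valid_utf8($utf8));     // bool(true)
```

Passing true as the third argument to mb_detect_encoding() gives the same whole-string check when you need detection among several candidate encodings rather than a yes/no answer.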

*I've opened a report for this at https://bugs.php.net/bug.php?id=72933