karacsi_maci - 3 months ago
PHP Question

PHP function mb_detect_encoding strict mode

In the function mb_detect_encoding there is a parameter for strict mode.

In the first, most upvoted comment:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false


This is true, yes. But can anybody explain why?

Answer

Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.


It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
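To make that intended distinction concrete, here is a minimal sketch. The function classify_utf8() is hypothetical and is not part of mbstring; it is also simplified, since it ignores the finer UTF-8 rules about overlong forms and surrogates. Under my reading, strict mode would accept only 'valid', while non-strict mode would also accept 'truncated'.

```php
<?php
// Hypothetical helper, NOT part of mbstring. Simplified: it does not reject
// overlong forms or surrogates.
//   'valid'     - well-formed from start to end
//   'truncated' - well-formed so far, but ends inside a multi-byte character
//   'invalid'   - contains a byte that cannot appear at that position
function classify_utf8(string $s): string
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $need = 0; // ASCII, no continuation bytes
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1; // lead byte of a two-byte character
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2; // lead byte of a three-byte character
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3; // lead byte of a four-byte character
        } else {
            return 'invalid'; // 0x80-0xC1 and 0xF5-0xFF cannot start a character
        }
        for ($k = 1; $k <= $need; $k++) {
            if ($i + $k >= $len) {
                return 'truncated'; // string ended before the character did
            }
            $c = ord($s[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return 'invalid'; // expected a continuation byte (0x80-0xBF)
            }
        }
        $i += $need + 1;
    }
    return 'valid';
}

var_dump(classify_utf8("foo"));     // string(5) "valid"
var_dump(classify_utf8("foo\xc3")); // string(9) "truncated"
var_dump(classify_utf8("foo\xf8")); // string(7) "invalid"
```

The second call is the case described above: "\xc3" is a legitimate first byte of a two-byte character, but the string ends before the character is complete.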

However, there seems to be a bug* where, in some circumstances, non-strict mode checks only the first byte of the string.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string mb_detect_encoding() properly returns false for it regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that can occur somewhere in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8 in non-strict mode.
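For anyone who wants to check this without relying on how the source file is saved, here is a small sketch that writes the question's string as explicit ISO-8859-1 bytes. The variable names are mine. I have left out the non-strict call on purpose, since its result is the behaviour in question and may differ between PHP versions.

```php
<?php
// The question's string as explicit ISO-8859-1 bytes, so the result does
// not depend on the encoding of this source file.
$latin1 = "\xe1\xe9\xf3\xfa"; // 'áéóú' in ISO-8859-1

// 0xE1 opens a three-byte UTF-8 character, so a continuation byte
// (0x80-0xBF) has to follow. 0xE9 is not one.
$second = ord($latin1[1]);
var_dump($second >= 0x80 && $second <= 0xBF); // bool(false)

// mb_check_encoding() validates the whole string against one encoding.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Strict detection, as in the question.
var_dump(mb_detect_encoding($latin1, 'UTF-8', true)); // bool(false)
```

If all you need is a yes/no answer to "is this valid UTF-8?", mb_check_encoding() or strict mode seems the safer choice.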


*I've opened a report for this at https://bugs.php.net/bug.php?id=72933
