Gary Willoughby Gary Willoughby - 6 months ago 10
PHP Question

Testing for UTF8 strings in PHP, is this a reliable method?

I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding() but it was incorrcted when reading from a ISO8859-1 file on a Linux os.

This function seems to work in all cases I tested.

Here is the question: Get file encoding

Here is the function:

function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
    )+%xs', $string);
}


Is this a reliable way of detecting UTF8 strings?
What exactly is it doing?
Can it be made more robust?

Answer

If you do not know the encoding of a string, it is impossible to guess the encoding with any degree of accuracy. That's why mb_detect_encoding simply does not work. If however you know what encoding a string should be in, you can check if it is a valid string in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it'll be impossible to distinguish any single-byte encoding using all 8 bits from any other single-byte encoding using 8 bits. But UTF-8 should be rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.

In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.