Michas - 1 month ago
PHP Question

Sanitise UTF-8 in PHP

The function json_encode requires a valid UTF-8 string. I have a string that may be in a different encoding. I need to ignore or substitute all invalid characters so that the string can be converted to JSON.

  1. It should be something very simple and robust.

  2. The error occurs in a module used for manual checking, so mojibake is fine.

  3. The code responsible for fixing the encoding is in a different module. (It was broken, though.) I don't want to duplicate responsibility.

Hex of an example invalid string: 496e76616c6964206d61726b2096

My current solution:

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = @\iconv('UTF-8', 'UTF-8//IGNORE', $raw_str);

The three problems with my code:

  1. iconv looks a little too heavy.

  2. Many programmers don't like @.

  3. iconv may ignore too much: the whole string.

Any better idea?

There is a similar question, but I don't care about conversion:
Ensuring valid utf-8 in PHP


I think this is the best solution.

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
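
To check that this works for the JSON case from the question, here is a minimal sketch. It assumes the mbstring extension is loaded, and the helper name sanitize_utf8 is made up for illustration (it is not from the original post). Converting from UTF-8 to UTF-8 replaces each invalid byte sequence with the current mb_substitute_character() setting, so no @ is needed and the valid part of the string is kept.

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the original post.
function sanitize_utf8(string $raw): string
{
    // Already valid UTF-8: return the input untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // UTF-8 -> UTF-8 conversion substitutes invalid sequences using the
    // current mb_substitute_character() setting instead of failing.
    return mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
}

// "Invalid mark " followed by the stray byte 0x96.
$raw_str  = hex2bin('496e76616c6964206d61726b2096');
$sane_str = sanitize_utf8($raw_str);

var_dump(json_encode($raw_str) === false);        // the raw string is rejected
var_dump(mb_check_encoding($sane_str, 'UTF-8'));  // the sanitised string is valid
var_dump(json_encode($sane_str) !== false);       // and it can now be encoded
```

If the string is only ever passed to json_encode, PHP 7.2 and later also accept the JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE flags, which handle invalid bytes during encoding without a separate sanitising step.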