Michas Michas - 2 months ago 17
PHP Question

Sanitise UTF-8 in PHP

The function json_encode requires a valid UTF-8 string. I have a string that may be in a different encoding. I need to ignore or substitute all invalid characters so that it can be converted to JSON.


  1. It should be something very simple and robust.

  2. The error is in a module for manual checking, so mojibake is fine.

  3. The code responsible for fixing the encoding is in a different module. (It was broken, though.) I don't want to duplicate responsibility.



Hex of an example invalid string:
496e76616c6964206d61726b2096
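To make the failure concrete, here is a minimal sketch of what happens when the example bytes are passed to json_encode unchanged. It assumes the mbstring extension is loaded (for mb_check_encoding); the variable names are illustrative only.

```php
<?php
// Decode the example bytes from the question. The final byte 0x96 is a lone
// continuation byte, so the string is not valid UTF-8.
$raw_str = hex2bin('496e76616c6964206d61726b2096');

// mb_check_encoding() reports whether the bytes form valid UTF-8.
$is_valid = mb_check_encoding($raw_str, 'UTF-8');

// Without special flags, json_encode() returns false on malformed UTF-8
// and json_last_error() reports JSON_ERROR_UTF8.
$json  = json_encode($raw_str);
$error = json_last_error();

var_dump($is_valid);                    // bool(false)
var_dump($json);                        // bool(false)
var_dump($error === JSON_ERROR_UTF8);   // bool(true)
```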


My current solution:

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = @\iconv('UTF-8', 'UTF-8//IGNORE', $raw_str);


The three problems with my code:


  1. iconv looks a little too heavy.

  2. Many programmers don't like @.

  3. iconv may ignore too much: the whole string.



Any better ideas?

There is a similar question, but I don't care about conversion:
Ensuring valid utf-8 in PHP

Answer

I think this is the best solution.

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
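As a rough end-to-end check of the approach above, the sketch below sanitises the example string and then passes it to json_encode. It assumes the mbstring extension is available. The replacement for invalid bytes is whatever mbstring's substitute character is set to (configurable with mb_substitute_character), so the exact output may differ between setups.

```php
<?php
// Same invalid example string as in the question (trailing byte 0x96).
$raw_str = hex2bin('496e76616c6964206d61726b2096');

// Converting UTF-8 to UTF-8 makes mbstring replace every invalid byte
// sequence with its configured substitute character; valid text is kept.
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');

// The sanitised string is valid UTF-8, so json_encode() no longer fails.
$json = json_encode($sane_str);

if ($json === false) {
    // Not expected after sanitising; shown only for completeness.
    echo 'JSON error: ', json_last_error_msg(), PHP_EOL;
} else {
    echo $json, PHP_EOL;
}
```

Unlike the iconv call, this needs no @ to silence notices, and only the invalid bytes are replaced rather than the whole string being dropped.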