Pekka 웃 - 1 year ago
JSON Question

How to keep json_encode() from dropping strings with invalid characters

Is there a way to keep json_encode() from returning null for a string that contains an invalid (non-UTF-8) character?

It can be a pain in the ass to debug in a complex system. It would be much more useful to actually see the invalid character, or at least have it omitted. As it stands, json_encode() will silently drop the entire string.

Example (in UTF-8):

$string =
array(utf8_decode("Düsseldorf")); // Deliberately produce broken string


Results in


Desired result:


Note: I am not looking to make broken strings work in json_encode(). I am looking for ways to make it easier to diagnose encoding errors. A null string isn't helpful for that.

Answer Source

PHP does try to raise an error, but only if you turn display_errors off. This is odd, because the display_errors setting is only meant to control whether errors are printed to standard output, not whether an error is triggered. To be clear: with display_errors on, you may see all kinds of other PHP errors, but PHP doesn't merely hide this one; it never triggers it at all. It will not show up in any error log, and no custom error handler gets called. The error just never occurs.

Here's some code that demonstrates this:

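error_reporting(-1);//report all errors
$invalid_utf8_char = chr(193);

ini_set('display_errors', 1);//display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());// NULL: no error was triggered

ini_set('display_errors', 0);//do not display errors to standard output
var_dump(json_encode($invalid_utf8_char));
var_dump(error_get_last());// json_encode(): Invalid UTF-8 sequence in argument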

That bizarre and unfortunate behavior is related to this bug and a few others, and doesn't look like it will ever be fixed.
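The snippet above relies on error_get_last(), which is what was available at the time. As a rough sketch, assuming PHP 5.5 or later (where json_encode() returns false on a failed encode and json_last_error_msg() exists), the failure reason can be read directly without depending on display_errors at all. The variable names here are only for illustration:

```php
<?php
// Assumes PHP 5.5+; json_last_error_msg() is not available before that.
$invalid_utf8_char = chr(193); // a lone 0xC1 byte is never valid UTF-8

$json = json_encode($invalid_utf8_char);

if ($json === false) {
    // json_last_error() returns one of the JSON_ERROR_* constants,
    // json_last_error_msg() a human-readable description of it.
    $code = json_last_error();
    $message = json_last_error_msg();
    echo "json_encode failed ($code): $message\n";
}
```

Checking the return value against false (with ===) matters here, because a successful encode can legitimately produce the string "null".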


Cleaning the string before passing it to json_encode may be a workable solution.

$stripped_of_invalid_utf8_chars_string = iconv('UTF-8', 'UTF-8//IGNORE', $orig_string);
if ($stripped_of_invalid_utf8_chars_string !== $orig_string) {
    // one or more chars were invalid, and so they were stripped out.
    // if you need to know where in the string the first stripped character was, 
    // then see
}
$json = json_encode($stripped_of_invalid_utf8_chars_string);
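If the position of the first bad byte matters (the comment in the code above hints at that), one option is to walk the string and validate it by hand. The function below is a hypothetical helper, not part of PHP or of the original answer. It is a minimal sketch, assuming PHP 7.1+ for the type declarations, and it follows the usual UTF-8 well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF):

```php
<?php
// Hypothetical helper: returns the byte offset of the first invalid
// UTF-8 sequence in $s, or null if the whole string is valid.
function first_invalid_utf8_offset(string $s): ?int
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        // Pick the number of continuation bytes ($n) and the allowed
        // range of the first continuation byte ($lo..$hi).
        if ($b <= 0x7F) { $i++; continue; }
        elseif ($b >= 0xC2 && $b <= 0xDF) { $n = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xE0) { $n = 2; $lo = 0xA0; $hi = 0xBF; } // no overlongs
        elseif ($b === 0xED) { $n = 2; $lo = 0x80; $hi = 0x9F; } // no surrogates
        elseif ($b >= 0xE1 && $b <= 0xEF) { $n = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF0) { $n = 3; $lo = 0x90; $hi = 0xBF; } // no overlongs
        elseif ($b >= 0xF1 && $b <= 0xF3) { $n = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF4) { $n = 3; $lo = 0x80; $hi = 0x8F; } // <= U+10FFFF
        else { return $i; } // 0x80-0xC1 and 0xF5-0xFF cannot start a sequence

        for ($k = 1; $k <= $n; $k++) {
            if ($i + $k >= $len) {
                return $i; // sequence is cut off at the end of the string
            }
            $c = ord($s[$i + $k]);
            $min = ($k === 1) ? $lo : 0x80;
            $max = ($k === 1) ? $hi : 0xBF;
            if ($c < $min || $c > $max) {
                return $i;
            }
        }
        $i += $n + 1;
    }
    return null;
}

// "Düsseldorf" in ISO-8859-1: the 0xFC byte at offset 1 is invalid UTF-8.
var_dump(first_invalid_utf8_offset("D\xFCsseldorf"));
```

The offset is in bytes, not characters, so it can be used directly with substr() or bin2hex() to inspect the offending part of the string.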

The manual says

//IGNORE silently discards characters that are illegal in the target charset.

So by first removing the problematic characters, json_encode() shouldn't, in theory, get anything it will choke on. I haven't verified that the output of iconv with the //IGNORE flag is perfectly compatible with json_encode()'s notion of what valid UTF-8 characters are, so buyer beware: there may be edge cases where it still fails. Ugh, I hate character set issues.
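On newer PHP versions the cleanup step may not be needed at all. As a hedged sketch, assuming PHP 7.2 or later (where the JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE flags were added), json_encode() can be told to drop the invalid bytes or to replace them with U+FFFD, which is close to what the question asks for. The sample data is made up for illustration:

```php
<?php
// Assumes PHP 7.2+, where the JSON_INVALID_UTF8_* flags exist.
$broken = array("D\xFCsseldorf"); // "Düsseldorf" in ISO-8859-1, invalid as UTF-8

// Drop the invalid bytes and encode whatever is left.
$ignored = json_encode($broken, JSON_INVALID_UTF8_IGNORE);

// Replace invalid sequences with U+FFFD so the damage stays visible.
$substituted = json_encode($broken, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($ignored, $substituted);
```

The substitute variant is probably the better fit for debugging, since the replacement character marks exactly where the encoding went wrong instead of hiding it.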
