fractal5 - 1 month ago
JSON Question

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using

substr($originalText,0,250);

The nth character is an en-dash, so I get â€ as the last character when I view the log file in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.

I also cannot run json_encode on this string.

However, when I use
substr($originalText,0,251)
it works just fine: I can open the log file, and it shows an en-dash instead of â€. json_encode also works fine.

I can use
mb_convert_encoding($mystring, "UTF-8", "Windows-1252")
to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, after doing this, my log file shows â€ in Brackets, which is confusing too.

My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?

Hopefully my question is clear; if not, I can try to explain further.

Thanks.

pid
Answer

UTF-8 extends ASCII with multi-byte sequences (a lead byte followed by one to three continuation bytes) to accommodate many more characters.

A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
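
To make the byte counts concrete, here is a small sketch that prints how many bytes a few sample characters take in UTF-8. The sample characters and the $samples array are my own choice for illustration; note that strlen() counts bytes, not characters.

```php
<?php
// strlen() returns the number of BYTES, so it shows the encoded size of each character.
$samples = [
    'A'       => "A",          // U+0041, plain ASCII
    'e-acute' => "\u{00E9}",   // é
    'en-dash' => "\u{2013}",   // –
    'emoji'   => "\u{1F600}",  // 😀
];

foreach ($samples as $name => $char) {
    // bin2hex() shows the raw bytes, e.g. e28093 for the en-dash
    printf("%-8s %d byte(s): %s\n", $name, strlen($char), bin2hex($char));
}
```

The en-dash from the question comes out as three bytes (E2 80 93), so it is the second diagram below that applies to your case.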

substr() counts bytes, not characters, so you cut the string right in the middle of a multi-byte character:

[<-character->]
[byte-0|byte-1]
       ^
      You cut the string right here in the middle!


[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
      Or anywhere here if it's 3 bytes long.

So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
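
The situation can be reproduced with a made-up string. This is only a sketch: the content of $originalText here is an assumption (248 ASCII bytes followed by an en-dash), chosen so that byte 250 falls inside the en-dash as in the question.

```php
<?php
// Hypothetical stand-in for the asker's $originalText.
$originalText = str_repeat('a', 248) . "\u{2013}" . ' more text';

$cut250 = substr($originalText, 0, 250); // ends with E2 80: incomplete character
$cut251 = substr($originalText, 0, 251); // ends with E2 80 93: complete en-dash

$valid250 = mb_check_encoding($cut250, 'UTF-8'); // false: not valid UTF-8
$json250  = json_encode($cut250);                // false: encoding fails
$error250 = json_last_error();                   // JSON_ERROR_UTF8
$json251  = json_encode($cut251);                // a JSON string, no error

var_dump($valid250, $json250, $error250 === JSON_ERROR_UTF8, $json251 !== false);
```

With one more byte the character is complete, which is why the 251 version works and the 250 version does not.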

This causes all the effects you are witnessing.
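
That includes the â€ you see. A short sketch of what the Windows-1252 conversion does with the leftover bytes (the variable names are mine):

```php
<?php
// What remains of the UTF-8 en-dash (E2 80 93) after the cut: its first two bytes.
$partial = "\xE2\x80";

// Read as Windows-1252, these are two separate characters: â (0xE2) and € (0x80).
$asCp1252 = mb_convert_encoding($partial, 'UTF-8', 'Windows-1252');

var_dump($asCp1252 === "\u{00E2}\u{20AC}");   // bool(true): the string is now "â€"
var_dump(mb_check_encoding($asCp1252, 'UTF-8')); // bool(true): valid UTF-8, but not an en-dash
```

So the conversion does not repair anything. It turns the two stray bytes into two valid characters, which makes json_encode and Brackets happy but leaves â€ in the text.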

The solution to this problem is here in Dezza's answer.
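
For completeness, a minimal sketch of one common way to avoid the problem, using the multibyte-aware mbstring functions. This is not necessarily the approach in the linked answer, and $originalText is again a made-up example.

```php
<?php
// Hypothetical input: 248 ASCII bytes, then an en-dash, then more text.
$originalText = str_repeat('a', 248) . "\u{2013}" . ' more text';

// mb_substr() counts CHARACTERS: the first 250 characters, never splitting one.
$byChars = mb_substr($originalText, 0, 250, 'UTF-8');

// mb_strcut() counts BYTES but backs off to a character boundary,
// so the result is at most 250 bytes and still valid UTF-8.
$byBytes = mb_strcut($originalText, 0, 250, 'UTF-8');

var_dump(json_encode($byChars) !== false); // bool(true)
var_dump(json_encode($byBytes) !== false); // bool(true)
var_dump(strlen($byBytes));                // int(248): the partial en-dash was dropped
```

Which of the two fits depends on whether your limit is a number of characters or a number of bytes.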