user1126515 - 2 months ago
Java Question

java, utf8, international characters and byte interpretation

I have a String that gets input to my program.

4 letters A, O, "E with an umlaut", L

The hex code for "E with an umlaut" is 0xc38b. see UTF-8 encoding table and Unicode characters and look for "LATIN CAPITAL LETTER E WITH DIAERESIS"

And then it gets weird

My java code is not printing "E with an umlaut" but "A with a ~" followed by 0x8b

When I convert the string to a byte array and then print it out as hex, my 4 character string becomes 7 bytes:

byte[0]=41 "A"
byte[1]=4f "O"
byte[2]=c3 c383 is "A with a ~" (per above link)
byte[3]=83
byte[4]=c2 c28b is some kind of control character (per above link)
byte[5]=8b
byte[6]=4c "L"


I have verified my encoding is UTF-8 via Charset.defaultCharset()

It almost looks like it's interpreting the bytes incorrectly, but how is that possible?

Can anyone shed any light on why the byte interpretation of this string is getting screwed up and how I can correct it?
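For reference, the exact 7-byte dump above can be reproduced by taking the UTF-8 bytes of "AOËL", mis-decoding them as Latin-1, and re-encoding the result as UTF-8. This is a sketch of that round trip (assuming such a mis-decode happened somewhere upstream of your program):

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // "AOËL" — Ë is U+00CB, which UTF-8 encodes as the two bytes 0xC3 0x8B
        String original = "AO\u00CBL";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Misreading those UTF-8 bytes as Latin-1 turns Ë into two chars:
        // Ã (0xC3, "A with a ~") and the control character 0x8B
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);

        // Re-encoding the misread 5-char string as UTF-8 yields 7 bytes:
        // 41 4f c3 83 c2 8b 4c — the dump from the question
        for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```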

Answer

Yes, everything is correct. Unicode characters above U+007F, i.e. outside 7-bit ASCII, are encoded with multiple bytes, like the Ë (U+00CB) used in Dutch. Every byte of such a multi-byte sequence has its high bit set. In other character sets, like the Windows single-byte character sets, those bytes will show up as two or more weird characters.

String s = "Zee\u00EBn van tijd in Belgi\u00EB\r\n";
Path path = Paths.get("C:/temp/test.txt");
Files.write(path, ("\uFEFF" + s).getBytes(StandardCharsets.UTF_8));

The above writes a text file with a BOM character (U+FEFF, a zero width no-break space) at the beginning. The BOM is an ugly redundancy, but it helps Windows Notepad recognize the file as UTF-8.
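The BOM character itself becomes three bytes in the file, which is easy to verify with a minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    public static void main(String[] args) {
        // U+FEFF encodes to three bytes in UTF-8: ef bb bf
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        for (byte b : bom) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```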


Clarification

The Unicode character U+C38B (in Java, the char '\uC38B') is actually a Hangul syllable, not Ë. The 0xC38B from the question is the two-byte UTF-8 encoding of Ë, not a code point. The char '\uC38B' itself is converted to 3 bytes in UTF-8.
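A quick sketch checking that the Hangul char is three UTF-8 bytes, none of which is 0xC3 or 0x8B:

```java
import java.nio.charset.StandardCharsets;

public class HangulBytes {
    public static void main(String[] args) {
        // The char '\uC38B' (a Hangul syllable) encodes to three UTF-8 bytes
        byte[] bytes = "\uC38B".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02x ", b); // ec 8e 8b — not c3 8b
        }
        System.out.println();
    }
}
```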

Ë actually is U+00CB, or '\u00CB' as a Java char. Its UTF-8 byte representation is 0xC3 0x8B, which decodes as follows:

String s = new String(new byte[]{ (byte)0xC3, (byte)0x8B}, 0, 2, StandardCharsets.UTF_8);

That UTF-8 is something totally different from simply splitting the (sequential) Unicode code point into bytes serves several purposes: every byte of a multi-byte sequence is recognizable as either a start byte or a continuation byte, and normal ASCII characters like '/' can never appear as part of such a sequence. So normal ASCII is safe.
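That self-describing structure can be used directly, for example to count code points in a UTF-8 byte array without decoding it. This is a sketch illustrating the bit patterns, not library code:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Structure {
    // A UTF-8 continuation byte always has the bit pattern 10xxxxxx;
    // ASCII bytes start with 0, and multi-byte start bytes with 11.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // "AOËL": 4 code points, 5 UTF-8 bytes (Ë takes two)
        byte[] utf8 = "AO\u00CBL".getBytes(StandardCharsets.UTF_8);
        int codePoints = 0;
        for (byte b : utf8) {
            if (!isContinuation(b)) {
                codePoints++; // count only ASCII and start bytes
            }
        }
        System.out.println(codePoints); // prints 4
    }
}
```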
