Kevin Krumwiede - 1 month ago
Java Question

If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?

My question arises from this answer, which says:


Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.


If that's correct, then why is
"ℤ".getBytes(StandardCharsets.UTF_8).length == 3
and
"ℤ".getBytes(StandardCharsets.UTF_16).length == 4
?

Answer

It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 and UTF-16).

0x2124 is only the 'sequence number' of the character in the Unicode table. Unicode itself is nothing more than a mapping from such sequence numbers to characters. That sequence number is called a code point, and it is usually written as a hexadecimal number (here U+2124).
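
For example, in Java you can read that number straight from the string (a small illustration; any single BMP character behaves the same way):

int cp = "\u2124".codePointAt(0);            // "\u2124" is the character ℤ
System.out.println(Integer.toHexString(cp)); // prints 2124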

How that number is actually turned into bytes depends on the encoding, and the result may take up more bytes than the raw code point value suggests.


Short calculation of the UTF-8 encoding of the given character:
To mark which bytes belong to the same character, UTF-8 uses a scheme where the first byte of a multi-byte sequence starts with a certain number (let's call it N) of 1 bits followed by a 0 bit; N is the number of bytes the character takes up. Each of the remaining N - 1 bytes starts with the bits 10. (Single-byte characters, i.e. plain ASCII, simply start with a 0 bit.)
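
Under that scheme a 1-byte sequence has room for 7 payload bits, a 2-byte sequence for 11, a 3-byte sequence for 16, and a 4-byte sequence for 21, so the byte count follows directly from the size of the code point. A minimal sketch (the method name utf8Length is just for illustration, and surrogate/range checks are omitted):

// Number of bytes UTF-8 needs for a given code point.
static int utf8Length(int codePoint) {
    if (codePoint <= 0x7F)   return 1; // fits in 7 payload bits
    if (codePoint <= 0x7FF)  return 2; // fits in 11 payload bits
    if (codePoint <= 0xFFFF) return 3; // fits in 16 payload bits
    return 4;                          // fits in 21 payload bits
}
// utf8Length(0x2124) == 3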

Hex 0x2124 = binary 10 000100 100100 (14 significant bits, grouped to match the split below)

According to the rules above, this converts to the following UTF-8 encoding:

11100010 10000100 10100100    <-- Our UTF-8 encoded result
^   ^ ^  ^ ^      ^ ^
AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below

A is the set of 1 bits (followed by a 0 bit) that denotes the number of bytes belonging to this character (three 1s = three bytes).
B is zero padding: the code point has 14 significant bits, while a three-byte sequence has room for 16 payload bits, so two leading zeros fill the gap.
C marks the continuation bits: each subsequent byte starts with 10.
D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes in UTF-8. As for the UTF-16 result: the character itself needs only a single 16-bit code unit (two bytes, 0x21 0x24), but Java's StandardCharsets.UTF_16 charset also writes a two-byte byte order mark (0xFE 0xFF) when encoding, which is where the four bytes come from.
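
You can check both results by dumping the raw bytes (a quick sketch; the class name ZedBytes is just for this example, and the hex values in the comments are what it should print):

import java.nio.charset.StandardCharsets;

public class ZedBytes {
    public static void main(String[] args) {
        String z = "\u2124";  // ℤ

        // UTF-8: the three bytes derived above (E2 84 A4)
        for (byte b : z.getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02X ", b & 0xFF);
        System.out.println();

        // UTF-16: BOM (FE FF) followed by the single code unit (21 24)
        for (byte b : z.getBytes(StandardCharsets.UTF_16))
            System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}

If you want the two-byte encoding without a BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE instead.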