My question arises from this answer, which says:
Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.
"ℤ".getBytes(StandardCharsets.UTF_8).length == 3
"ℤ".getBytes(StandardCharsets.UTF_16).length == 4
It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8, UTF-16).
0x2124 is only the 'sequence number' of the character in the Unicode table. Unicode is nothing more than a mapping from such sequence numbers to characters. That sequence number is called a code point, and it's usually written as a hexadecimal number.
How that code point is encoded may take up more bytes than the raw code point value alone would need.
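To make the distinction concrete, here is a minimal sketch in Java (class name is mine) showing that the same single code point yields different byte counts depending on the encoding — note that Java's `UTF_16` charset also writes a 2-byte byte-order mark, which is why the question saw 4 bytes rather than 2:

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsEncoding {
    public static void main(String[] args) {
        String z = "ℤ";  // code point U+2124

        // One code point, one char (ℤ is in the Basic Multilingual Plane):
        System.out.println(z.codePointAt(0) == 0x2124);               // true
        System.out.println(z.length());                               // 1

        // The byte count depends on the chosen encoding:
        System.out.println(z.getBytes(StandardCharsets.UTF_8).length);   // 3
        System.out.println(z.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(z.getBytes(StandardCharsets.UTF_16).length);   // 4 (2 + BOM)
    }
}
```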
Short calculation of UTF-8 encoding of given character:
To know which bytes belong to the same character, UTF-8 uses a scheme where the first byte starts with a certain number (let's call it N) of 1 bits, followed by a 0 bit. N is the number of bytes the character takes up. The remaining N − 1 bytes each start with the bits 10. (A character that fits in a single byte simply starts with a 0 bit.)
Hex 0x2124 = binary 10 000100 100100 (14 significant bits)
According to the rules above, this converts to the following UTF-8 encoding:
11100010 10000100 10100100   <-- our UTF-8 encoded result
AaaaBbDd CcDddddd CcDddddd   <-- some notes, explained below
A is a set of ones (followed by a zero) which denotes the number of bytes belonging to this character (three 1s = three bytes).
B is padding: the code point has only 14 significant bits, but the three-byte pattern leaves room for 16, so the top two bits are filled with zeros.
C is the continuation marker: each subsequent byte starts with the bits 10.
D is the actual bits of our code point.
So indeed, the character ℤ takes up three bytes in UTF-8.
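The bit layout above can be verified by building the three bytes by hand — a minimal sketch (the helper `encode3` is mine, and it assumes a code point in the three-byte range U+0800..U+FFFF) that should match what `getBytes` produces:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ManualUtf8 {
    // Encode a code point in the 3-byte UTF-8 range (U+0800..U+FFFF).
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // 1110 + top 4 payload bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encode3(0x2124);
        byte[] library = "ℤ".getBytes(StandardCharsets.UTF_8);
        // 0xE2 0x84 0xA4 == 11100010 10000100 10100100, as in the diagram above
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```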