Jonathan S. Fisher - 7 months ago
Java Question

Why doesn't Java's `String.toCharArray()` and `new String(char[])` methods accept a charset encoding?


If you're constructing a String from bytes, you can optionally specify a charset with `new String(byte[], charset)`.

I was wondering if there's something about `char[]` and charset encodings that I don't understand. Nothing in the Javadocs seems to explain the difference.
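To make the asymmetry concrete, here is a small sketch contrasting the two kinds of constructors (the class name `CharsetConstructors` is just for illustration). Decoding bytes requires a charset; building a String from chars does not:

```java
import java.nio.charset.StandardCharsets;

public class CharsetConstructors {
    public static void main(String[] args) {
        byte[] bytes = {72, 105};   // "Hi" encoded as ASCII/UTF-8
        char[] chars = {'H', 'i'};

        // byte[] -> String: the bytes must be decoded, so a charset applies
        String fromBytes = new String(bytes, StandardCharsets.UTF_8);

        // char[] -> String: no charset parameter exists; chars are already
        // UTF-16 code units, the form String uses internally
        String fromChars = new String(chars);

        System.out.println(fromBytes.equals(fromChars)); // prints "true"
    }
}
```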


These methods don't perform any encoding or decoding; they simply produce a copy of the String instance's internal state.
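The "copy" part is easy to demonstrate: `toCharArray()` returns a defensive copy, so mutating the returned array never affects the String (class name `CopyDemo` is illustrative):

```java
public class CopyDemo {
    public static void main(String[] args) {
        String s = "hello";
        char[] chars = s.toCharArray(); // defensive copy of the internal chars
        chars[0] = 'H';                 // mutate the copy...
        System.out.println(s);          // ...the String is unchanged: prints "hello"
        System.out.println(new String(chars)); // prints "Hello"
    }
}
```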

Encoding is the process of converting logical characters to a numeric representation, a series of bytes. Think of a String as a sequence of Unicode characters. The String class has APIs to access these characters as 32-bit code points, as a series of 16-bit UTF-16 code units (which happen to be the string's native, internal representation), or as a series of bytes in a chosen encoding. Only in the last case is there an encoding to choose, so only there do you specify a charset.
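The three views can be seen side by side. A character outside the Basic Multilingual Plane, such as the 😀 emoji (U+1F600), is one code point, two UTF-16 code units, and four UTF-8 bytes (class name `StringViews` is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class StringViews {
    public static void main(String[] args) {
        // "A" plus the emoji U+1F600, written as its UTF-16 surrogate pair
        String s = "A\uD83D\uDE00";

        // 32-bit code points: 2 logical characters
        System.out.println(s.codePoints().count());           // prints "2"

        // 16-bit UTF-16 code units -- the String's internal representation
        System.out.println(s.toCharArray().length);           // prints "3"

        // bytes in a chosen encoding -- only here is a charset needed
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // prints "5"
    }
}
```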

Some encodings, like UTF-8, can represent every Unicode character, while many others, like US-ASCII, cover only a tiny subset. The char[]-based APIs don't allow specifying a different representation (say, UTF-16-LE, or UTF-16 with a BOM) because byte order only becomes meaningful once characters are turned into bytes; char values are already UTF-16 code units, and keeping that single canonical form minimizes errors from mismatched encodings.
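The subset point shows up when you round-trip a string through each encoding. UTF-8 is lossless; US-ASCII silently replaces unmappable characters with `?` (class name `LossyEncoding` is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class LossyEncoding {
    public static void main(String[] args) {
        String s = "héllo";

        // UTF-8 represents every Unicode character: the round trip is lossless
        String viaUtf8 = new String(
                s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(viaUtf8);  // prints "héllo"

        // US-ASCII cannot represent 'é'; getBytes substitutes '?'
        String viaAscii = new String(
                s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
        System.out.println(viaAscii); // prints "h?llo"
    }
}
```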