Circumspect Squid - 3 months ago
Java Question

Does specifying the encoding in javac yield the same results as changing the active code page in Windows CMD and then compiling directly?

I am trying to compile a piece of Java code in Windows CMD using Windows-1250 encoding, and I can't seem to get the -encoding option to work right.

The compiler just doesn't seem to use the specified encoding unless there are illegal characters, in which case it just displays the error message. Otherwise it uses the active code page anyway.

In particular, I am trying to display a string containing Albanian characters, specifically 'ë'.

The string I need to display is as follows:

Hëllë Wërld


Here are the commands I am using and the output they produce:

chcp
Output: Active code page: 437

javac -encoding Windows-1250 AlbanianHello.java

java AlbanianHello
Output: Hδllδ Wδrld


As you can see, it still uses the default encoding, which is Cp437, even though I specified the encoding I wish to use.

Now this is what happens when I change the code page to 1250 and then compile without specifying the encoding:

chcp 1250
Output: Active code page: 1250

javac AlbanianHello.java
java AlbanianHello
Output: Hëllë Wërld


Seems to work properly.

Specifying the encoding in this case yields the same results:

chcp 1250
Output: Active code page: 1250

javac -encoding Windows-1250 AlbanianHello.java
java AlbanianHello
Output: Hëllë Wërld


So does it just completely ignore my specified encoding? Not quite. When I specify an encoding that is not supposed to work with my string, it displays a bunch of error messages:

javac -encoding UTF8 AlbanianHello.java
Output: AlbanianHello.java:5: error: unmappable character for encoding UTF8
System.out.println("H?ll? W?rld");
^
...
3 errors


My question is:
Why does it ignore the encoding when it should theoretically work, yet honour it when it shouldn't work?

I would also like to know if there is any difference in the result between these commands:

chcp 1250
javac AlbanianHello.java


And these ones:

chcp 1250
javac -encoding Windows-1250 AlbanianHello.java

cxw
Answer

Welcome to the site! The javac encoding option sets how javac will map the bytes in your source file to Unicode characters, since Java uses Unicode internally. The chcp command sets how the Windows console will map bytes of output to glyphs in a font. Java doesn't know or care about chcp, and vice versa. If both match, all is well. If not...
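To make the decoding step concrete, here is a minimal sketch of what mapping source bytes to Unicode characters looks like. The class and method names (SourceDecodingSketch, decode, isValid) are made up for illustration, and this is not javac's actual code; it only assumes that your editor saved ë as the single byte 0xEB, as Windows-1250 does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SourceDecodingSketch {
    // Assumed file content: "Hëllë Wërld" saved as Windows-1250, where ë is byte 0xEB.
    static final byte[] SOURCE_BYTES = {
        'H', (byte) 0xEB, 'l', 'l', (byte) 0xEB, ' ',
        'W', (byte) 0xEB, 'r', 'l', 'd'
    };

    // Lenient decode: bytes that do not fit the charset become replacement characters.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict check: true only if every byte sequence is legal in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = decode(SOURCE_BYTES, Charset.forName("windows-1250"));
        // Print the code point rather than the glyph, so the console cannot distort it.
        System.out.printf("windows-1250 -> U+%04X%n", (int) text.charAt(1)); // U+00EB
        System.out.println("valid as UTF-8? "
                + isValid(SOURCE_BYTES, StandardCharsets.UTF_8)); // false
    }
}
```

The same bytes are not legal UTF-8 (0xEB would have to start a multi-byte sequence), which is presumably why -encoding UTF8 produces errors while -encoding Windows-1250 compiles silently.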

In your first example, javac correctly interprets your Windows-1250 source: the byte 0xEB is read as ë, which is U+00EB. At run time, Java encodes that character back into a byte for output, and with the default encodings likely to be in play here (see below) that byte is again 0xEB. When byte 0xEB is written to a code-page 437 terminal, the displayed result is what 0xEB means in cp437, regardless of what you wanted to display. Per the CP437 character table, that is lowercase delta, δ. (Just to highlight the difference, δ is U+03B4 in Unicode.)
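The mismatch can be simulated without a console at all. The sketch below assumes Java writes its output as windows-1252 (a guess about your system, not something your output proves) and then reinterprets the written bytes in two different console code pages. ConsoleMismatchSketch and shownOnConsole are hypothetical names.

```java
import java.nio.charset.Charset;

public class ConsoleMismatchSketch {
    // Encode the text the way Java would write it, then decode those bytes
    // the way a console using a given code page would render them.
    static String shownOnConsole(String text, Charset javaOutput, Charset consoleCodePage) {
        byte[] written = text.getBytes(javaOutput);
        return new String(written, consoleCodePage);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this file independent of the source encoding.
        String hello = "H\u00EBll\u00EB W\u00EBrld";
        // Assumption: Java's output encoding is windows-1252, where ë is byte 0xEB.
        Charset assumedOutput = Charset.forName("windows-1252");

        for (String codePage : new String[] { "IBM437", "windows-1250" }) {
            String shown = shownOnConsole(hello, assumedOutput, Charset.forName(codePage));
            // Report the code point instead of the glyph to avoid console effects.
            System.out.printf("%s: second char is U+%04X%n", codePage, (int) shown.charAt(1));
        }
        // IBM437: second char is U+03B4        (δ, as in the chcp 437 run)
        // windows-1250: second char is U+00EB  (ë, as in the chcp 1250 run)
    }
}
```

Under that assumption the byte reaching the console is 0xEB in both runs, and only the console's interpretation of it changes.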

For completeness, it turns out to be surprisingly hard to find out what the default encoding for javac is. The docs for Charset say:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.

Based on the behaviour you saw, my guess is that javac on your system reads the code page from the console and uses that as the default. Alternatively, the default may be a code page in which ë = 0xEB, such as CP1252 or ISO 8859-1; as far as I know, either could be the default depending on your configuration.
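Rather than guessing, you can ask the JVM what it picked. This is a small sketch using Charset.defaultCharset() and the file.encoding system property; the class name DefaultCharsetSketch and the method describeDefaults are invented for the example. It reports what the java launcher sees, which I am assuming (not certain) matches what javac uses when no -encoding is given.

```java
import java.nio.charset.Charset;

public class DefaultCharsetSketch {
    // Summarise the JVM's default charset and the property it is usually derived from.
    static String describeDefaults() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Output is ASCII only, so it reads the same under any console code page.
        System.out.println(describeDefaults());
    }
}
```

Run it once after chcp 437 and once after chcp 1250. If the reported charset does not change, the default does not follow the console code page on your system. Note that on JDK 18 and later the default charset is UTF-8 unless configured otherwise, so the result also depends on your Java version.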
