famfamfam famfamfam - 3 months ago 9
Java Question

Java, Using Scanner to input characters as UTF-8, can't print text

I can convert a String to a byte array as UTF-8, but I can't convert the array back into the original String.

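import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner h = new Scanner(System.in); // decodes input with the platform default charset
        System.out.println("INPUT : ");
        String stringToConvert = h.nextLine();
        byte[] theByteArray = stringToConvert.getBytes(); // encodes with the platform default charset

        System.out.println(theByteArray); // prints the array's identity (e.g. [B@...), not its contents
        theByteArray.toString(); // no effect: the returned String is discarded
        String s = new String(theByteArray); // decodes with the platform default charset

        System.out.println("" + s);
    }
}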


How do I print theByteArray as a String?

Joe Joe
Answer
String s = new String(theByteArray);

should really be

String s = new String(theByteArray, Charset.forName("UTF-8"));

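The underlying issue here is that the String constructor cannot detect which charset a byte array was encoded with. Unless you pass one, it decodes with the platform default charset (Charset.defaultCharset()), which before Java 18 depends on the OS and locale (for example windows-1252 on many Windows systems) and is UTF-8 from Java 18 onward. ASCII characters such as A-Za-z are encoded identically in ASCII, ISO-8859-1, windows-1252 and UTF-8, so they look right, while everything else begins to fail when the charsets don't match.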
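To see which charset your own JVM falls back to, a minimal sketch like the following may help. The class name DefaultCharsetDemo and the helper decodeUtf8 are made up for illustration; the sample text is written with Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Hypothetical helper: decodes with an explicit charset so the
    // result does not depend on the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset used by new String(byte[]) and String.getBytes()
        // when no charset argument is given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "h\u00e9llo" is "hello" with an e-acute as the second letter.
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes));
    }
}
```

If the first line does not report UTF-8, new String(theByteArray) and getBytes() without arguments are using some other charset on that machine.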

A byte holds values from -128 to 127, and UTF-8 encodes every non-ASCII character as a sequence of two to four bytes that must be decoded together. The String constructor cannot infer this from the byte array alone, so with a single-byte default charset such as ISO-8859-1 it turns each byte into its own character (which is why basic alphanumerics always work: each one fits in a single byte with the same value in all of these charsets).
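As a rough illustration of the multi-byte point, the sketch below counts how many bytes UTF-8 needs for a few characters. Utf8Lengths and utf8Length are hypothetical names, and the characters are given as Unicode escapes to sidestep source-encoding issues.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes UTF-8 uses to encode the text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1: ASCII letter
        System.out.println(utf8Length("\u00e9")); // 2: e-acute, bytes C3 A9
        System.out.println(utf8Length("\u3053")); // 3: hiragana KO, bytes E3 81 93

        // Java bytes are signed, so the lead byte 0xE3 is stored as -29.
        byte first = "\u3053".getBytes(StandardCharsets.UTF_8)[0];
        System.out.println(first);                // -29
    }
}
```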

Example:

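// requires: import java.nio.charset.StandardCharsets;
String text = "こんにちは";
byte[] array = text.getBytes(StandardCharsets.UTF_8); // getBytes("UTF-8") would throw a checked exception
String s = new String(array, StandardCharsets.UTF_8);
System.out.println(s); // Prints as expected
String sISO = new String(array, StandardCharsets.ISO_8859_1);
System.out.println(sISO); // Prints garbled text such as 'ããã«ã¡ã¯'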
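Applied to the program in the question, one possible version looks like the sketch below. It assumes the terminal really sends and displays UTF-8, which the question doesn't confirm; Utf8RoundTrip and roundTrip are names invented here. Arrays.toString is used to print the byte values, since printing the array directly only shows something like [B@...

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Scanner;

public class Utf8RoundTrip {
    // Hypothetical helper: reads one line as UTF-8, encodes it to bytes,
    // prints the byte values, and decodes the bytes back to a String.
    static String roundTrip(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8"); // assumption: input is UTF-8
        String line = scanner.nextLine();
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes)); // readable byte values
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("INPUT : ");
        System.out.println(roundTrip(System.in));
    }
}
```

If the output is still garbled with this version, the console's own encoding is the more likely culprit than the String conversion.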