Giovanni Giovanni - 7 months ago 33
Java Question

How to deal with java encoding problems (especially xml)?

I searched for material on Java and encoding, but I did not find a resource explaining how to deal with the common problems that arise in Java when encoding and decoding strings.
There are a lot of specific questions about single errors, but I did not find a broad answer or reference guide to the problem.
The main questions are:

What is String encoding?

Why do files read in Java sometimes show wrong characters?

Why do I get the Invalid byte x of y-byte UTF-8 sequence Exception when dealing with XML? What are the main causes and how can I avoid them?

Answer

Since Stack Overflow encourages self-answers, I will try to answer my own question.

Encoding is the process of converting data from one format to another; this answer details how String encoding works in Java (you may want to read this for a more generic introduction to text and encoding).

Introduction

String encoding/decoding is the process that transforms a byte[] into a String and vice-versa.

At first sight you may think that there are no problems, but if we look more deeply at the process some issues arise. At the lowest level, information is stored and transmitted in bytes: files are sequences of bytes, and network communication is done by sending and receiving bytes. So every time you read or write a file with plain readable content, and every time you submit a web form or read a web page, there is an underlying encoding operation.

Let's start from the basic String encoding operation in Java: creating a String from a sequence of bytes. The following code converts a byte[] (the bytes may come from a file or from a socket) into a String.

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

So far so good, all "simple". The values of the bytes are taken from here, which shows one way to map letters and numbers to bytes. Let's complicate the sample with a simple requirement: the byte[] must contain the € (euro) sign. Oops, there is no euro symbol in the ASCII table.

This can be roughly summarized as the core of the problem: the human-readable characters (together with some other necessary ones such as carriage return, line feed, etc.) number more than 256, so they cannot all be represented with a single byte. If for some reason you must stick with a single-byte representation (e.g. for historical reasons, since the first encoding tables used only 7 bits, or because of space constraints: if disk space is limited and you write text documents only for English readers, there is no need to include Italian accented letters such as è and ì), you have the problem of choosing which characters to represent.

Choosing an encoding is choosing a mapping between bytes and chars.

Coming back to the euro example and sticking with a one byte --> one char mapping, the ISO8859-15 encoding table has the € sign. The sequence of bytes representing the string "hello €" is the following:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};

How do you "tell" Java which encoding to use for the conversion? String has the constructor

    String(byte[] bytes, String charsetName)

which allows you to specify "the mapping". If you use different charsets you get different results, as you can see below:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);  //prints simple1=hello €     

    String simple2=new String(stringInByte1,"ISO8859-1");
    System.out.println("simple2=" + simple2);   //prints simple2=hello ¤

So this explains why you can write some characters and read back different ones: the encoding used for writing (String to byte[]) is different from the one used for reading (byte[] to String). The same byte may map to different characters in different encodings, so some characters may "look strange".
These are the basic concepts needed to understand String encoding; let's complicate the matter a little more. There may be the need to represent more than 256 symbols in one text document; to achieve this, multibyte encodings have been created.

With a multibyte encoding there is no longer a one byte --> one char mapping; instead there is a sequence of bytes --> one char mapping.

One of the best-known multibyte encodings is UTF-8. UTF-8 is a variable-length encoding: some chars are represented with one byte, others with more than one.

UTF-8 coincides with 7-bit US-ASCII for the first 128 values, which are also the values it shares with single-byte encodings such as ISO8859-1; for that range it can be viewed as an extension of those encodings.
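The variable length is easy to observe by counting the bytes UTF-8 produces for a few characters. The following is a minimal sketch; the class name Utf8LengthDemo and the helper utf8Length are made up for the illustration, and the characters are written as Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {

    // Number of bytes UTF-8 needs to represent the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1: plain ASCII letter
        System.out.println(utf8Length("\u00e8"));       // 2: e with grave accent
        System.out.println(utf8Length("\u20ac"));       // 3: euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: a character outside the 16-bit range
        System.out.println(utf8Length("hello"));        // 5: one byte per ASCII letter
    }
}
```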

Let's see UTF-8 in action on the first example:

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

    String simple3=new String(stringInByte, "UTF-8");
    System.out.println("simple3=" + simple3);//this also prints hello (simple3=hello)

As you can see by running the code, it prints hello in both cases, i.e. the bytes that represent hello in UTF-8 and ISO8859-1 are the same.

But if you try the sample with the € sign you get a ?

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);//prints simple1=hello €

    String simple4=new String(stringInByte1, "UTF-8");
    System.out.println("simple4=" + simple4);//prints simple4=hello ?

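meaning that the bytes are not recognized as a valid character and that there is an error (Java substitutes the replacement character U+FFFD, which many consoles show as ?). Note that you get no exception even though there is an error during the conversion.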
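If you would rather get an exception than a silent replacement, the JDK offers CharsetDecoder, which can be told to report malformed input. The sketch below is one possible way to use it, not the only one; the class StrictDecodeDemo and the helpers decodeStrict and isValid are names invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {

    // Decodes the bytes and fails instead of inserting a replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience check: true when the bytes are valid in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "hello" + space + the euro sign as encoded by ISO8859-15
        byte[] helloEuro = {104, 101, 108, 108, 111, 32, (byte) 164};
        System.out.println(isValid(helloEuro, StandardCharsets.UTF_8));          // false
        System.out.println(isValid(helloEuro, Charset.forName("ISO-8859-15"))); // true
    }
}
```

A check like this is useful at the boundary of a system (file import, incoming message), where failing early is usually easier to debug than finding strange characters later.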

Unfortunately not all Java classes behave the same way when dealing with invalid chars; let's see what happens when we deal with XML.

Managing XML

Before going through the examples, it is worth remembering that in Java InputStream/OutputStream read/write bytes and Reader/Writer read/write characters.

Let's try to read the byte sequence of an XML file in two different ways: reading the file to get a String versus reading the file to get a DOM.

    //Create a xml file
    String xmlSample="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<specialchars>àèìòù€</specialchars>";
    try(FileOutputStream fosXmlFileOutputStreame= new FileOutputStream("test.xml")) {
        //write the file with a wrong encoding
        fosXmlFileOutputStreame.write(xmlSample.getBytes("ISO8859-15"));
    }

    try (
            FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
            //read the file with the encoding declared in the xml header
            InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        char[] cbuf=new char[xmlSample.length()];
        inputStreamReader.read(cbuf);
        System.out.println("file read with UTF-8=" + new String(cbuf)); 
        //prints
        //file read with UTF-8=<?xml version="1.0" encoding="UTF-8"?>
        //<specialchars>������</specialchars>
    }


    File xmlFile = new File("test.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(xmlFile);     
    //throws  

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence

In the first case the result is some strange chars but no Exception; in the second case you get an exception (Invalid sequence....). The exception occurs because the parser is reading what UTF-8 marks as a three-byte char and the second byte has an invalid value: in UTF-8 every byte after the first one of a multibyte sequence must have the form 10xxxxxx.
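To see where "byte 2 of 3" comes from, you can inspect the bits of the bytes that ISO8859-15 produces for the special characters of the test file. This is only an illustrative sketch (the class Utf8ByteCheckDemo and its two helpers are invented names, and a real UTF-8 decoder performs more checks than these two bit tests):

```java
import java.nio.charset.Charset;

final class Utf8ByteCheckDemo {

    // A first byte of the form 1110xxxx announces a 3-byte UTF-8 sequence.
    static boolean isThreeByteLead(byte b) {
        return (b & 0xF0) == 0xE0;
    }

    // Every following byte of a multibyte sequence must have the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // The special characters of the test file, written as escapes so that
        // the sample does not depend on the encoding of the source file.
        String special = "\u00e0\u00e8\u00ec\u00f2\u00f9\u20ac";
        byte[] latin9 = special.getBytes(Charset.forName("ISO-8859-15"));
        for (byte b : latin9) {
            System.out.printf("0x%02X lead3=%b continuation=%b%n",
                    b & 0xFF, isThreeByteLead(b), isContinuation(b));
        }
    }
}
```

The first byte (0xE0) looks like the start of a three-byte sequence, but the next one (0xE8) is not a continuation byte, which is exactly the situation the exception message describes.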

The tricky part is that, since UTF-8 overlaps with other encodings for the ASCII range, the Invalid byte 2 of 3-byte UTF-8 sequence exceptions appear to arise at "random" (i.e. only for messages containing characters represented by more than one byte), so in a production environment the error can be difficult to track down and reproduce.

With all this information we can try to answer the following question:

Why do I get Invalid byte x of y-byte UTF-8 sequence Exception when reading/dealing with a xml file?

Because there is a mismatch between the encoding used for writing (ISO8859-15 in the test case above) and the encoding used for reading (UTF-8 in the test case above); the mismatch may have several causes:

  1. you are making a wrong conversion between bytes and chars: for example, if you are reading a file with an InputStream, converting it into a Reader and passing the Reader to the xml library, you must specify the charset name as in the following code (i.e. you must know the encoding used for saving the file)

    try (
            FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
            //this is the reader for the xml library (DOM4J, JDOM for example)
            //UTF-8 is the file encoding; if you specify a wrong encoding or you do not specify any encoding
            //you may face the Invalid byte x of y-byte UTF-8 sequence Exception
            InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        //pass inputStreamReader to the xml library here
    }

  2. you are passing the InputStream directly to the xml library but the file is not correct (as in the first example of managing xml, where the header states UTF-8 but the real encoding is ISO8859-15). Simply putting <?xml version="1.0" encoding="UTF-8"?> in the first line of the file is not enough; the file must be saved with the encoding declared in the header.

  3. you are reading the file with a reader created without specifying an encoding and the platform encoding is different from file encoding:

    FileReader fileReader=new FileReader("text.xml");
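For cause 2 the fix is to make the bytes on disk agree with the header. A minimal sketch, assuming the JDK's built-in DOM parser and a temporary file; the class MatchingEncodingDemo and the method writeAndParse are names invented for the example, and the text passed in is assumed to contain no XML markup characters:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class MatchingEncodingDemo {

    // Writes the xml with the same encoding the header declares, then parses it back.
    static String writeAndParse(String text) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<specialchars>" + text + "</specialchars>";
        try {
            Path file = Files.createTempFile("test", ".xml");
            try {
                // The header says UTF-8, so the bytes are produced with UTF-8 as well.
                Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(file.toFile());
                return doc.getDocumentElement().getTextContent();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same special characters used in the test file above, as escapes.
        System.out.println(writeAndParse("\u00e0\u00e8\u00ec\u00f2\u00f9\u20ac"));
    }
}
```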
    

This leads to the aspect that, at least for me, is the source of most String encoding problems in Java: using the default platform encoding.

When you call

    "Hello €".getBytes();

you can get different results on different operating systems; this is because on Windows the default encoding is typically Windows-1252, while on Linux it is usually UTF-8 (from Java 18 onwards the default charset is UTF-8 on every platform unless it is overridden). The € char is encoded differently, so you get not only different bytes but also different array sizes:

    String helloEuro="hello €";
    //prints hello euro byte[] size in iso8859-15 = 7
    System.out.println("hello euro byte[] size in iso8859-15 = " + helloEuro.getBytes("ISO8859-15").length);
    //prints hello euro byte[] size in utf-8 = 9
    System.out.println("hello euro byte[] size in utf-8 = " + helloEuro.getBytes("UTF-8").length);

Using String.getBytes() or new String(byte[] ...) without specifying an encoding is the first thing to check when you run into encoding issues.
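A simple way to remove the dependency on the platform is to always pass the charset explicitly; the StandardCharsets constants also avoid the checked UnsupportedEncodingException of the String-based overloads. A small sketch (the class ExplicitCharsetDemo and its helpers are names chosen for the example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCharsetDemo {

    // Same bytes on every platform, because the charset is stated explicitly.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What getBytes() without arguments would use on this machine.
        System.out.println("default charset = " + Charset.defaultCharset());

        byte[] bytes = toUtf8("hello \u20ac");
        // prints [104, 101, 108, 108, 111, 32, -30, -126, -84]
        System.out.println(Arrays.toString(bytes));
        System.out.println(fromUtf8(bytes).equals("hello \u20ac")); // true
    }
}
```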

The second one is checking if you are reading or writing files using FileReader or FileWriter; in both cases the documentation states:

The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable

As with String.getBytes(), reading/writing the same file on different platforms with a reader/writer without specifying the charset may lead to different byte sequences, because the default platform encoding differs.

The solution, as the javadoc suggests, is to use an InputStreamReader/OutputStreamWriter that wraps an InputStream/OutputStream together with a charset specification.
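As a rough sketch of that advice, the following writes and reads a file with the charset stated on both sides; the class ExplicitCharsetFileDemo, its methods and the temporary file are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetFileDemo {

    // Writer that encodes with the given charset instead of the platform default.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file.toFile()), charset)) {
            writer.write(text);
        }
    }

    // Reader that decodes with the given charset instead of the platform default.
    static String readFirstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            return reader.readLine();
        }
    }

    // Writes the text to a temporary file and reads it back with the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, charset);
                return readFirstLine(file, charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "hello \u20ac";
        System.out.println(original.equals(roundTrip(original, StandardCharsets.UTF_8))); // true
    }
}
```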

Some final notes on how some xml libraries read XML content:

  1. if you pass a Reader, the library relies on the reader for the encoding (i.e. it does not check what the xml header says) and does not do anything about encoding, since it is reading chars, not bytes.

  2. if you pass an InputStream or a File, the library relies on the xml header for the encoding and it may throw encoding Exceptions
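The difference between the two cases can be tried with the parser bundled with the JDK. The sketch below builds a deliberately mislabelled document in memory (header says UTF-8, bytes are ISO8859-15) and parses it both ways; the class ReaderVersusStreamDemo and its methods are invented for the example, and other xml libraries may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ReaderVersusStreamDemo {

    static final Charset LATIN9 = Charset.forName("ISO-8859-15");
    static final String SPECIAL = "\u00e0\u00e8\u00ec\u00f2\u00f9\u20ac";

    // The header claims UTF-8 but the bytes are produced with ISO8859-15 (deliberate mismatch).
    static byte[] mislabelledXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<specialchars>" + SPECIAL + "</specialchars>";
        return xml.getBytes(LATIN9);
    }

    // Byte stream: the parser decodes the bytes itself, trusting the header.
    // Returns null when parsing fails.
    static String parseFromStream(byte[] xml) {
        try {
            return newBuilder().parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (IOException | SAXException e) {
            return null;
        }
    }

    // Character stream: the Reader has already decoded the bytes,
    // so the encoding declared in the header is not used.
    static String parseFromReader(byte[] xml, Charset actualCharset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), actualCharset)) {
            return newBuilder().parse(new InputSource(reader))
                    .getDocumentElement().getTextContent();
        } catch (IOException | SAXException e) {
            return null;
        }
    }

    private static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = mislabelledXml();
        System.out.println("stream ok = " + (parseFromStream(xml) != null));
        System.out.println("reader ok = " + SPECIAL.equals(parseFromReader(xml, LATIN9)));
    }
}
```

Passing a Reader only helps when you really know the encoding of the bytes; otherwise you simply move the mismatch from the parser to the InputStreamReader.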

Database

A different issue may arise when dealing with databases: when a database is created it has an encoding property that is used to save varchar and other text columns (such as clob). If the database is created with an 8-bit encoding (ISO8859-15 for example), problems may arise when you try to insert chars that the encoding does not allow. What is saved in the db may be different from the string specified at the Java level, because Java strings are represented in memory in UTF-16, which is "wider" than the encoding specified at the database level. The simplest solution is to create your database with a UTF-8 encoding.
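On the Java side you can check in advance whether a string fits into a given encoding with CharsetEncoder.canEncode. This is only a sketch of the idea: the class EncodableCheckDemo and the method fitsIn are invented names, and what a real database or JDBC driver does with unsupported characters depends on the product and its configuration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodableCheckDemo {

    // True when every character of the text can be represented in the charset.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String helloEuro = "hello \u20ac";
        System.out.println(fitsIn(helloEuro, Charset.forName("ISO-8859-15"))); // true
        System.out.println(fitsIn(helloEuro, StandardCharsets.ISO_8859_1));    // false
        System.out.println(fitsIn(helloEuro, StandardCharsets.UTF_8));         // true

        // getBytes does not fail on unsupported characters: it silently
        // substitutes the charset's replacement byte, here 63, i.e. '?'.
        System.out.println("\u20ac".getBytes(StandardCharsets.ISO_8859_1)[0]); // 63
    }
}
```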

Web

This is a very good starting point.

If you feel something is missing feel free to ask for something more in the comments.