leonlai leonlai - 5 months ago 37
Java Question

Write 16 bits character to .xlsx file using Apache POI in Java

I have a problem in Apache POI.
The problem is that when I try to write a 16-bit character value (such as a character from CJK Unified Ideographs Extension B) to an .xlsx file, the cell value becomes question marks (like ????) in the generated .xlsx file.

Does anyone know how to handle 16-bit character values in Apache POI with the .xlsx format?

My POI version is 3.14

Code sample as below:

XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.createSheet("Test");

XSSFRow row1 = sheet.createRow(0);
XSSFCell r1c1 = row1.createCell(0);
r1c1.setCellValue("

Answer

The problem exists, but not with 16-bit (2-byte) Unicode characters from 0x0000 to 0xFFFF. It affects characters that need more than 2 bytes in the UTF-16 encoding, that is, code points above U+FFFF. The Java Character documentation describes the distinction: "Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding." The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters (characters whose code points are greater than U+FFFF) are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
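To make the surrogate-pair representation concrete, here is a minimal sketch using only java.lang. The class name SurrogateDemo and the helper fromCodePoint are made up for illustration; U+20000 is chosen because it is the first code point of CJK Unified Ideographs Extension B.

```java
public class SurrogateDemo {
    // Illustrative helper: builds a String holding one code point,
    // which takes two chars when the code point is above U+FFFF.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x20000);

        // One code point, but two UTF-16 code units (chars).
        System.out.println(s.length());                      // prints 2
        System.out.println(s.codePointCount(0, s.length())); // prints 1

        // The two chars are a high surrogate followed by a low surrogate.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // prints D840 DC00
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // prints true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // prints true
    }
}
```

So any code that inspects such a string one char at a time only ever sees values in the surrogate ranges, never the value 0x20000 itself.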

The problem is in org.apache.xmlbeans.impl.store.Saver, which works with a private char[] _buf. Since the maximum char value is 0xFFFF, Unicode code points from 0x10000 to 0x10FFFF cannot be stored in a single char, so they are stored as a pair of char values.

There is a method

    /**
     * Test if a character is valid in xml character content. See
     * http://www.w3.org/TR/REC-xml#NT-Char
     */

    private boolean isBadChar ( char ch )
    {
        return ! (
            (ch >= 0x20 && ch <= 0xD7FF ) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            (ch >= 0x10000 && ch <= 0x10FFFF) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
            );
    }

That code is buggy: it checks whether a char lies between 0x10000 and 0x10FFFF, which, as mentioned above, can never be true for a 16-bit char.

It also treats the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars, so code points represented as a pair of char values are rejected.

So the problem results from a bug in org.apache.xmlbeans.impl.store.Saver.
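The effect of the check quoted above can be reproduced outside xmlbeans. The sketch below copies the logic into a standalone class; the class name BadCharDemo and the helper replaceBadChars are hypothetical, and the replacement with '?' is an assumption made to mirror the ???? seen in the generated file, not a copy of what Saver does internally.

```java
public class BadCharDemo {
    // Same logic as the isBadChar method quoted above.
    static boolean isBadCharOriginal(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xD7FF) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            (ch >= 0x10000 && ch <= 0x10FFFF) || // never true: a char is at most 0xFFFF
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    // Hypothetical helper (not part of xmlbeans): replaces every rejected
    // char with '?', assuming that is how the question marks arise.
    static String replaceBadChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            out.append(isBadCharOriginal(ch) ? '?' : ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "A" followed by U+20000, which Java stores as the chars D840 DC00.
        String s = "A" + new String(Character.toChars(0x20000));
        System.out.println(replaceBadChars(s)); // prints "A??"
    }
}
```

Both halves of the surrogate pair are rejected individually, so one supplementary character turns into two question marks under this assumption.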


Patch:

Goal: Do not treat the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars, so that Unicode code points from U+10000 upward, which are stored as two 16-bit chars, are no longer excluded from the XML.

Download Saver.java. Change the private boolean isBadChar ( char ch ) to

    /**
     * Test if a character is valid in xml character content. See
     * http://www.w3.org/TR/REC-xml#NT-Char
     */
    private boolean isBadChar ( char ch )
    {
        return ! (
            (ch >= 0x20 && ch <= 0xFFFD ) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
            );
    }

in both static final class OptimizedForSpeedSaver and static final class TextSaver.
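One thing to keep in mind is that the relaxed check accepts every char in the surrogate ranges, including a surrogate that is not part of a valid pair. As a rough sketch of what a stricter test could look like, the class below puts the patched check next to a code-point-based validation. XmlCharCheck and isValidXmlContent are hypothetical names and are not part of xmlbeans; the ranges are taken from the Char production referenced in the Javadoc comment above.

```java
public class XmlCharCheck {
    // The relaxed per-char check from the patch above.
    static boolean isBadCharPatched(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xFFFD) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    // Hypothetical stricter variant: walks the text by code point, so a
    // valid surrogate pair is judged as one value in 0x10000..0x10FFFF,
    // while an unpaired surrogate falls in 0xD800..0xDFFF and is rejected.
    static boolean isValidXmlContent(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!ok) {
                return false;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return true;
    }
}
```

With these definitions, isValidXmlContent accepts a string containing U+20000 and rejects a string containing only the lone high surrogate \uD840, whereas isBadCharPatched accepts that lone surrogate.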

Compile Saver.java.

Store a backup of xmlbeans-2.6.0.jar somewhere outside the classpath.

Replace Saver$OptimizedForSpeedSaver.class and Saver$TextSaver.class in xmlbeans-2.6.0.jar -> /org/apache/xmlbeans/impl/store/ with the newly compiled ones.

Now Unicode code points from U+10000 upward will be stored in sharedStrings.xml.


Disclaimer: This is not well tested, so don't use it in production. It is shown here only to describe the problem. Maybe some programmers on xmlbeans.apache.org will find the time to fix org.apache.xmlbeans.impl.store.Saver properly.
