jmc34 jmc34 -4 years ago 121
Java Question

How to sanitise a string before printing it to PDF with PDFBox

I am creating PDF documents from user inputs that are UTF-8.

The problem goes beyond how the PDFs are displayed: the creation itself fails with

java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding

Most answers here point to "using a font with better UTF-8 support", but since I have no control over user input, no font's coverage is ever going to be good enough. I need a bulletproof solution (as in: print something rather than error out).

The answer to "Using PDFBox to write unicode strings to a PDF" suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All the examples seem to point at code that has since been removed (font.setToUnicode, or some method in the encoding to convert characters one at a time).

So in a nutshell: I have a string, and I want a bulletproof method to write most of it to a PDFBox document (obviously, characters missing from the font will be replaced or not printed).

Many thanks,
JM

Answer

I ended up doing a character-by-character sanitization.

Here is what my sanitization function looks like.

To avoid reprocessing characters, I am caching the availability of each character for each given font.
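
The per-font cache described above could be sketched roughly as follows. This is a minimal, library-independent sketch: `GlyphCache` and the `probe` callback are hypothetical names, and in real code the probe would presumably wrap a call such as `font.encode(...)` in a try/catch. It keys on the font name and the code point separately rather than on a combined hash, so two different pairs cannot collide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/** Hypothetical helper: remembers, per font name, whether a code point can be encoded. */
final class GlyphCache {
    private final Map<String, Map<Integer, Boolean>> byFont = new HashMap<>();

    /**
     * Returns the cached answer for (fontName, codePoint), calling the probe
     * only the first time that pair is seen.
     */
    boolean isAvailable(String fontName, int codePoint, IntPredicate probe) {
        return byFont
                .computeIfAbsent(fontName, name -> new HashMap<>())
                .computeIfAbsent(codePoint, cp -> probe.test(cp));
    }
}
```

The trade-off is one small map per font instead of a single flat map; for a handful of fonts that should be negligible.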

When a code point is not available in a font, I try the "standard" replacement character (U+FFFD); if that is not available either, I fall back to a question mark.
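
The fallback order in the previous paragraph (original code point, then U+FFFD, then a question mark) can be illustrated without PDFBox. In this sketch `Replacement` is a hypothetical helper, and `canEncode` stands in for whatever availability check the font provides.

```java
import java.util.function.IntPredicate;

final class Replacement {
    static final int REPLACEMENT_CHARACTER = 0xFFFD; // Unicode "replacement character"
    static final int QUESTION_MARK = 0x003F;

    /** Picks the code point to print: the original if encodable, else U+FFFD, else '?'. */
    static int pick(int codePoint, IntPredicate canEncode) {
        if (canEncode.test(codePoint)) return codePoint;
        if (canEncode.test(REPLACEMENT_CHARACTER)) return REPLACEMENT_CHARACTER;
        return QUESTION_MARK;
    }

    /** Applies pick() to every code point of the input, leaving line feeds untouched. */
    static String sanitize(String input, IntPredicate canEncode) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp ->
                sb.appendCodePoint(cp == '\n' ? cp : pick(cp, canEncode)));
        return sb.toString();
    }
}
```

As a stand-in for a font (an assumption made only for illustration, not how PDFBox decides glyph availability), one could pass a predicate built on a JDK `CharsetEncoder`, e.g. `cp -> enc.canEncode(new String(Character.toChars(cp)))` with `enc = StandardCharsets.ISO_8859_1.newEncoder()`. With that predicate, the input "a\u039Bb" comes out as "a?b", since neither U+039B nor U+FFFD exists in ISO-8859-1.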

It is admittedly inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.

There might be a lot of things to improve, but this works for my use case.

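// Cache: hash of (font name, code point) -> whether the font can encode it.
// Requires java.util.HashMap, java.util.Map and org.apache.pdfbox.pdmodel.font.PDFont.
private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();

private String getPrintableString(String string, PDFont font) {

    StringBuilder sb = new StringBuilder();
    int fontKey = font.getName().hashCode();

    // Advance by code point so that surrogate pairs are never split.
    for (int i = 0; i < string.length(); ) {

        int codePoint = string.codePointAt(i);
        i += Character.charCount(codePoint);

        // Always keep line feeds.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            continue;
        }

        int cpKey = 31 * fontKey + codePoint;

        Boolean available = codePointAvailCache.get(cpKey);
        if (available == null) {
            try {
                font.encode(new String(Character.toChars(codePoint)));
                available = true;
            } catch (Exception e) {
                // The font cannot encode this code point.
                available = false;
            }
            codePointAvailCache.put(cpKey, available);
        }

        if (!available) {
            // Make sure the font has the replacement character; otherwise use '?'.
            try {
                codePoint = 0xFFFD;
                font.encode(new String(Character.toChars(codePoint)));
            } catch (Exception e) {
                codePoint = 0x003F;
            }
        }

        sb.appendCodePoint(codePoint);
    }

    return sb.toString();
}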
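
One detail worth checking when walking a string by index: code points above U+FFFF occupy two `char` values, so the index has to advance by `Character.charCount(codePoint)` rather than by one, and the text handed to the font should contain the whole pair. The sketch below (a hypothetical `CodePoints.split` helper, independent of PDFBox) shows the pattern.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePoints {
    /** Splits a string into one-code-point strings, keeping surrogate pairs together. */
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            out.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
        }
        return out;
    }
}
```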