Typo Typo - 6 months ago 25
Java Question

Java Inflate inconsistent with large Strings

I want to recover a compressed medium-length string (665 chars) using the java.util.zip package. The compression is done by this code:

public String compress(String s){
    Deflater def = new Deflater(9);
    byte[] buffer = new byte[s.length()];
    String rta = "";
    def.setInput(s.getBytes());
    def.finish();
    def.deflate(buffer);
    rta = new String(buffer);
    rta = rta.trim().concat("*" + Integer.toString(s.length()));
    // this addition at the end is used to recover the original length of the string to dynamically create the buffer later on.
    return rta;
}


And the code to decompress is this:

public String decompress(String s){
    String rta = "";
    Inflater inf = new Inflater();
    byte[] buffer = separoArray(s, true).getBytes(); // This function returns the compressed string (true) or the original length (false)
    int len = Integer.valueOf(separoArray(s, false));
    byte[] decomp = new byte[len];
    inf.setInput(buffer);
    try {
        inf.inflate(decomp, 0, len);
        inf.end();
    } catch (DataFormatException e) {e.printStackTrace();}
    rta = new String(decomp);
    return rta;
}


And these are the original string and the decompressed one:

Original:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed rutrum imperdiet consequat. Nulla eu sapien tincidunt, pellentesque ipsum in, luctus eros. Nullam tristique arcu lorem, at fringilla lectus tincidunt sit amet. Ut tortor dui, cursus at erat non, interdum imperdiet odio. In hac habitasse platea dictumst. Nulla facilisi. Duis eget auctor nibh. Cras ante odio, dignissim et sem id, ultrices imperdiet erat. Aenean ut purus hendrerit, bibendum massa non, accumsan orci. Morbi quis leo sed mauris scelerisque vulputate. Fusce gravida facilisis ipsum pellentesque euismod. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae"

Decompressed:

"Lorem ipsuAdolor sit amet, consectetur adipiscing elit. Sed rutrsuAimperdiet consequat. Nulla eu sapien tincidunt, pellentesquem ipsuAin, luctus eros. Nullam tristiquemarcu lLore, at fringilla lectus tincidunt sit amet. Ut tortor dui, cursus at erat non, interdsuAimperdiet odsAimpeIn hac habitasse platea dius ms Nulla eufacilisi. Duierog odatus dunibh. Craat erddsAim, dignissim odsm ipd, ulistcesmperdiet odat n. Aenean ut pur athendreri pebibendAimmassaon, inacc msan orci. Morbi quierleodsmdmmausti sceleriuem ivulputate. Fusce gravideufacilisisipsuAinllentesquem ieuiemod. VeiqubulAin erddpsuAinlrimisipnufaucubus orciuctus erot ulistcesmposuereursbilia Cura"

The differences are visible. Why is this happening, and what can I do to avoid it?

Answer

I agree with the commenters that compressed data is better kept as a byte[]. However, with a single-byte encoding like ISO-8859-1 one can (ab)use a String as a byte container, because every byte value maps to exactly one char and back.
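
To see why ISO-8859-1 is safe here while other encodings are not, here is a small check that pushes every possible byte value through a byte[] to String to byte[] round trip. The class and method names are made up for this illustration; only the standard library calls are real.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the answer's code.
class Latin1RoundTrip {
    // Builds an array that holds every possible byte value exactly once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if decoding the bytes to a String and encoding them again
    // with the same charset gives back the original array.
    static boolean survives(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1: " + survives(all, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + survives(all, StandardCharsets.UTF_8));
    }
}
```

ISO-8859-1 reports true and UTF-8 reports false, since bytes that do not form valid UTF-8 are replaced while decoding. Compressed output is essentially arbitrary bytes, so it needs the lossless mapping.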

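The following differs from your version in that it names the encoding explicitly. Without a charset argument, new String(bytes) and getBytes() use the platform default encoding, which may replace byte sequences it cannot decode and so corrupt the compressed data. For the text itself UTF-8 is a good choice, as it covers the full Unicode range. The stored length is the UTF-8 byte count, not s.length(), since the two differ for non-ASCII text.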

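Note the use of the deflate return value: it is the number of compressed bytes actually written to the buffer, so only those bytes are kept. Your version called trim() on the whole buffer instead, which strips every leading and trailing char up to U+0020 and can remove bytes that belong to the compressed data.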

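import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public static String compress(String s) {
    Deflater def = new Deflater(9);
    byte[] sbytes = s.getBytes(StandardCharsets.UTF_8);
    def.setInput(sbytes);
    def.finish();
    // Assumes the compressed form is no larger than the input. That holds
    // for text like the sample, but not for very short or random data.
    byte[] buffer = new byte[sbytes.length];
    int n = def.deflate(buffer); // number of compressed bytes written
    def.end(); // release native resources
    // ISO-8859-1 maps each byte to exactly one char, so nothing is lost.
    return new String(buffer, 0, n, StandardCharsets.ISO_8859_1)
            + "*" + sbytes.length;
}

public static String decompress(String s) {
    // The original byte length follows the last '*'.
    int pos = s.lastIndexOf('*');
    int len = Integer.parseInt(s.substring(pos + 1));
    s = s.substring(0, pos);

    Inflater inf = new Inflater();
    byte[] buffer = s.getBytes(StandardCharsets.ISO_8859_1);
    byte[] decomp = new byte[len];
    inf.setInput(buffer);
    try {
        inf.inflate(decomp, 0, len);
    } catch (DataFormatException e) {
        throw new IllegalArgumentException(e);
    } finally {
        inf.end(); // release native resources even on failure
    }
    return new String(decomp, StandardCharsets.UTF_8);
}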
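
If you can change the method signatures, a byte[]-based variant along the lines the commenters suggested might look like the sketch below. It is not the code above: the class name ByteCompression and the 1024-byte chunk size are arbitrary choices for illustration. It loops over deflate/inflate until the stream is finished, so it needs neither the "*length" suffix nor the assumption that the output fits into a buffer of the input's size.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class; a sketch, not a drop-in replacement.
class ByteCompression {
    static byte[] compress(String s) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            def.setInput(s.getBytes(StandardCharsets.UTF_8));
            def.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            // Keep draining until the deflater has emitted the whole stream.
            while (!def.finished()) {
                int n = def.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            def.end();
        }
    }

    static String decompress(byte[] compressed) {
        Inflater inf = new Inflater();
        try {
            inf.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            while (!inf.finished()) {
                int n = inf.inflate(chunk);
                out.write(chunk, 0, n);
                // No progress and not finished: the input is incomplete
                // (or needs a preset dictionary), so stop instead of looping.
                if (n == 0 && !inf.finished()) {
                    throw new IllegalArgumentException("Incomplete compressed data");
                }
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inf.end();
        }
    }
}
```

Usage would be decompress(compress(text)). If the result has to travel as text, the byte[] can still be wrapped afterwards, for example with the ISO-8859-1 conversion shown above.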