Ionescu Robert - 1 year ago
Python Question

Gzip Python 3 vs Gzip Python 2

The problem: I have older code that takes a Py2 'str' and compresses it with gzip. I want gzip to produce the same output from the same string in Py3, but I can't get it to work.

Python 2 code

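from StringIO import StringIO
from gzip import GzipFile
# input_buffer is a str
string_buffer = StringIO()
gzip_file = GzipFile(fileobj=string_buffer, mode='w', compresslevel=6)
gzip_file.write(input_buffer)
gzip_file.close()  # flushes the compressed data and writes the gzip trailer
out_buffer = string_buffer.getvalue()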

I then tried to migrate the same code to Py3, expecting the exact same result.

Python 3 code

from io import BytesIO
from gzip import GzipFile
# input_buffer is the exact same string that I have on Py2
string_buffer = BytesIO()
gzip_file = GzipFile(fileobj=string_buffer, mode='w', compresslevel=6)
gzip_file.write(bytes(input_buffer, 'utf-8'))
gzip_file.close()  # flushes the compressed data and writes the gzip trailer
out_buffer = string_buffer.getvalue()

What I've noticed is that once I convert the 'str' to bytes, extra characters are added. Those characters are then compressed and show up in the final result, even after I decode the output. Decoding without the 'ignore' flag also fails, because some characters have values larger than the decoder expects.

Any solution for my problem?

To summarize: I have a str, and I want gzip compression in Py2 and Py3 to produce the exact same output. In practice it doesn't, at least with what I've tried.


One problem I see is that even though the inputs hold the same values, they are represented differently, and I want the result to look exactly like it does in Python 2. The first output below (with the b prefix) is from Python 3, the second is from Python 2:

out_buffer =b'\x1f\x8b\x08\x00\x00x\xb0X\x02\xff\xd3\xe6b\xf4\x14rIMK,\xcd)\x89\x0f\xce/-JN=\xb4R%\xd90\xcd\xd8\xd8\xd0\xccX7-\rH\x18\x1a\xa7\x9a\xe9&\xa5\x98\x9b\xe8\xa6X\x1a\xa4\x98\x99\xa7\x19\x19%&\x9b\x1c\x9e\xc8v\xa8\xe1\xd0\xdcC\xbb\x0e\xcd?\xdc\x13vx\x02;\xd3\xe1n\x0e.FM\x15\x00\x03&\xcf\x15S\x00\x00\x00'

out_buffer ='\x1f\x8b\x08\x00\xae|\xb0X\x02\xff\xd3\xe6b\xf4\x14rIMK,\xcd)\x89\x0f\xce/-JN]\xa9\x92l\x98fllhf\xac\x9b\x96\x06$\x0c\x8dS\xcdt\x93R\xccMtS,\rR\xcc\xcc\xd3\x8c\x8c\x12\x93M.\xb25\xcc\xdd5\xffL\xd8\x05v\xa6\xd3\x1c\\\x8c\x9a*\x00\xe9l\xf0\xeaJ\x00\x00\x00'
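
The two outputs can be compared field by field rather than by eye. The following is only a sketch: the helper name gzip_fields is made up for illustration, and it assumes the header has no optional fields (the flags byte is 0 in both outputs), so the fixed header is the first 10 bytes and the trailer is the last 8 bytes, as laid out in the gzip format specification.

```python
import struct

def gzip_fields(header, trailer):
    """Decode the 10-byte fixed gzip header and the 8-byte trailer (hypothetical helper)."""
    # Header: magic (2 bytes), method, flags, mtime (4 bytes), extra flags, OS id.
    _magic, _method, _flags, mtime, _xfl, _os_id = struct.unpack('<HBBIBB', header)
    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32.
    crc32, isize = struct.unpack('<II', trailer)
    return {'mtime': mtime, 'crc32': crc32, 'isize': isize}

# First 10 and last 8 bytes copied from the two out_buffer values above.
py3 = gzip_fields(b'\x1f\x8b\x08\x00\x00x\xb0X\x02\xff', b'\x03&\xcf\x15S\x00\x00\x00')
py2 = gzip_fields(b'\x1f\x8b\x08\x00\xae|\xb0X\x02\xff', b'\xe9l\xf0\xeaJ\x00\x00\x00')

print(py3['isize'], py2['isize'])    # uncompressed lengths: 83 74
print(py3['mtime'] == py2['mtime'])  # False: the timestamps differ
```

Read this way, the Python 3 run compressed 83 bytes while the Python 2 run compressed 74, so the inputs were already different before gzip saw them. The timestamp field differs as well, which on its own would prevent a byte-for-byte match between two runs.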

Answer Source

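In Python 2, input_buffer is a str, which is a sequence of bytes. In Python 3 it is a Unicode string, and encoding it as UTF-8 turns every character above U+007F into two or more bytes, which is where the extra characters come from. Latin-1 maps code points 0 to 255 one-to-one onto single bytes, so to get the same bytes as in Python 2 you have to encode to latin1 in Python 3: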

from gzip import GzipFile
from io import BytesIO
input_buffer = '+\n\x01I\x12Default_Source©$c1f33163-ff63-13e6-bd74-d90d67f22ac4Ñ\x06\x80\x9dº\x9fÌVÐ\x07\x02Ë\x08\n\x01)$'
string_buffer = BytesIO()
# The with block closes the GzipFile, which writes the gzip trailer
with GzipFile(fileobj=string_buffer, mode='w', compresslevel=6) as gzip_file:
    gzip_file.write(bytes(input_buffer, 'latin1'))
out_buffer = string_buffer.getvalue()
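
Even with identical input bytes, two gzip runs normally differ in the header timestamp. The sketch below assumes it is acceptable to pin that timestamp through GzipFile's mtime argument. The helper name gzip_bytes is hypothetical, and the input string is the one above with its non-ASCII characters written as escapes.

```python
import gzip
from gzip import GzipFile
from io import BytesIO

def gzip_bytes(data, mtime=0):
    """Compress bytes with a pinned header timestamp (hypothetical helper)."""
    buf = BytesIO()
    with GzipFile(fileobj=buf, mode='wb', compresslevel=6, mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()

# Same string as above, with the non-ASCII characters written as escapes.
input_buffer = ('+\n\x01I\x12Default_Source\xa9$c1f33163-ff63-13e6-bd74-d90d67f22ac4'
                '\xd1\x06\x80\x9d\xba\x9f\xccV\xd0\x07\x02\xcb\x08\n\x01)$')

latin1_bytes = input_buffer.encode('latin-1')  # one byte per character
utf8_bytes = input_buffer.encode('utf-8')      # two bytes for each char above U+007F here

print(len(latin1_bytes), len(utf8_bytes))           # 74 83
out_buffer = gzip_bytes(latin1_bytes)
print(gzip.decompress(out_buffer) == latin1_bytes)  # True
print(out_buffer == gzip_bytes(latin1_bytes))       # True: same input, same mtime
```

Pinning mtime makes repeated runs on one interpreter reproducible. Other header bytes and the deflate stream itself may still vary between Python or zlib versions, so comparing the decompressed payload is the more robust check when a byte-for-byte match across Py2 and Py3 cannot be guaranteed.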