Tim Petri - 1 year ago
Python Question

How can I decode a utf-8 byte array to a string in Python2?

I have an array of bytes representing a UTF-8 encoded string. I want to decode these bytes back into the string in Python2. I am relying on Python2 for my overall program, so I cannot switch to Python3.

array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

-> Café Flora

Since not every character in the string I want is necessarily represented by exactly one byte in the array (here the two bytes -61, -87 together encode the single character é), I cannot use a solution like:

"".join(map(chr, array))

I tried to write a function that steps through the array and, whenever it encounters a number outside the range 0-127 (ASCII), creates a new 16-bit int, shifts the current bits 8 places to the left, and then adds the following byte using a bitwise OR. Finally, it would use unichr() to decode the result.

def decode(byte_array):
    result = []

    for i in range(len(byte_array)):
        x = byte_array[i]
        if x < 0:
            # Non-ASCII byte: combine it with the following byte
            b16 = x & 0xFFFF  # 16 bit
            b16 = b16 << 8
            b16 = b16 | byte_array[i+1]

    return "".join(result)

However, this was unsuccessful.
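
For reference, a minimal sketch of why the shift-and-OR approach does not produce the right code point. In UTF-8, a two-byte sequence has the bit layout 110xxxxx 10xxxxxx, so only the x bits carry the character: 5 from the first byte and 6 from the second. The marker bits have to be masked off before combining, and the signed values first need to be reduced to 0..255. The function name below is hypothetical, and the sketch handles only two-byte sequences (no three- or four-byte sequences, no overlong-encoding checks).

```python
def decode_two_byte_sequence(lead, continuation):
    """Return the code point of one two-byte UTF-8 sequence.

    Hypothetical helper for illustration; inputs are signed byte values.
    """
    lead &= 0xFF          # -61 -> 0xC3 (binary 110 00011)
    continuation &= 0xFF  # -87 -> 0xA9 (binary 10 101001)
    if lead >> 5 != 0b110 or continuation >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep 5 payload bits from the lead byte and 6 from the continuation byte.
    return ((lead & 0x1F) << 6) | (continuation & 0x3F)


code_point = decode_two_byte_sequence(-61, -87)
print(hex(code_point))
```

The result is the code point U+00E9 (é), which is what unichr() would need as input in Python2.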

The following article explains the issue very well and includes a Node.js solution:


Answer

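You can use struct.pack for this. The "b" format code packs each value as a signed char, so the negative numbers become the corresponding raw bytes: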

>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a),*a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a),*a).decode('utf8')
Café Flora
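
An alternative sketch that avoids struct, assuming the input really is a list of signed byte values in the range -128..127: mask each value with 0xFF to get the unsigned byte, collect the bytes in a bytearray, and decode that. The function name is hypothetical. This is not the method from the answer above, just another way to reach the same bytes.

```python
def decode_signed_utf8(values):
    """Decode a list of signed byte values (-128..127) as UTF-8 text.

    Hypothetical helper for illustration.
    """
    # Masking with 0xFF maps each signed value to its unsigned byte (0..255).
    raw = bytearray(v & 0xFF for v in values)
    return raw.decode("utf-8")


a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
text = decode_signed_utf8(a)
print(repr(text))
```

In Python2 the return value is a unicode object. The 11 input bytes decode to 10 characters, since -61 and -87 together form the single character é.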