Paul Würtz - 1 year ago
Python Question

Python read non-ascii text file

I am trying to load a text file, which contains some German letters with


which results in this error message

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

If I modify the file to contain only ASCII characters, everything works as expected.

Apparently using




both do the job.

Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?

Answer Source

ASCII is limited to byte values in the range [0,128). If you try to decode a byte outside that range with the ASCII codec, you get that error.
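A minimal sketch of that failure, using a hypothetical in-memory byte string in place of the file's contents (the sample word "für" is an assumption, chosen because ü is encoded in UTF-8 as the bytes 0xc3 0xbc):

```python
# Hypothetical sample standing in for the file's contents.
raw = "für".encode("utf-8")   # b'f\xc3\xbcr'

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # The first byte with a value >= 128 triggers the error.
    error_message = str(exc)
    print(error_message)
```

The printed message names the same 0xc3 byte as in the question, only at a different position.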

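When you open the file in binary mode, nothing is decoded at all: you get a bytes object, and every byte value in [0,256) is accepted. So the 0xc3 byte is now read in without error. But although it seems to work, it's still not "correct": you are holding raw bytes, not text, and 0xc3 on its own is not a character but the first byte of a two-byte UTF-8 sequence (ü, for example, is encoded as 0xc3 0xbc).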
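To see why raw bytes are not yet text, here is a small sketch with a hypothetical two-byte sample. It shows that the bytes only become a character once a codec is chosen, and that the wrong codec silently gives the wrong characters:

```python
# Hypothetical sample: the UTF-8 encoding of "ü".
raw = b"\xc3\xbc"

as_ints = list(raw)                # a bytes object is a sequence of integers
as_utf8 = raw.decode("utf-8")      # correct codec: one character
as_latin1 = raw.decode("latin-1")  # wrong codec: two unrelated characters

print(as_ints)     # [195, 188]
print(as_utf8)     # ü
print(as_latin1)   # Ã¼
```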

If your file is indeed UTF-8 encoded, then it can contain multibyte characters, that is, characters whose encoded form spans more than one byte. The German umlauts are among them.

This is where the difference between reading a file as a byte string and properly decoding it becomes apparent.

A character like č will be read in as two bytes but, properly decoded, is one character:

# Use a name that does not shadow the built-in bytes type
data = 'č'.encode('utf-8')

print(len(data))                   # 2 (bytes)
print(len(data.decode('utf-8')))   # 1 (character)
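The same difference shows up when reading an actual file. The following is a sketch under assumptions: the sample sentence and the temporary file are made up for illustration and are not the file from the question.

```python
import os
import tempfile

# Hypothetical sample text containing three non-ASCII German letters.
text = "Grüße aus München"

# Write the sample to a temporary file, encoded as UTF-8.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write(text)
    path = f.name

try:
    # Binary mode: no decoding, the result is a bytes object.
    with open(path, "rb") as f:
        as_bytes = f.read()
    # Text mode with an explicit encoding: the result is a str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()
finally:
    os.remove(path)

print(type(as_bytes).__name__, len(as_bytes))  # bytes 20
print(type(as_text).__name__, len(as_text))    # str 17
```

Both calls succeed, but only the second returns text. Passing `encoding` explicitly also avoids depending on the platform's default codec, which is what produced the ASCII error in the first place.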