Paul Würtz - 4 months ago
Python Question

Python read non-ascii text file

I am trying to load a text file that contains some German letters with:

content=open("file.txt","r").read()


This results in the following error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)


If I modify the file to contain only ASCII characters, everything works as expected.

Apparently, using

content=open("file.txt","rb").read()


or

content=open("file.txt","r",encoding="utf-8").read()


both do the job.

Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?

Answer

ASCII only covers byte values in the range [0,128). If you try to decode a byte outside that range with the ASCII codec, you get that error.
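As a minimal sketch of that failure, the snippet below decodes a UTF-8 byte string with the ASCII codec. The sample word "Grüße" is just an illustration, not the content of the asker's file.

```python
# Sample text with German letters; in UTF-8, ü is 0xc3 0xbc and ß is 0xc3 0x9f.
data = "Grüße".encode("utf-8")

error = None
try:
    data.decode("ascii")          # the ASCII codec rejects any byte >= 0x80
except UnicodeDecodeError as exc:
    error = exc
    print(exc)                    # reports byte 0xc3 and its position

# Decoding with the encoding the bytes were written in works.
print(data.decode("utf-8"))
```

The position in the message is the index of the first offending byte, which is how the error in the question points at byte 26 of the file.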

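When you read the file in binary mode, no decoding happens at all: you get a bytes object, and every byte value in [0,256) is acceptable. So your 0xc3 byte (the first byte of a two-byte UTF-8 sequence, as used for letters such as ä, ö and ü) is now read in without error. But although it seems to work, the result is still not "correct": it is raw bytes, not decoded text.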
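A small sketch of what "reading as bytes" gives you, using the letter ä as an assumed example: the bytes object holds plain integers below 256, and it is not equal to the text it encodes until it is decoded.

```python
raw = "ä".encode("utf-8")         # the same bytes a UTF-8 file would contain

values = list(raw)                # each element is an int in range(256)
print(values)                     # [195, 164], i.e. 0xc3 0xa4

print(raw == "ä")                 # False: bytes and str never compare equal
print(raw.decode("utf-8") == "ä") # True once the bytes are decoded
```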

If your file is indeed UTF-8 encoded, it may well contain multibyte characters, that is, characters whose encoded representation spans more than one byte.

This is where the difference between reading a file as a byte string and properly decoding it becomes apparent.

A character like č will be read in as two bytes, but properly decoded it is one character:

encoded = 'č'.encode('utf-8')         # named `encoded` to avoid shadowing the built-in `bytes`

print(len(encoded))                   # 2
print(len(encoded.decode('utf-8')))   # 1
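To see the same difference on an actual file, here is a rough sketch that writes a temporary UTF-8 file and reads it back in both modes. The temporary file and the sample text are made up for illustration; substitute your own path.

```python
import os
import tempfile

text = "Straße č"  # hypothetical sample content: 8 characters

# Write the sample as UTF-8 to a temporary file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    with open(path, "rb") as f:                    # binary mode: no decoding
        raw = f.read()
    with open(path, "r", encoding="utf-8") as f:   # text mode: decoded to str
        decoded = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, len(raw))          # bytes 10 (ß and č take 2 bytes each)
print(type(decoded).__name__, len(decoded))  # str 8
```

So the two calls in the question do not return the same thing: one returns bytes, the other a str, and they only look alike when printed casually.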