user5507059 user5507059 - 4 months ago 24
Python Question

Python 3: Read UTF-8 file containing German umlaut

I searched and found many similar questions and articles but none would allow me to resolve the issue.

I use Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on Windows 10.

I have a simple text file which is encoded for Windows in UTF-8 like so:

Notepadd++ Screenshot

All I want to do is to read the content of this file into a Python string and display it correctly in, say, the standard console.

Here is a first attempt that fails miserably:

file_name=r'c:\temp\encoding_test.txt'
fh=open(file_name,'r')
f_str=fh.read()
fh.close()
print(f_str)


The print-statement raises an exception:


'charmap' codec can't encode character '\u201e' in position 100: character maps to undefined


Using a debugger, f_str contains the following:


'I would like the following characters to display correctly after reading this file into Python:\n\nÄÖÜäöüß\n'


This is already very puzzling to me. Doesn't Python 3 use UTF-8 as a default everywhere? What other encoding would work? I tried all of the ones Notepad++ supports, none works.

OK, a bit more sophisticated, I tried:

import codecs
file_name=r'c:\temp\encoding_test.txt'
my_encoding='utf-8'
fh=codecs.open(file_name,'r',encoding=my_encoding)
f_str=fh.read().encode(my_encoding)
fh.close()
print(f_str)


This does not raise an exception, at least, but yields


b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'
I


This is a complete mess to me. Can anyone here please help me sort this out?

Answer

You are encoding to bytes after using codecs.open , just printing the data should give you want as you can see when we decode back:

In [31]: s = b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [32]: print(s)
b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [33]: print(s.decode("utf-8"))
I would like the following characters to display correctly after reading this file into Python:

ÄÖÜäöüß

If you are not seeing the correct output then it is your shell encoding that is the problem. The windows console encoding is not utf-8 so where you are running the code from and the shell encoding matters.