mahsa mahsa - 1 month ago 13
Python Question

I cannot read a file because I receive "UnicodeDecodeError: 'utf-8' codec can't decode" error

I have a file and want to convert it to utf8 encoding.

When I try to read it, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 947: invalid continuation byte


My plan was to read the file and then write it back out as UTF-8, but the read itself fails.

Here is my code:

#convert all files into utf_8 format
import os
import io
path_directory="some path string"
directory = os.fsencode(path_directory)
for file in os.listdir(directory):
    file_name=os.fsdecode(file)
    # join with os.path.join so a missing trailing separator does not break the path
    file_path_source=os.path.join(path_directory,file_name)
    file_path_dest="some address to destination file"
    # no encoding given, so the platform default (utf-8 here) is used for decoding
    with open(file_path_source,"r") as f1:
        text=f1.read()
    with io.open(file_path_dest,"w+",encoding='utf8') as f2:
        f2.write(text)
    file_path=""
    file_name=""
    text=None


And the error is:

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-47-59e5e52ddd40> in <module>()
10 with open(file_path,"r") as f1:
11 print(type(f1))
---> 12 text=f1.read()
13 with io.open(file_path.replace("ref_sum","ref_sum_utf_8"),"w+",encoding='utf8') as f2:
14 f2.write(text)

/home/afsharizadeh/anaconda3/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 947: invalid continuation byte


How can I convert my files to UTF-8 without reading them?

Answer

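If the file is not UTF-8, you have to tell Python which encoding it is in when you open it. In Python 3, open() uses the platform's default encoding when none is given (UTF-8 on your system, as the error message shows; Python 2's implicit default was ASCII), so name the file's actual encoding explicitly when reading the source file:

io.open(file_path_source,"r",encoding='ISO-8859-1')

In this case the file appears to be ISO-8859-1 (byte 0xe9 is é in that encoding), so that is the encoding you have to pass.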
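
Putting that together, here is a minimal sketch of the whole conversion, assuming every file in the directory really is ISO-8859-1. The function names convert_to_utf8 and convert_directory and the source_encoding parameter are made up for illustration; they are not from the question.

```python
import os


def convert_to_utf8(src_path, dest_path, source_encoding="iso-8859-1"):
    """Re-encode one text file as UTF-8.

    source_encoding is an assumption: it must match how the file was
    actually written, otherwise characters are silently mis-decoded.
    """
    # Decode with the file's real encoding instead of the platform default.
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    # Write the same text back out as UTF-8.
    with open(dest_path, "w", encoding="utf-8", newline="") as dest:
        dest.write(text)


def convert_directory(src_dir, dest_dir, source_encoding="iso-8859-1"):
    """Convert every regular file in src_dir, writing the results to dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    converted = []
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if os.path.isfile(src_path):
            convert_to_utf8(src_path, os.path.join(dest_dir, name), source_encoding)
            converted.append(name)
    return converted
```

Each file is written to a separate destination directory under its own name, so the originals are never overwritten. Note that decoding as ISO-8859-1 never raises an error, because every byte value maps to some character; a run without exceptions therefore does not prove the guess was right, so check a few converted files by eye.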