rivfaader - 21 days ago
Python Question

Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

I've been parsing some docx files (UTF-8 encoded XML) containing special characters (Czech alphabet). Printing to stdout works fine, but writing the same data to a file fails:


Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)


Although I explicitly cast the word variable to the unicode type (type(word) returned unicode) and also tried to encode it with .encode('utf-8'), I'm still stuck with this error.

Here is a sample of the code as it looks now:

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...


I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...


Even the combination of these two:

word = unicode(word)
word = word.encode('utf-8')


I was kind of desperate, so I even tried to encode the word variable inside the ofile.write() call:


ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')


I would appreciate any hints about what I'm doing wrong.

Answer

ofile is a byte stream, but you are writing a character (unicode) string to it. Python 2 tries to handle this mismatch by implicitly encoding the string to bytes with the default ASCII codec, which only works when every character is ASCII. Since word contains non-ASCII characters, it fails:

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)

Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L
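
Applied to the loop from the question, the same idea might look like the sketch below. The write_features helper, the output file name, and the sample word_list are assumptions made for illustration; only io.open and the feat line come from the post. The sample words use escapes so the snippet does not depend on the source file's encoding, and it should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def write_features(path, word_list):
    """Write one <feat> line per word to a UTF-8 encoded text file.

    Hypothetical helper: the name and signature are not from the question.
    Every item in word_list is expected to be a unicode string.
    """
    # io.open returns a text stream that encodes unicode to UTF-8 on write.
    with io.open(path, 'wt', encoding='utf-8') as ofile:
        for word in word_list:
            ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')


# Sample Czech words (assumed data), written with escapes to stay ASCII-safe.
word_list = [u'p\u0159\xedklad', u'\u010de\u0161tina']
out_path = os.path.join(tempfile.mkdtemp(), 'out.xml')
write_features(out_path, word_list)

# Read the file back as text to confirm the round trip.
with io.open(out_path, 'rt', encoding='utf-8') as check:
    written = check.read()
```

The point of the design is that word stays unicode all the way to write(); no .encode() call is needed anywhere in the loop, because the stream does the encoding.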

Alternatively, you can use codecs.open, which has pretty much the same interface, or keep the byte stream and encode each complete string yourself with .encode('utf-8') before writing it.
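
A rough sketch of both alternatives follows. The file names and the sample word are assumptions for illustration. Note that the manual variant encodes the finished line, not just word: in Python 2, concatenating a unicode literal with an already encoded byte string makes Python decode those bytes with the ASCII codec first, which is likely why the .encode('utf-8') attempts in the question still failed.

```python
import codecs
import os
import tempfile

word = u'p\u0159\xedklad'  # sample unicode word (assumed data)
line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
tmp_dir = tempfile.mkdtemp()

# Option 1: codecs.open wraps the file and encodes unicode on write.
codecs_path = os.path.join(tmp_dir, 'codecs_out.xml')
with codecs.open(codecs_path, 'w', encoding='utf-8') as ofile:
    ofile.write(line)

# Option 2: keep a byte stream and encode the finished line yourself.
# The whole line is built as unicode first and encoded exactly once.
manual_path = os.path.join(tmp_dir, 'manual_out.xml')
with open(manual_path, 'wb') as ofile:
    ofile.write(line.encode('utf-8'))

# Read both files back as raw bytes so they can be compared.
with open(codecs_path, 'rb') as f:
    codecs_bytes = f.read()
with open(manual_path, 'rb') as f:
    manual_bytes = f.read()
```

Both variants should produce the same bytes on disk; the difference is only whether the stream or your own code performs the encoding.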
