pie3636 - 18 days ago
Python Question

Unicode-escaped file processing error

I have a raw text file containing only the following line, and no newline:

Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439


The characters are escaped as shown above, meaning that \u05E9 is really a backslash followed by 5 alphanumeric characters (and not a Unicode character). I am trying to decode the file using the following code:

import codecs

with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(line)


Using print is not possible here; see the comments.

Running it gives me the following error:

Traceback (most recent call last):
  File "terms2.py", line 5, in <module>
    for line in input:
  File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape


What is going on?

I am running Python 3.5.1 on Windows 8.1, and the code seems to work for most other Unicode characters (this line is the first one to cause the crash).

See edit history for the original question.

Answer

It seems that the data read by the decoder is truncated after character 72 (index 71 when counting from 0), which cuts a \uXXXX escape in half. That looks related to this bug.

The following code produces the same error as in your example:

codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline()
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(72)

Increasing the readline size above the actual size of the input, or setting it to -1, eliminates the error:

codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(1000)
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(-1)

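Evidently, for line in input: obtains the line to be decoded by calling readline() with no size argument, which makes the codecs stream reader fetch only 72 bytes at a time. The decoder is then handed a chunk that can end in the middle of an escape sequence, and it rejects that chunk as truncated.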
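To see why a cut in the middle of an escape is fatal, here is a minimal sketch that decodes a short byte string twice: once complete and once with the last two bytes removed. The sample data and the cut point are made up for illustration and are not taken from the file in the question; try_decode is a hypothetical helper name.

```python
import codecs

# A short ASCII byte string holding two complete \uXXXX escapes
# (illustrative data, not the real file contents).
complete = rb"Q853 \u0410\u043D"

# The same data cut two bytes short, so the last escape is incomplete.
# The cut point is arbitrary and only simulates a chunk boundary.
truncated = complete[:-2]


def try_decode(data):
    """Return the decoded text, or the decoder's error reason on failure."""
    try:
        return codecs.decode(data, "unicode_escape")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# ascii() keeps the output printable on consoles that cannot show Cyrillic.
print(ascii(try_decode(complete)))
print(ascii(try_decode(truncated)))
```

The first call decodes normally, while the second reports an error because the final escape is missing its last hex digits, which is the same kind of failure as in the traceback.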

So here are a couple of workarounds:

Workaround 1:

import codecs

with open("wikidata-terms20.nt", 'r') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(codecs.decode(line, 'unicode_escape'))

Workaround 2:

import codecs

with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input.readlines():
            output.write(line)
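A variation on Workaround 1, offered only as a sketch: read the input in binary mode so every complete line reaches the decoder as bytes, and write the output explicitly as UTF-8. The function name decode_escaped_file and the choice of UTF-8 for the output are assumptions, not something from the question. It also assumes the input is plain ASCII apart from the escapes, because unicode_escape treats any other byte as Latin-1 and interprets other backslash escapes (such as \n) as well.

```python
import codecs


def decode_escaped_file(src_path, dst_path):
    """Decode \\uXXXX escapes line by line and write the result as UTF-8.

    Hypothetical helper: the name and the UTF-8 output encoding are
    assumptions made for this sketch.
    """
    # Binary mode yields whole lines as bytes, so no escape is split.
    # newline="" stops the text layer from rewriting line endings that
    # are already present in the decoded data.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            dst.write(codecs.decode(raw_line, "unicode_escape"))
```

It would be called as decode_escaped_file("wikidata-terms20.nt", "wikidata-terms3.nt"). Setting the output encoding explicitly avoids depending on the platform default, which on Windows may not be able to represent Cyrillic text.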