Jonathan Hartley - 9 days ago
Python Question

How do I decode unicode one line at a time in Python 2.7?

The correct way to load unicode text from Python 2.7 is something like:

content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)

(Update: No it isn't. See the answers.)

However, if the file is very large, I might want to read, decode and process it one line at a time, so that the whole file is never loaded into memory at once. Something like:

for line in open('filename'):
    process(line.decode('encoding'))

The for loop's iteration over the open filehandle is a generator that reads one line at a time.

This doesn't work though, because if the file is utf32 encoded, for example, then the bytes in the file (in hex) look something like:

hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)

And the split into lines done by the for loop splits on the 0a byte of the \n character, resulting in (in hex):

lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000

So part of the \n character is left at the end of line 1, and the remaining three bytes end up in line 2 (followed by whatever text is actually in line 2). Calling decode() on either of these lines understandably results in a UnicodeDecodeError:

UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
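
The failure can be reproduced in memory without a file. This is only a minimal sketch: the sample text, the 'utf-32-le' codec name (chosen to match the BOM-less little-endian bytes in the hex dump) and the helper try_decode are assumptions made for illustration.

```python
# Reproduce the problem in memory: split UTF-32 bytes on the single byte 0x0a.
# Little-endian with no BOM, to match the hex dump in the question.
data = u'hello\nworld\n'.encode('utf-32-le')

# This is effectively what a byte-oriented line reader does.
raw_lines = data.split(b'\n')


def try_decode(chunk):
    """Return the decoded text, or the error reason if decoding fails."""
    try:
        return chunk.decode('utf-32-le')
    except UnicodeDecodeError as exc:
        return 'error: %s' % exc.reason


# A file iterator keeps the terminating 0x0a byte on the first line,
# so that line is 21 bytes long, which is not a multiple of 4.
first = raw_lines[0] + b'\n'

# The second chunk starts with the three leftover zero bytes of the newline,
# so every following character is misaligned.
results = [try_decode(first), try_decode(raw_lines[1])]
```

Both entries in results are error messages rather than text, because neither chunk is aligned to four-byte code units any more.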

So, obviously enough, splitting a unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead I should be splitting on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters - and this decoding of the full stream is exactly the operation I'm trying to avoid.

This can't be an uncommon requirement. What's the correct way to handle it?


How about trying something like:

import codecs

for line in codecs.open("filename", "rt", "utf32"):
    print line

I think this should work.

The codecs module should do the translation for you.
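
To check that idea on a small scale, here is a rough round-trip sketch. The helper name iter_decoded_lines, the temporary file and the sample text are all made up for the example. It uses mode 'r', since codecs.open switches the file to binary mode itself when an encoding is given.

```python
import codecs
import os
import tempfile


def iter_decoded_lines(path, encoding):
    """Yield unicode lines from path, decoding as the file is read."""
    # codecs.open returns a reader that decodes the data it reads, so the
    # line splitting happens on decoded characters, not on raw 0x0a bytes.
    with codecs.open(path, 'r', encoding) as handle:
        for line in handle:
            yield line


# Round-trip check on a small temporary UTF-32 file (hypothetical sample data).
fd, sample_path = tempfile.mkstemp()
os.close(fd)
try:
    with open(sample_path, 'wb') as out:
        out.write(u'hello\nworld\n'.encode('utf-32'))
    decoded = list(iter_decoded_lines(sample_path, 'utf-32'))
finally:
    os.remove(sample_path)
```

Here decoded holds the two original lines, each still ending in its newline. For a large file you would loop over iter_decoded_lines directly instead of building a list.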