The correct way to load unicode text from Python 2.7 is something like:
content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)  # process() stands in for whatever per-line work is needed
(Edit: No it isn't. See the answers.)
However, if the file is very large, I might want to read, decode and process it one line at a time, so that the whole file is never loaded into memory at once. Something like:
for line in open('filename'):
    process(line.decode('encoding'))  # process() stands in for the per-line work
The for loop's iteration over the open filehandle behaves like a generator that reads one line at a time.
This doesn't work though, because if the file is utf32 encoded, for example, then the bytes in the file (in hex) look something like:
hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)
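The byte layout above can be checked directly. This is a small sketch that assumes little-endian UTF-32 without a byte-order mark (the utf-32-le codec), since that is what the hex dump shows; the variable names are only illustrative.

```python
import binascii

# Encode without a BOM so the bytes match the hex dump above.
data = u'hello\n'.encode('utf-32-le')

# Group the hex digits four bytes (one UTF-32 code unit) at a time.
hex_digits = binascii.hexlify(data).decode('ascii')
units = [hex_digits[i:i + 8] for i in range(0, len(hex_digits), 8)]
print(' '.join(units))
```

Each character occupies four bytes, and the only non-zero byte of the newline is 0a.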
And the split into lines done by the for loop splits on the 0a byte of the \n character, resulting in (in hex):
lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000
So part of the \n character is left at the end of line 1, and the remaining three bytes end up in line 2 (followed by whatever text is actually in line 2). Calling decode() on either of these lines understandably results in a UnicodeDecodeError:
UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
So, obviously enough, splitting a unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead I should be splitting on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters - and this decoding of the full stream is exactly the operation I'm trying to avoid.
This can't be an uncommon requirement. What's the correct way to handle it?
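For reference, one approach that seems to fit is to let the standard library decode the stream incrementally. The sketch below assumes io.open (available in Python 2.6+ and the same as the built-in open in Python 3), which returns a text-mode file object that decodes in chunks and splits on decoded newline characters. The helper name iter_decoded_lines and the temporary file are only for illustration.

```python
import io
import os
import tempfile


def iter_decoded_lines(path, encoding):
    """Yield unicode lines from path, decoding the file incrementally."""
    # The text-mode file object decodes the byte stream chunk by chunk,
    # so the whole file is never held in memory at once.
    with io.open(path, 'r', encoding=encoding) as handle:
        for line in handle:
            yield line


# Usage with a throwaway UTF-32 file (the 'utf-32' codec writes a BOM
# on encoding and consumes it again on decoding).
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as out:
    out.write(u'hello\nworld\n'.encode('utf-32'))

decoded = list(iter_decoded_lines(path, 'utf-32'))
os.remove(path)
print(decoded)
```

Because line splitting happens after decoding, the four-byte newline is never cut in half.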