Rob Rob - 4 months ago 12
Python Question

python unicode errors convert to printed values

If I have some unicode like this:

'\x00B\x007\x003\x007\x00-\x002\x00,\x001\x00P\x00W\x000\x000\x009\x00,\x00N\x00O\x00N\x00E\x00,\x00C\x00,\x005\x00,\x00J\x00,\x00J\x00,\x002\x009\x00,\x00G\x00A\x00R\x00Y\x00,\x00 \x00W\x00I\x00L\x00L\x00I\x00A\x00M\x00S\x00,\x00 \x00P\x00A\x00R\x00E\x00N\x00T\x00I\x00,\x00 \x00F\x00I\x00N\x00N\x00E\x00Y\x00 \x00&\x00 \x00L\x00E\x00W\x00I\x00S\x00,\x00U\x00S\x00,\x001\x00\r\x00'


and it's read in from a csv in string format, but I'd like to convert it to a human-readable form. It works when I print it, but I can't figure out how to save it to a variable in human-readable form. What is the best approach?

Answer

You don't have Unicode. Not yet. You have a series of bytes, and those bytes use the UTF-16 encoding. You need to decode those bytes first:

data.decode('utf-16-be')

Printing it appears to work only because your console ignores the \x00 null byte that makes up half of each two-byte UTF-16 code unit.

Your data is missing a byte order mark (BOM), so I had to use utf-16-be, the big-endian variant of UTF-16, on the assumption that you cut the data at the right byte. If you didn't, it could be little endian instead.
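
To see why the byte order matters, here is a minimal Python 3 sketch (the rest of this answer uses Python 2). The `raw` value is a short made-up sample, not the data from the question; decoding the same bytes with each variant gives very different text.

```python
# Python 3 sketch. `raw` is an illustrative sample, not the asker's data.
raw = b"\x00B\x007\x003\x007"

# Each character is two bytes; the byte order decides which byte is "high".
as_big_endian = raw.decode("utf-16-be")
as_little_endian = raw.decode("utf-16-le")

print(as_big_endian)            # B737
print(ascii(as_little_endian))  # unrelated code points, not the intended text
```

If the wrong variant is picked, decoding usually still succeeds, so the result has to be checked by eye.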

As it is, I had to remove the last \x00 null byte to make it decode. You pasted an odd rather than an even number of bytes, cutting one UTF-16 code unit (2 bytes each) in half:

>>> s = '\x00B\x007\x003\x007\x00-\x002\x00,\x001\x00P\x00W\x000\x000\x009\x00,\x00N\x00O\x00N\x00E\x00,\x00C\x00,\x005\x00,\x00J\x00,\x00J\x00,\x002\x009\x00,\x00G\x00A\x00R\x00Y\x00,\x00 \x00W\x00I\x00L\x00L\x00I\x00A\x00M\x00S\x00,\x00 \x00P\x00A\x00R\x00E\x00N\x00T\x00I\x00,\x00 \x00F\x00I\x00N\x00N\x00E\x00Y\x00 \x00&\x00 \x00L\x00E\x00W\x00I\x00S\x00,\x00U\x00S\x00,\x001\x00\r\x00'
>>> s[:-1].decode('utf-16-be')
u'B737-2,1PW009,NONE,C,5,J,J,29,GARY, WILLIAMS, PARENTI, FINNEY & LEWIS,US,1\r'
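
The same trimming step can be written without hard-coding `[:-1]`. This is a Python 3 sketch under the assumption that the data is big endian and that only the tail was cut; `decode_truncated_utf16_be` is a hypothetical helper name, not part of any library.

```python
def decode_truncated_utf16_be(data: bytes) -> str:
    """Decode big-endian UTF-16, dropping a trailing half code unit if present.

    Hypothetical helper: assumes big-endian data whose tail may have been
    cut off mid code unit.
    """
    # UTF-16 code units are 2 bytes each, so keep only an even number of bytes.
    usable = len(data) - (len(data) % 2)
    return data[:usable].decode("utf-16-be")


# Illustrative sample with a dangling \x00 at the end.
sample = b"\x00U\x00S\x00,\x001\x00\r\x00"
print(repr(decode_truncated_utf16_be(sample)))  # 'US,1\r'
```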

The file you read this from probably contains the BOM as the first two bytes. If so, just tell whatever you use to read this data to use utf-16 as the codec, and it'll figure out the right variant from those first bytes.
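
As a rough Python 3 illustration of that BOM detection, the sketch below builds two small byte strings by hand (the `text` sample is made up) and decodes both with the generic `utf-16` codec, which reads the byte order from the BOM and strips it.

```python
import codecs

# Illustrative sample text, not the asker's file contents.
text = "B737-2,1PW009"

# Prefix each encoding with the matching byte order mark.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic codec picks the right variant from the first two bytes.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")

print(decoded_be == decoded_le == text)  # True
```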

If you are using Python 2 you'd want to study the Examples section of the csv module for code that can re-code your data in a form suitable for that module; if you include the UnicodeReader from that section you'd use it like this:

# open in binary mode so UnicodeReader receives the raw UTF-16 bytes
with open(yourdatafile, 'rb') as inputfile:
    reader = UnicodeReader(inputfile, encoding='utf-16')
    for row in reader:
        # row is now a list with unicode strings
        print row

Demo:

>>> from StringIO import StringIO
>>> import codecs
>>> f = StringIO(codecs.BOM_UTF16_BE + s[:-1])
>>> r = UnicodeReader(f, encoding='utf-16')
>>> next(r)
[u'B737-2', u'1PW009', u'NONE', u'C', u'5', u'J', u'J', u'29', u'GARY', u' WILLIAMS', u' PARENTI', u' FINNEY & LEWIS', u'US', u'1']

If you are using Python 3, simply set the encoding parameter of the open() function to utf-16 and use the csv module as-is.
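
A minimal Python 3 sketch of that approach follows. It writes a small UTF-16 file to a temporary directory first so it is self-contained; the file name and the sample row are made up, and `newline=""` is the setting the csv documentation recommends for files passed to the module.

```python
import csv
import os
import tempfile

# Illustrative row, not the asker's full data.
sample = "B737-2,1PW009,NONE,C\r\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv")  # hypothetical file name

    # Writing with the generic utf-16 codec adds a BOM automatically.
    with open(path, "w", encoding="utf-16", newline="") as outfile:
        outfile.write(sample)

    # Reading with utf-16 detects the byte order from that BOM.
    with open(path, encoding="utf-16", newline="") as infile:
        rows = list(csv.reader(infile))

print(rows)  # [['B737-2', '1PW009', 'NONE', 'C']]
```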