nomoral nomoral - 5 months ago 23
Python Question

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

how does the unicode thing works on python2? i just dont get it.

here i download data from a server and parse it for JSON.

Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/hubs/poll.py", line 92, in wait
readers.get(fileno, noop).cb(fileno)
File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/greenthread.py", line 202, in main
result = function(*args, **kwargs)
File "android_suggest.py", line 60, in fetch
suggestions = suggest(chars)
File "android_suggest.py", line 28, in suggest
return [i['s'] for i in json.loads(opener.open('https://market.android.com/suggest/SuggRequest?json=1&query='+s+'&hl=de&gl=DE').read())]
File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
value, end = iterscan(s, idx=end, context=context).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
value, end = iterscan(s, idx=end, context=context).next()
File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
return scanstring(match.string, match.end(), encoding, strict)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data


thank you!!

EDIT: the following string causes the error:
'[{"t":"q","s":"abh\xf6ren"}]'
.
\xf6
should be decoded to
ö
(abhören)

Answer

The string you're trying to parse as a JSON is not encoded in UTF-8. Most likely it is encoded in ISO-8859-1. Try the following:

json.loads(unicode(opener.open(...), "ISO-8859-1"))

That will handle any umlauts that might get in the JSON message.

You should read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I hope that it will clarify some issues you're having around Unicode.

Comments