hash_ir - 4 months ago
JSON Question

Weird characters printed during json key-value printing

I have written a Python program that fetches movie/TV show information using the OMDb API (http://www.omdbapi.com/).

The output is garbled when I print the running years of a TV show. Here's the part of the code where this happens:

keys = ['Title', 'Year', 'imdbRating', 'Director', 'Actors', 'Genre', 'totalSeasons']

def jsonContent(self):
    payload = {'t': self.title}
    movie = requests.get(self.url, params=payload)
    return movie.json()

def getInfo(self):
    data = self.jsonContent()
    for key, value in data.items():
        if key in keys:
            print key.encode('utf-8') + ' : ' + value.encode('utf-8')


For example if I search for How I Met Your Mother, it prints out like this:

totalSeasons : 9
Title : How I Met Your Mother
imdbRating : 8.4
Director : N/A
Actors : Josh Radnor, Jason Segel, Cobie Smulders, Neil Patrick Harris
Year : 2005ΓÇô2014 #problem here
Genre : Comedy, Romance


How can I fix this?

Answer

You are encoding Unicode text to UTF-8 before printing:

print key.encode('utf-8') + ' : ' + value.encode('utf-8')

Your console or terminal, however, is not configured to interpret UTF-8. It receives those bytes and displays characters using a different codec altogether.
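To see which codec Python has picked up for your console, something like the following may help. This is only a sketch: `console_encoding` is a hypothetical helper name, not part of the question's code, and the value it reports depends entirely on your environment.

```python
import sys

def console_encoding(stream=None):
    """Return the name of the encoding Python uses when printing text to `stream`."""
    stream = sys.stdout if stream is None else stream
    # `encoding` can be missing or None (for example when output is redirected
    # under Python 2), so fall back to a plain label instead of failing.
    return getattr(stream, 'encoding', None) or 'unknown'

print(console_encoding())
```

If this reports something other than UTF-8, bytes produced by `.encode('utf-8')` will not be displayed as intended.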

Your value contains a U+2013 EN DASH character (\u2013), which encodes to UTF-8 as the 3 bytes E2 80 93. Your terminal appears to decode those bytes as Windows Codepage 437 instead:

>>> value = u'2005\u20132014'
>>> print value
2005–2014
>>> print value.encode('utf8').decode('cp437')
2005ΓÇô2014
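
The same round trip can be broken down byte by byte. The sketch below only uses the standard `encode`/`decode` methods; the variable names are illustrative.

```python
# The en dash on its own, as Unicode text.
dash = u'\u2013'

# Step 1: what the question's code sends to the console (UTF-8 bytes).
dash_bytes = dash.encode('utf-8')
hex_bytes = ' '.join('%02X' % b for b in bytearray(dash_bytes))

# Step 2: what a CP437 console makes of those bytes. Each byte is
# treated as a separate character, so one dash turns into three.
dash_misread = dash_bytes.decode('cp437')

print(hex_bytes)          # E2 80 93
print(len(dash_misread))  # 3
```

The three resulting characters are U+0393, U+00C7 and U+00F4, which is exactly the `ΓÇô` sequence in the garbled output.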

Either re-configure your console or terminal, or set the PYTHONIOENCODING environment variable to use an error handler:

PYTHONIOENCODING=cp437:replace

The :replace part will tell Python to encode to cp437 but to use placeholders for characters it can't handle. You'll get a question mark instead:

>>> print value.encode('cp437', 'replace')
2005?2014
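
For comparison, here is a small sketch of what the other standard error handlers do with the same value. It encodes explicitly only to make the effect visible. The handler names are the built-in codec error handlers, which should also be accepted after the colon in PYTHONIOENCODING.

```python
value = u'2005\u20132014'

# Encode with each handler and keep the resulting bytes for inspection.
results = {}
for handler in ('replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace'):
    results[handler] = value.encode('cp437', handler)
    print('%-18s %r' % (handler, results[handler]))

# Without a handler the default is 'strict', which raises instead.
try:
    value.encode('cp437')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Here `replace` gives `2005?2014`, `ignore` gives `20052014`, `xmlcharrefreplace` gives `2005&#8211;2014`, and `backslashreplace` gives `2005\u20132014`.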

Note that I have to encode to CP437 explicitly in all these examples. You don't, as Python has detected your console's encoding and will encode for you automatically. Just stick to printing Unicode directly, without calling .encode() first.
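
A minimal sketch of what printing Unicode directly could look like for the loop in the question. `format_info`, `show` and the `sample` dictionary are made up for illustration (the real data would come from `movie.json()`). The code uses `print(...)` with a single argument so that it runs on Python 2 and 3 alike.

```python
keys = ['Title', 'Year', 'imdbRating', 'Director', 'Actors', 'Genre', 'totalSeasons']

def format_info(data):
    """Build one 'key : value' text line per wanted key, with no manual encoding."""
    return [u'{0} : {1}'.format(key, value)
            for key, value in data.items() if key in keys]

def show(data):
    """Print the lines as text and let Python encode them for the console."""
    for line in format_info(data):
        print(line)

# Hypothetical stand-in for the decoded JSON response.
sample = {
    u'Title': u'How I Met Your Mother',
    u'Year': u'2005\u20132014',
    u'Rated': u'TV-14',  # not in `keys`, so it is skipped
}
lines = format_info(sample)
```

Calling `show(sample)` hands Unicode text to `print`. Combined with the PYTHONIOENCODING setting above, a character the console cannot display comes out as a placeholder rather than as three wrong characters.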

Another alternative is to use the Unidecode package to replace non-ASCII characters with close ASCII approximations; it'll replace the en dash with an ASCII hyphen:

>>> from unidecode import unidecode
>>> value
u'2005\u20132014'
>>> unidecode(value)
'2005-2014'