Asked by lemontree, 1 year ago
Python: print unicode string stored as a variable

In Python (3.5.0), I'd like to print a string containing Unicode symbols (more precisely, IPA symbols retrieved from Wiktionary in JSON format) to the screen or a file, e.g.

print('\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n')

correctly prints

ˈwɔːtəˌmɛlən

- however, whenever I use the string in a variable, e.g.

ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
print(ipa)

it just prints out the string as-is, i.e.

\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n

which isn't of much help.

I have tried several workarounds to avoid this, but none of them helped.

I cannot work with a hard-coded string literal either, since I am already retrieving the string as a variable (as the result of a regex match) and at no point in my code enter the actual literals.

It may well be that I made a mistake during the conversion from the JSON result; so far I have converted the byte stream into a string using str(), extracted the IPA part via regex (and done a replace on the double backslashes) and stored it in a string variable.


This is the code I had so far:

def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsonmatch: " + jsonmatch)
    ipa = jsonmatch.replace("\\\\", "\\")
    #print("ipa: " + ipa)

After modification with json.loads():

def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsonmatch: " + jsonmatch)
    jsonstr = "\"" + jsonmatch + "\""
    #print("jsonstr: " + jsonstr)
    jsonloads = json.loads(jsonstr)
    #print("jsonloads: " + jsonloads)

For both versions, calling the function with a word such as "watermelon" still gives me the escaped string as-is rather than the decoded symbols.

Is there any way to have the string printed/written as already decoded, even when passed as a variable?

Answer

You don't have this value:

ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'

because that value prints just fine:

>>> ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
>>> print(ipa)
ˈwɔːtəˌmɛlən

You at the very least have literal \ and u characters:

ipa = '\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n'

Those \\ sequences are one backslash each, but escaped. Since this is JSON, the string is probably also surrounded by double quotes:

ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'

Because that string has literal backslashes, that is exactly what is being printed:

>>> ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ipa)
"\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n"
>>> ipa[1]
'\\'
>>> print(ipa[1])
\
>>> ipa[2]
'u'

Note how the value echoed shows a string literal you can copy and paste back into Python, so the \ character is escaped again for you.

That value is valid JSON, which also uses \uhhhh escape sequences. Decode it as JSON:

import json

json.loads(ipa)

Now you have a proper Python value:

>>> import json
>>> json.loads(ipa)
'ˈwɔːtəˌmɛlən'
>>> print(json.loads(ipa))
ˈwɔːtəˌmɛlən

Note that in Python 3, almost all codepoints are printed directly even when repr() creates a literal for you. The json.loads() result directly shows all text in the value, even though the majority is non-ASCII.
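To check which of the two kinds of string you are holding without relying on how the terminal renders it, comparing lengths can help. This is a minimal sketch; the names escaped and decoded are only illustrative:

```python
import json

# JSON text: literal backslash-u sequences, wrapped in double quotes.
escaped = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'

# Decoding as JSON turns each six-character escape into one codepoint.
decoded = json.loads(escaped)

print(len(escaped))                       # 49: 7 escapes * 6 + 5 letters + 2 quotes
print(len(decoded))                       # 12: 7 IPA symbols + 5 letters
print('\\' in escaped, '\\' in decoded)   # True False
```

If the length of your string is close to six times what you expect, it most likely still holds undecoded escape sequences.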

This value does not contain literal backslashes or u characters:

>>> result = json.loads(ipa)
>>> result[0]
'ˈ'
>>> result[1]
'w'

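As a side note, when debugging issues like this, you really want to use the repr() and ascii() functions so you get representations that let you properly reproduce the value of a string:

>>> print(repr(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ascii(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(repr(result))
'ˈwɔːtəˌmɛlən'
>>> print(ascii(result))
'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'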

Note that only ascii() on a string with actual Unicode codepoints beyond the Latin-1 range produces actual \uhhhh escape sequences.
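As a small illustration of that last point, here is how ascii() escapes codepoints from different ranges. The sample characters are arbitrary picks, not taken from the question:

```python
# U+00E9 lies inside Latin-1, U+02C8 lies beyond it but inside the BMP,
# and U+1F600 lies outside the BMP altogether.
latin1 = ascii('\u00e9')
bmp = ascii('\u02c8')
astral = ascii('\U0001f600')

print(latin1)  # '\xe9'
print(bmp)     # '\u02c8'
print(astral)  # '\U0001f600'
```

So a \xhh or \Uhhhhhhhh escape in ascii() output is just as much a real non-ASCII codepoint as a \uhhhh one.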

As for your update, just parse the whole response as JSON, and load the right data from that. Your code instead converts the bytes response body to a repr() (the str() call on bytes does not decode the data; instead you doubly escape escapes this way). Decode the bytes from the network as UTF-8, then feed that data to json.loads():

import json
import re
import urllib.request
from urllib.parse import quote_plus

baseurl = "https://en.wiktionary.org/w/api.php?action=query&titles={}&prop=revisions&rvprop=content&format=json"

def getIPAen(word):
    url = baseurl.format(quote_plus(word))
    jsondata = urllib.request.urlopen(url).read().decode('utf8')
    data = json.loads(jsondata)
    for page in data['query']['pages'].values():
        for revision in page['revisions']:
            if 'IPA' in revision['*']:
                ipa = re.search(r"{IPA\|/(.*?)/\|", revision['*']).group(1)
                print(ipa)

Note that I also make sure to quote the word value into the URL query string.
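To see why calling str() on the response body doubles the escapes, an offline sketch may help. The body bytes below are a made-up stand-in for what urlopen(url).read() returns, not real Wiktionary output:

```python
import json

# Hypothetical response body: JSON text as UTF-8 bytes, with single backslashes.
body = b'{"ipa": "\\u02c8w\\u0254"}'

as_repr = str(body)            # repr of the bytes object: b'...' with backslashes doubled
as_text = body.decode('utf8')  # the actual JSON text

print(as_repr)                     # b'{"ipa": "\\u02c8w\\u0254"}'
print(as_text)                     # {"ipa": "\u02c8w\u0254"}
print(json.loads(as_text)['ipa'])  # ˈwɔ
```

The str() result even carries the b'...' wrapper, which is why it cannot be parsed as JSON and why regex matches on it pick up doubled backslashes.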

The getIPAen() function above prints out any IPA it finds:

>>> getIPAen('watermelon')
>>> getIPAen('chocolate')