thomas mann - 4 months ago
Python Question

Python ascii utf unicode

When I parse this XML with

p = xml.parsers.expat.ParserCreate()

<name>Fortuna D&#252;sseldorf</name>

The character parsing event handler includes u'\xfc'.

How can u'\xfc' be turned into u'ü'?

This is the main question in this post; the rest just shows further (ranting) thoughts about it.

Isn't Python unicode broken, since u'\xfc' shall yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII doesn't work either.

The only thing that I found works is: (This cannot be intended, right?)

exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')

Replacing 8859 with utf-8 fails! What is the point of that?

Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.

Unicode is no magic. Why do so many people here have issues?

The underlying problem of unicode conversion is dirt simple:

One bidirectional lookup table '\xFC' <-> u'ü'

unicode( 'Fortuna D\xfcsseldorf' )

What is the reason why the creators of Python think it is better to show an error instead of simply producing this:
u'Fortuna Düsseldorf'

Also, why did they make it not reversible?

>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'


You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.

In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worrying about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
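The round trip between value and representation can be checked directly. The following is a minimal sketch, assuming Python 3, where the built-in ascii() produces the same escaped form that Python 2 shows when echoing a unicode value; the variable names are only illustrative.

```python
import ast

# The value: an 18-character text string whose 10th character is U+00FC.
value = u'Fortuna D\xfcsseldorf'

# ascii() returns the ASCII-only representation, written as a string literal
# with the non-ASCII character replaced by a backslash escape.
escaped = ascii(value)
print(escaped)

# Evaluating that literal again reproduces the original value exactly.
round_tripped = ast.literal_eval(escaped)
print(round_tripped == value)
```

The escaped text is longer than the value because the single character ü is spelled out as the four ASCII characters \xfc; the value itself never changed.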

As such ü has been replaced by \xfc, because that is the escape sequence for the codepoint U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
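To see that the escape and the character are the same thing, the codepoint can be inspected with the standard unicodedata module. This is a small sketch assuming Python 3 (chr() returns text there); the names are illustrative.

```python
import unicodedata

char = u'\xfc'

# The numeric codepoint: 252 decimal, 0xFC hexadecimal. This is the same
# number as in the XML character reference &#252; from the question.
codepoint = ord(char)

# The official Unicode name of that codepoint.
name = unicodedata.name(char)

# Different spellings of the same one-character string.
same_char = (u'\xfc' == u'\u00fc' == chr(252))

print(codepoint, hex(codepoint), name, same_char)
```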

If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:

>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
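Applied to the parser from the question, the same reasoning means the handler already receives the right text. The sketch below assumes the character data is simply collected into a list; the handler name collect_text and the list chunks are hypothetical, and the chunks are joined because expat may deliver the text in several pieces.

```python
import xml.parsers.expat

chunks = []

def collect_text(data):
    # expat may call this several times for one element,
    # so gather the pieces instead of assuming a single call.
    chunks.append(data)

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = collect_text
parser.Parse(b'<name>Fortuna D&#252;sseldorf</name>', True)

# The joined text contains the real character U+00FC, not an escape.
name = u''.join(chunks)
print(len(name), name == u'Fortuna D\xfcsseldorf')
```

No further conversion step is needed after parsing; printing name on a correctly configured terminal shows the ü glyph.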

If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:

>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
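Encoding and decoding are reversible as long as the same codec is used in both directions. The following sketch, with illustrative variable names, also shows why decoding the Latin-1 bytes from the question as UTF-8 fails: the lone byte 0xFC is not valid UTF-8.

```python
text = u'Fortuna D\xfcsseldorf'

# UTF-8 needs two bytes for U+00FC; Latin-1 (ISO 8859-1) needs one.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

# Decoding with the matching codec gives back the original text.
decoded = utf8_bytes.decode('utf-8')

# Decoding with the wrong codec raises an error instead of guessing.
try:
    latin1_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(latin1_bytes), decoded == text, wrong_codec_failed)
```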

The alternative is for you to upgrade to Python 3; there, repr() only escapes codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.). The new ascii() function still gives you the Python 2 repr() behaviour.
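The difference between the two functions can be sketched as follows, assuming Python 3. The results are kept in variables and compared rather than printed, since printing the unescaped form depends on the terminal encoding.

```python
value = 'Fortuna D\xfcsseldorf'

# repr() keeps printable non-ASCII characters as they are...
shown_by_repr = repr(value)

# ...while ascii() escapes everything outside ASCII, like Python 2's repr().
shown_by_ascii = ascii(value)

# Non-printable characters such as a tab are still escaped by repr().
control = repr('tab\there')

print(shown_by_ascii, control)
print('\\xfc' in shown_by_repr, '\\xfc' in shown_by_ascii)
```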