When I parse this XML with
p = xml.parsers.expat.ParserCreate()
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
unicode( 'Fortuna D\xfcsseldorf' )
>>> u'Fortuna Düsseldorf'.encode('utf-8')
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing a value in the interactive interpreter shows you the result of calling repr() on that value.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worrying about how other systems might handle non-ASCII codepoints. As such, the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh, \uhhhh or \Uhhhhhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
Here, ü has been replaced by \xfc, because U+00FC is the Unicode codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
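To make the distinction concrete, here is a minimal sketch (the variable names are only illustrative) that checks that the escaped literal already holds the real character. It should behave the same on Python 2 and Python 3.3+:

```python
import unicodedata

# The escaped literal is just another spelling of the same text.
escaped = u'Fortuna D\xfcsseldorf'
same_text = u'Fortuna D\u00fcsseldorf'

char = escaped[9]                # the character right after the 'D'

print(escaped == same_text)      # both escape forms denote one value
print(hex(ord(char)))            # the codepoint number of that character
print(unicodedata.name(char))    # its official Unicode name
print(len(escaped))              # the escape counts as a single character
```

The escape sequence only exists in the source code and in the repr() output; the string itself contains one ü character.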
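If your terminal is configured correctly, you can just use print, and Python will encode the Unicode value to your terminal's codec, so the terminal shows you the non-ASCII glyph:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf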
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
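The relationship between the Unicode value and its UTF-8 bytes can be sketched as a round trip. This is only an illustration with made-up variable names, written so that it should run on both Python 2.6+ and Python 3.3+:

```python
text = u'Fortuna D\xfcsseldorf'

# Encoding turns the single codepoint U+00FC into two bytes, 0xC3 0xBC.
encoded = text.encode('utf-8')

# Decoding with the same codec recovers the original Unicode value.
decoded = encoded.decode('utf-8')

print(len(text), len(encoded))   # characters versus bytes
print(decoded == text)
```

The \xc3\xbc you see in the byte string is again just a representation of two bytes, in the same way that \xfc is a representation of one codepoint.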
The alternative is for you to upgrade to Python 3; there, repr() only escapes codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.). The new ascii() function still gives you the Python 2 repr() behaviour.
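As a rough illustration of that difference, the following Python 3 sketch (variable names are hypothetical) compares the two functions on the same string:

```python
text = 'Fortuna D\xfcsseldorf'

py3_repr = repr(text)      # keeps the printable ü as a glyph
py2_style = ascii(text)    # escapes every non-ASCII codepoint

print(py2_style)                      # ASCII-only, safe on any terminal
print(len(py3_repr), len(py2_style))  # the escape makes the second longer
print(repr('\x07'))                   # control codes are still escaped
```

Note that the Python 3 repr() output contains the ü itself, so printing it still requires a terminal that can display that character.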