I have a corpus of different texts from different languages.
I want to capture all characters. I use Python 2.7 and the default encoding setting is UTF-8.
I do not know why, when I use this code for a German umlaut, it prints the umlaut correctly:
You already have UTF-8 encoded data. There are no string literal characters to escape in your bytestring. You are looking at the
repr() output of a string, where bytes outside the printable ASCII range are shown as escape sequences because that makes the value easy to copy and paste in an ASCII-safe way. The
\xc3 you see is one byte, not four separate characters:
    >>> 'Erd\xC3\xA4pfel'
    'Erd\xc3\xa4pfel'
    >>> 'Erd\xC3\xA4pfel'[3]
    '\xc3'
    >>> 'Erd\xC3\xA4pfel'[4]
    '\xa4'
    >>> print 'Erd\xC3\xA4pfel'
    Erdäpfel
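To make the byte-versus-character distinction concrete, here is a small sketch. It is not part of the original session: the variable names are only illustrative, and it uses b'...' literals and slicing so that it should behave the same on Python 2.7 and Python 3.

```python
# A byte string literal: \xc3 and \xa4 are escape notation for one byte each.
data = b'Erd\xc3\xa4pfel'

# 9 bytes in total: 'E', 'r', 'd', 0xC3, 0xA4, 'p', 'f', 'e', 'l'
byte_count = len(data)

# Slicing (rather than indexing) returns a byte string on both Python 2 and 3.
first_utf8_byte = data[3:4]    # the single byte 0xC3
second_utf8_byte = data[4:5]   # the single byte 0xA4

# Together, those two bytes are the UTF-8 encoding of one character, U+00E4.
text = data.decode('utf-8')
char_count = len(text)         # 8 characters
```

The byte count and the character count differ by one because the umlaut takes two bytes in UTF-8.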
You'd have to use a raw string literal or doubled backslashes to actually get escape sequences that
unicode-escape would handle:
    >>> '\\xc3\\xa4'
    '\\xc3\\xa4'
    >>> '\\xc3\\xa4'[0]
    '\\'
    >>> '\\xc3\\xa4'[1]
    'x'
    >>> '\\xc3\\xa4'[2]
    'c'
    >>> '\\xc3\\xa4'[3]
    '3'
    >>> print '\\xc3\\xa4'
    \xc3\xa4
Note how there is a separate \ backslash character in that string (echoed as \\, escaped again).
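As a rough illustration of the difference between real bytes and literal escape text, the following sketch compares the three spellings. The variable names are made up for this example, and it should run unchanged on Python 2.7 and Python 3.

```python
escaped = '\\xc3\\xa4'     # doubled backslashes: 8 characters of plain text
raw = r'\xc3\xa4'          # raw string literal: the same 8 characters
real_bytes = b'\xc3\xa4'   # ordinary escapes in a byte string: just 2 bytes

same_text = (escaped == raw)    # both spell backslash, x, c, 3, backslash, x, a, 4
escaped_length = len(escaped)   # 8
real_length = len(real_bytes)   # 2
first_char = escaped[0]         # a single backslash character
```

Only the first two forms contain actual backslash characters for an escape-decoding codec to interpret.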
Besides interpreting actual escape sequences, the
unicode-escape codec decodes your data as Latin-1, so you end up with a Unicode string with the character U+00C3 LATIN CAPITAL LETTER A WITH TILDE in it. Encoding that back to Latin-1 bytes gives you the
\xC3 byte again, and you are back to UTF-8 bytes. Decoding those as UTF-8 then works correctly.
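The round trip described above can be traced step by step. This is only a sketch with illustrative variable names, written with b'...' and u'...' literals so it should behave the same on Python 2.7 and Python 3.

```python
data = b'Erd\xc3\xa4pfel'   # UTF-8 bytes; there are no literal backslashes in here

# unicode-escape finds no escape sequences, so each byte is read as Latin-1:
# 0xC3 becomes U+00C3 and 0xA4 becomes U+00A4.
as_latin1_text = data.decode('unicode_escape')

# Latin-1 maps U+00C3 back to 0xC3 and U+00A4 back to 0xA4,
# which restores the original UTF-8 bytes.
back_to_bytes = as_latin1_text.encode('latin-1')

# Decoding the restored bytes as UTF-8 gives the intended text.
text = back_to_bytes.decode('utf-8')
```

The first two steps cancel each other out, which is why this path only appears to work.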
But your second attempt encoded the U+00C3 LATIN CAPITAL LETTER A WITH TILDE codepoint to UTF-8, and that encoding gives you the byte sequence
\xc3\x83. Printing those bytes to your UTF-8 terminal shows the
Ã character. The other byte,
\xA4, became U+00A4 CURRENCY SIGN, and the UTF-8 byte sequence for that is
\xc2\xa4, which prints as ¤.
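The double encoding in that second attempt can be reproduced in isolation. The sketch below uses illustrative variable names and should run on both Python 2.7 and Python 3.

```python
data = b'Erd\xc3\xa4pfel'   # correct UTF-8 bytes

# Mis-decoding: each byte is read as a Latin-1 character (U+00C3, U+00A4).
mis_decoded = data.decode('unicode_escape')

# Encoding those two characters as UTF-8 takes two bytes each:
# U+00C3 -> 0xC3 0x83 and U+00A4 -> 0xC2 0xA4.
double_encoded = mis_decoded.encode('utf-8')

# A UTF-8 terminal decodes the bytes like this, so it displays the two
# Latin-1 characters instead of the single umlaut.
shown_on_terminal = double_encoded.decode('utf-8')
```

The result is two bytes longer than the original data, one extra byte for each mis-decoded character.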
There is absolutely no need to decode as
unicode-escape here. Just leave the data as is. Or, perhaps, decode as UTF-8 to get a unicode object:
    >>> 'Erd\xC3\xA4pfel'.decode('utf8')
    u'Erd\xe4pfel'
    >>> print 'Erd\xC3\xA4pfel'.decode('utf8')
    Erdäpfel
If your actual data (and not the test you did) contains
\xhh escape sequences that encode UTF-8, then don't use
unicode-escape to decode those sequences either. Use
string-escape so you get a byte string containing UTF-8 data (which you can then decode to Unicode as needed):
    >>> 'Erd\\xc3\\xa4pfel'
    'Erd\\xc3\\xa4pfel'
    >>> 'Erd\\xc3\\xa4pfel'.decode('string-escape')
    'Erd\xc3\xa4pfel'
    >>> 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
    u'Erd\xe4pfel'
    >>> print 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
    Erdäpfel
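The string-escape codec exists only in Python 2. If the same code also has to run on Python 3, one possible substitute is a small regular-expression helper like the sketch below. The name unescape_hex_bytes is hypothetical, and the helper is deliberately narrower than string-escape: it assumes the data contains only \xhh sequences and does not handle other escapes such as \n or an escaped backslash.

```python
import re

# Matches a literal backslash, an 'x', and exactly two hex digits.
_HEX_ESCAPE = re.compile(br'\\x([0-9a-fA-F]{2})')


def unescape_hex_bytes(data):
    """Replace each literal \\xhh sequence in a byte string with the byte it names.

    Hypothetical helper for illustration; it only understands \\xhh.
    """
    def to_byte(match):
        # bytes(bytearray([n])) builds a one-byte string on Python 2 and 3.
        return bytes(bytearray([int(match.group(1), 16)]))
    return _HEX_ESCAPE.sub(to_byte, data)


escaped = b'Erd\\xc3\\xa4pfel'              # literal backslashes in the data
utf8_bytes = unescape_hex_bytes(escaped)    # real UTF-8 bytes
text = utf8_bytes.decode('utf-8')           # Unicode text
```

As with string-escape, the result of the first step is still a byte string, so the UTF-8 decode remains a separate, explicit step.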