Python Question

How to capture all letters from different languages in python?

I have a corpus of different texts from different languages.

I want to capture all characters. I use python 2.7 and defaultencodingsetting is utf-8.

I do not know why, when I use this code, it prints the German umlaut correctly:

print 'Erd\xc3\xa4pfel'.decode('unicode-escape').encode('latin-1').decode('utf-8')

Result is:

Erdäpfel

but when I use this code:

print 'Erd\xc3\xa4pfel'.decode('unicode-escape').encode('utf-8')

result is:

ErdÃ¤pfel

which is different.

I am not familiar with text mining. I know that, for example, the latin-1 encoding does not contain some French letters, which is not acceptable for my project.
How can I convert all unicode escape strings in my corpus regardless of their language to respective character?

According to the documentation, UTF-8 covers all languages, so why does encoding to UTF-8 not print the German umlaut correctly, while the latin-1 encoding does?

PS: The case of the hex digits in the escape sequences does not matter. I have tried both lowercase and uppercase, and the results were the same.


You already have UTF-8 encoded data. There are no string literal escape sequences in your bytestring. You are looking at the repr() output of a string, where bytes outside the printable ASCII range are shown as escape sequences because that makes the value easily copy-pastable in an ASCII-safe way. The \xc3 you see is one byte, not four separate characters:

>>> 'Erd\xC3\xA4pfel'
'Erd\xc3\xa4pfel'
>>> 'Erd\xC3\xA4pfel'[3]
'\xc3'
>>> 'Erd\xC3\xA4pfel'[4]
'\xa4'
>>> print 'Erd\xC3\xA4pfel'
Erdäpfel

You'd have to use a raw string literal or doubled backslashes to actually get escape sequences that unicode-escape would handle:

>>> '\\xc3\\xa4'
'\\xc3\\xa4'
>>> '\\xc3\\xa4'[0]
'\\'
>>> '\\xc3\\xa4'[1]
'x'
>>> '\\xc3\\xa4'[2]
'c'
>>> '\\xc3\\xa4'[3]
'3'
>>> print '\\xc3\\xa4'
\xc3\xa4

Note how there is a separate \ backslash character in that string (echoed as \\, escaped again).

Next to interpreting actual escape sequences, the unicode-escape codec decodes your data as Latin-1, so you end up with a Unicode string containing the character U+00C3 LATIN CAPITAL LETTER A WITH TILDE. Encoding that back to Latin-1 gives you the \xC3 byte again, and you are back to UTF-8 bytes. Decoding those as UTF-8 then works correctly.
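The original question is Python 2, but the same round trip can be reproduced on Python 3 with bytes objects; a minimal sketch (variable names are mine):

```python
# UTF-8 bytes for 'Erdäpfel'; in Python 2 this would be a plain str.
data = b'Erd\xc3\xa4pfel'

# unicode_escape maps each remaining byte through Latin-1,
# so the byte \xc3 becomes the codepoint U+00C3 (Ã).
mojibake = data.decode('unicode_escape')
print(mojibake)                     # ErdÃ¤pfel

# Encoding back to Latin-1 restores the original UTF-8 bytes ...
restored = mojibake.encode('latin-1')
print(restored == data)             # True

# ... which then decode correctly as UTF-8.
print(restored.decode('utf-8'))     # Erdäpfel
```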

But your second attempt encoded the U+00C3 LATIN CAPITAL LETTER A WITH TILDE codepoint to UTF-8, and that encoding gives you the byte sequence \xc3\x83. Printing those bytes to your UTF-8 terminal shows the Ã character. The other byte, \xA4, became U+00A4 CURRENCY SIGN, and the UTF-8 byte sequence for that is \xc2\xa4, which prints as ¤.
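The faulty second step can also be reproduced on Python 3; a sketch of how the double encoding produces the mojibake:

```python
# Decoding UTF-8 bytes with unicode_escape yields the wrong
# codepoints: U+00C3 and U+00A4 instead of U+00E4 (ä).
wrong = b'Erd\xc3\xa4pfel'.decode('unicode_escape')

# Re-encoding those codepoints as UTF-8 doubles the encoding:
# U+00C3 -> \xc3\x83 and U+00A4 -> \xc2\xa4.
doubled = wrong.encode('utf-8')
print(doubled)                   # b'Erd\xc3\x83\xc2\xa4pfel'

# A UTF-8 terminal renders those bytes as the garbled result.
print(doubled.decode('utf-8'))   # ErdÃ¤pfel
```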

There is absolutely no need to decode as unicode-escape here. Just leave the data as is. Or, perhaps, decode as UTF-8 to get a unicode object:

>>> 'Erd\xC3\xA4pfel'.decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\xC3\xA4pfel'.decode('utf8')
Erdäpfel

If your actual data (and not the test you did) contains \xhh escape sequences that encode UTF-8 bytes, then don't use unicode-escape to decode those sequences either. Use string-escape, so you get a byte string containing UTF-8 data (which you can then decode to Unicode as needed):

>>> 'Erd\\xc3\\xa4pfel'
'Erd\\xc3\\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape')
'Erd\xc3\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
Erdäpfel
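For readers on Python 3, the string-escape codec no longer exists; a hedged equivalent (one of several possible approaches) is to interpret the escapes with unicode_escape and then round-trip through Latin-1 to recover the raw UTF-8 bytes:

```python
# Bytes containing literal backslash escape sequences, as they
# might appear in a text dump of the corpus.
raw = b'Erd\\xc3\\xa4pfel'

# unicode_escape interprets the \xhh sequences, but yields text;
# encoding via Latin-1 turns each codepoint back into one byte.
utf8_bytes = raw.decode('unicode_escape').encode('latin-1')
print(utf8_bytes)                  # b'Erd\xc3\xa4pfel'

# Now the bytes decode as ordinary UTF-8.
print(utf8_bytes.decode('utf-8'))  # Erdäpfel
```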