moisespedro - 18 days ago
Python Question

Why isn't my script printing Unicode characters correctly?

I am working with Twitter data and I have a file with a bunch of tweets in it, one per line. Most of those tweets were written in Portuguese, so they contain special characters such as "é" and "á".

I am trying to filter stop words from the file and tokenize the tweets, but after processing, my script does not print the special characters correctly.

Example:


AT_USER pra concurso público, tô entrando nessas agora porque emprego bom tá foda


Becomes:


[u'pra', u'concurso', u'p\xfablico', u't\xf4', u'entrando', u'nessas', u'agora', u'porque', u'emprego', u'bom', u't\xe1', u'foda']


Why do I have this "u" before each token? And why does "ú" become "\xfa"?

How do I get tokens without the "u" and with the accented characters printed correctly?

Here in this gist you can check the text before, after and the script I've used.

Thank you :)

Answer

You have a list

>>> l = [u'pra', u'concurso', u'p\xfablico', u't\xf4', u'entrando', u'nessas', u'agora', u'porque', u'emprego', u'bom', u't\xe1', u'foda']

And when you print the list, the words look weird

>>> print l
[u'pra', u'concurso', u'p\xfablico', u't\xf4', u'entrando', u'nessas', u'agora', u'porque', u'emprego', u'bom', u't\xe1', u'foda']

But if you print the words one by one, they look fine

>>> for word in l:
...     print word
... 
pra
concurso
público
tô
entrando
nessas
agora
porque
emprego
bom
tá
foda
>>> 

When you print a list, Python prints a representation of the list that is meant for programmers, so they can see what the object is: each element is shown with repr() rather than as plain text. It's got brackets and quotes and... a "u" to tell you each item is a unicode string instead of a regular byte string (str). You see the escaped version of the non-ASCII characters (such as \xfa for "ú") because that representation sticks to ASCII. If you evaluate the printed string as a Python expression, you even get the original list back!

>>> l2 = eval(r"[u'pra', u'concurso', u'p\xfablico', u't\xf4', u'entrando', u'nessas', u'agora', u'porque', u'emprego', u'bom', u't\xe1', u'foda']")
>>> l == l2
True
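
The two views of a token can also be compared directly in code. This is only a sketch and it assumes Python 3, where ascii() produces the escaped form that Python 2 shows inside a printed list (minus the "u" prefix). The variable names are made up for illustration.

```python
# Sketch for Python 3: the readable view and the escaped view of one token.
word = 'p\xfablico'       # the same text as "público", written with an escape

readable = str(word)      # the text that print(word) writes
escaped = ascii(word)     # ASCII-only representation, quotes and escapes included

print(readable)
print(escaped)

# The escaped form is valid Python source, so evaluating it gives the word back.
assert eval(escaped) == word
```

Both variables describe the same seven-character string; only the way it is displayed differs.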

All is well! You are just getting the geek-view of the list.
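
If you want the tokens on one line without the brackets, quotes and "u" prefixes, one option is to join them into a single string and print that instead of the list. A minimal sketch, using a shortened copy of the list above (the names tokens and line are just for illustration); it is written so that it should run on Python 2.7 and on Python 3.3+:

```python
# A few tokens from the question; the u prefix is accepted by
# Python 2 and by Python 3.3 and later.
tokens = [u'pra', u'concurso', u'p\xfablico', u't\xf4', u'entrando']

# Joining builds one unicode string, so print writes the text itself
# rather than the escaped representation of each element.
line = u' '.join(tokens)
print(line)
```

On a terminal that can display the characters, this prints the words separated by spaces with the accents intact.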

Python 3 does a much better job of handling Unicode. Unless you have a reason to stick with 2.x, move!
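
For comparison, here is a short sketch of what the same kind of list looks like under Python 3, where every string is a Unicode string and repr() leaves printable non-ASCII characters as they are. The variable names are illustrative only.

```python
# Python 3 only: plain string literals are already Unicode.
tokens = ['pra', 'concurso', 'p\xfablico', 't\xf4']

# repr() of a list is the text that print(tokens) writes.
shown = repr(tokens)
print(shown)
```

The printed list has no "u" prefixes and shows "público" and "tô" with their accents, assuming the terminal can display them.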