GLHF GLHF - 6 months ago 83
Python Question

Python 3 UnicodeEncodeError for characters and smileys in Tweets

I'm working with the Twitter API and fetching tweets that contain a specific word (right now it's 'flafel'). Everything is fine except for this tweet:


b'And when I\'m thinking about getting the chili sauce on my flafel
and the waitress, a Pinay, tells me not to get it cos "hindi yan
masarap."\xf0\x9f\x98\x82'


I use
print("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))
to see tweets, but this one gives me a UnicodeEncodeError every time. If I remove
decode()
from that line, like
print("Tweet info: {}".format(str(tweet.text).encode('utf-8')))
I can see the actual tweet as shown above, but I want to convert the
\xf0\x9f\x98\x82
part to a str. I tried everything, every combination of decode and encode, etc. How can I solve this problem?

Edit: I went to that user's Twitter account to see what the non-ASCII part is, and it turns out to be a smiley (the 😂 emoji).

Is it possible to convert that smiley?

Edit 2: The code is:

...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q="flafel",
                           result_type="recent",
                           include_entities=True,
                           lang="en").items():

    print("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))

Answer

The problem most likely arises when you try to display the Unicode character \U0001f602 on Windows. Python 3 has no trouble converting it from UTF-8 to a Unicode string and back again, but Windows is not able to display it.
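To make the first half of that claim concrete, here is a minimal sketch (the variable names are only illustrative) showing that the four trailing bytes of the tweet decode to a single character and encode back unchanged. It prints only ASCII, so it should not trigger the display problem itself:

```python
import unicodedata

raw = b"\xf0\x9f\x98\x82"      # the trailing bytes from the tweet
ch = raw.decode("utf-8")        # four UTF-8 bytes become one str character

code_point = hex(ord(ch))       # numeric value of that character
name = unicodedata.name(ch)     # official Unicode name
round_trip = ch.encode("utf-8") # back to the original bytes

# Only ASCII is printed here, so any console can show it.
print(len(ch), code_point, name, round_trip == raw)
```

The code point is above 0xFFFF, which is what matters for the errors described in this answer.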

I tried this piece of code in different ways on a Windows 7 box:

>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)

And here is what happened:

  • in IDLE (the Python GUI interpreter based on Tk), I got this error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk

  • in a console using a non unicode codepage I got this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>

(For the attentive reader: BMP here means Basic Multilingual Plane.)

  • in a console using utf-8 codepage (chcp 65001) I got no error but a weird display:

    >>> u
    'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
    ess, a Pinay, tells me not to get it cos "hindi yan masarap."😂'
    >>> print(u)
    And when I'm thinking about getting the chili sauce on my flafel and the waitres
    s, a Pinay, tells me not to get it cos "hindi yan masarap."😂
    >>>
    

My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, and neither does any console code page (except for 65001, which simply tries to display the individual UTF-8 bytes!)

TL;DR: The problem is not in core Python processing nor in the UTF-8 converter, but only in the system conversion that is used to display the character '\U0001f602'.

Fortunately, since core Python has no problem with it, you can easily replace the offending '\U0001f602' with, for example, a ':D' using a simple str.replace (continuing from the code shown above):

>>> print (u.replace(U'\U0001f602', ':D'))
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D
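If replacing specific characters by hand is not practical, another option is to let the codec escape whatever the output encoding cannot represent. The following is only a sketch: console_safe is a hypothetical helper name, and cp437 is used purely as an example of a non-Unicode console code page.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding can't represent escaped.

    Hypothetical helper: defaults to the encoding of sys.stdout and falls
    back to ASCII when stdout does not report one.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX / \UXXXXXXXX
    # escapes instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# cp437 cannot represent the emoji, so it comes out as an escape sequence.
print(console_safe("masarap.\U0001f602", "cp437"))
```

With an encoding that can represent the character, such as UTF-8, the text is returned unchanged, so this only helps with the encoding error and not with fonts or terminals that cannot draw the glyph.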

If you want special processing for all characters outside the BMP, it is enough to know that the highest code point in the BMP is 0xFFFF. So you could use code like this:

import io

def convert(t):
    with io.StringIO() as fd:
        for c in t:  # replace all chars outside BMP with a !
            dummy = fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
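The same filtering can also be written with a regular expression. This is an alternative sketch rather than part of the original answer; replace_non_bmp and its placeholder parameter are names chosen here for illustration.

```python
import re

# Matches any single character above the BMP (U+10000 through U+10FFFF).
_NON_BMP = re.compile("[\U00010000-\U0010ffff]")

def replace_non_bmp(text, placeholder="!"):
    """Replace every character outside the BMP with the placeholder."""
    # A function is used as the replacement so the placeholder is inserted
    # literally, even if it contains backslashes.
    return _NON_BMP.sub(lambda match: placeholder, text)

print(replace_non_bmp('cos "hindi yan masarap."\U0001f602'))
```

Characters inside the BMP, including accented letters, pass through untouched; only code points of 0x10000 and above are replaced.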