Aditya - 2 months ago
Python Question

How to solve this weird python encoding issue?

I'm doing an NLP task on a corpus of strings from the web and, as you'd expect, there are encoding issues. Here are a few examples:

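they don’t serve sushi : the apostrophe in don't is not the standard ' but \xe2\x80\x99 (the UTF-8 bytes of U+2019, the right single quotation mark)
Delicious food – Wow : the dash before Wow is not a plain hyphen but \xe2\x80\x93 (the UTF-8 bytes of U+2013, the en dash)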

What I want to do is read such lines, pass them to NLTK for parsing, and use the parse information to train a CRF model with MALLET.

Let's begin with the solution I've been seeing everywhere on Stack Overflow. Here are a few experiments:

st = "they don’t serve sushi"

Out[2]: 'they don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t serve sushi'

Out[3]: u'they don\u2019t serve sushi'

So these are just trial-and-error attempts to see if something might work.

I finally took the encoded sentence and passed it to the next step, POS tagging with NLTK:
posTags = nltk.pos_tag(tokens)
This throws the familiar exception:

File "C:\Users\user\workspacePy\_projectname_\CRF\", line 95, in getSentenceFeatures
posTags = nltk.pos_tag(tokens)
File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\", line 101, in pos_tag
return tagger.tag(tokens)
File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\", line 61, in tag
tags.append(self.tag_one(tokens, i, tags))
File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\", line 81, in tag_one
tag = tagger.choose_tag(tokens, index, history)
File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\", line 634, in choose_tag
featureset = self.feature_detector(tokens, index, history)
File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\", line 736, in feature_detector
'prevtag+word': '%s+%s' % (prevtag, word.lower()),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

And when I try decoding instead, I get
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)
on the line where I decode the string.

So my current solution is to remove all the non-ASCII characters. But that changes the affected words entirely, which causes a serious loss of data for a unigram/bigram (word-combination) based model.

What should be the right approach?


In your example, st is a str, which in Python 2 is a sequence of bytes. Those bytes were produced by encoding the text in some way (UTF-8, by the looks of it), and you need to know which encoding was used in order to decode them (though UTF-8 is generally a good first guess).

>>> st = "they don’t serve sushi"
>>> st
'they don\xe2\x80\x99t serve sushi'
>>> type(st)
<type 'str'>

>>> st.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

So st.encode makes no sense here: the data is already encoded (as UTF-8, by the looks of things). For some mad reason, in Python 2 str.encode will first implicitly decode the bytes into a unicode object and then encode that back to a str. The implicit decode uses ASCII by default, but your data is encoded as UTF-8. So the error you're seeing comes from the decode step of your encode operation! It's looking at the bytes e2, 80, 99 and saying 'hmmm, those aren't ASCII characters'.
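What you want on a byte string is decode, with the codec named explicitly. Here is a minimal sketch, assuming the corpus bytes really are UTF-8; the byte literal stands in for a line read from the web, and it is written so that it runs the same on Python 2 and Python 3. It also shows that decoding with the wrong codec (cp1252 is used here purely as an illustration) does not fail but silently yields mojibake; the resulting bytes happen to match the Out[2] value in the question.

```python
# Bytes as they might arrive from the web, assumed to be UTF-8 encoded.
raw = b'they don\xe2\x80\x99t serve sushi'

# Decode once, naming the codec explicitly, to get real unicode text.
text = raw.decode('utf-8')
print(repr(text))

# Guessing the wrong codec raises no error here; each of the three bytes
# is mapped to its own (wrong) character instead.
wrong = raw.decode('cp1252')

# Re-encoding that wrong text as UTF-8 gives the garbled byte sequence.
garbled = wrong.encode('utf-8')
print(repr(garbled))
```

If decode('utf-8') raises UnicodeDecodeError on real data, that is a sign the source used some other encoding and the guess needs revisiting.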

Let's start with unicode data instead (notice the u):

>>> st = u"they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>
>>> st.encode('utf8')
'they don\xe2\x80\x99t serve sushi'

Really, all this is Python 2's fault. Python 3 won't let you get away with these shenanigans of treating byte strings and unicode text as the same thing.

The rule of thumb is: always work with unicode within your code. Only decode/encode when you're getting data into and out of the system, and generally encode as UTF-8 unless you have some other specific requirement.
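As a rough sketch of that rule, io.open (available in both Python 2 and 3) can do the decoding and encoding at the file boundary so that everything in between is unicode. The file names and sample lines below are made up for illustration, and UTF-8 is an assumption about the corpus.

```python
import io
import os
import tempfile

# Hypothetical file standing in for the scraped corpus (UTF-8 bytes on disk).
path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with io.open(path, 'wb') as f:
    f.write(b'they don\xe2\x80\x99t serve sushi\n'
            b'Delicious food \xe2\x80\x93 Wow\n')

# Input boundary: io.open decodes the bytes into unicode while reading.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.rstrip(u'\n') for line in f]

# Inside the program everything is unicode, so text operations are safe.
lowered = [line.lower() for line in lines]

# Output boundary: io.open encodes unicode back to UTF-8 while writing.
out_path = path + '.out'
with io.open(out_path, 'w', encoding='utf-8') as f:
    for line in lowered:
        f.write(line + u'\n')
```

The point of the design is that no str/unicode mixing happens in the middle, which is where the implicit ASCII conversions come from.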

In Python 2 you can make plain string literals such as 'data' in your code automatically unicode (u'data') with:

from __future__ import unicode_literals

>>> st = "they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>
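Note that unicode_literals only affects literals written in your source; data read from files or the web still arrives as bytes. For that case, a small helper along these lines can normalise tokens before they reach a tagger. This is only a sketch: to_unicode and the sample tokens are hypothetical names for illustration, UTF-8 is assumed, and the last line merely imitates the '%s+%s' formatting visible in the question's traceback rather than calling NLTK.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as unicode text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Tokens as they might look when read as raw bytes (hypothetical sample).
byte_tokens = [b'they', b'don\xe2\x80\x99t', b'serve', b'sushi']
tokens = [to_unicode(tok) for tok in byte_tokens]

# With unicode tokens, lower() and % formatting no longer trigger an
# implicit ASCII decode of non-ASCII bytes.
feature = u'%s+%s' % (u'PRP', tokens[1].lower())
print(repr(feature))
```

Passing already-decoded text through the helper is harmless, so it can be applied to every token regardless of where it came from.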