Mohammadreza Mohammadreza - 1 year ago 57
Python Question

Encoding error in POS tagging with NLTK 3.0 on Python 3.4

I am using NLTK 3.0 with Python 3.4 and cannot do POS tagging because of the UnicodeDecodeError shown below.
I have read the similar posts about this problem but could not find a way to solve it. Most of them say that upgrading to NLTK 3.0 will solve the problem, but I already have NLTK 3.0. According to these posts, a change in NLTK's data.py solves the problem, but the NLTK people discourage doing that.
Here is my code:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad."))


and here is the error:


UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)


Is there any way to do it without manipulating data.py?
Any idea would be appreciated.

Answer Source

The current version of nltk_data provides two versions of the pickle files: one for Python 2 and one for Python 3. For example, there is one english.pickle at nltk_data/taggers/maxent_treebank_pos_tagger and another at nltk_data/taggers/maxent_treebank_pos_tagger/PY3. The newest nltk picks the right one automatically by means of a decorator, py3_data.
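To make the directory layout concrete, here is a minimal sketch of the path mapping described above. The helper name py3_resource_path is hypothetical and is not part of NLTK; it only mirrors the layout from this answer (a PY3 folder next to the Python 2 pickle) and does not reproduce how py3_data is actually implemented.

```python
import sys
from pathlib import PurePosixPath


def py3_resource_path(resource, py3=sys.version_info[0] >= 3):
    """Return the pickle path a Python-3-aware loader would be expected to use.

    Hypothetical helper for illustration only (not an NLTK function).
    It inserts a "PY3" folder before the file name of a .pickle resource.
    """
    path = PurePosixPath(resource)
    # Under Python 2, or for non-pickle resources, the path is left unchanged.
    if not py3 or path.suffix != ".pickle":
        return str(path)
    # If the path already points into a PY3 folder, do not add a second one.
    if path.parent.name == "PY3":
        return str(path)
    return str(path.parent / "PY3" / path.name)


print(py3_resource_path("taggers/maxent_treebank_pos_tagger/english.pickle", py3=True))
```

If the file at the printed location is missing from your nltk_data folder, the data was probably downloaded before the Python 3 pickles were added.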

In short, if you download the newest nltk_data but don't have the newest nltk, it may load the wrong pickle file (the Python 2 one), raising the UnicodeDecodeError exception.

Note: even if you already have the newest nltk, you may still encounter a path error in which "PY3" appears twice in the path of the pickle file. This may mean that some developers were not aware of py3_data and handled the path redundantly. You can remove or revert the redundant handling yourself. See this pull request for an example.
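As a rough illustration of the doubled-"PY3" symptom, the sketch below collapses consecutive PY3 segments in a resource path. The function name collapse_duplicate_py3 is hypothetical and not an NLTK API. It is only meant to help recognize the problem in an error message; the real fix is to remove the redundant path handling in the code that builds the path.

```python
from pathlib import PurePosixPath


def collapse_duplicate_py3(resource):
    """Collapse consecutive "PY3" folders in a resource path into one.

    Hypothetical diagnostic helper (not part of NLTK).
    """
    cleaned = []
    for part in PurePosixPath(resource).parts:
        # Skip a "PY3" segment that directly follows another "PY3" segment.
        if part == "PY3" and cleaned and cleaned[-1] == "PY3":
            continue
        cleaned.append(part)
    return str(PurePosixPath(*cleaned))


broken = "taggers/maxent_treebank_pos_tagger/PY3/PY3/english.pickle"
print(collapse_duplicate_py3(broken))
```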
