I am using NLTK 3.0. When I run the following code:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad."))

I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

The traceback points into NLTK's data.py. What is causing this, and how can I fix it?
In the current version of nltk_data, two versions of the pickle files are provided: one for Python 2 and one for Python 3. For example, there is one english.pickle at nltk_data/taggers/maxent_treebank_pos_tagger and another at nltk_data/taggers/maxent_treebank_pos_tagger/PY3. The newest nltk handles this automatically via the py3_data decorator.
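
If you want to check your local layout, here is a minimal sketch (assuming the tagger has been unzipped into one of the nltk_data directories NLTK searches; the directory names follow the layout described above):

import os
import nltk

print(nltk.__version__)  # the PY3-aware loading ships with the newer nltk releases

# nltk.data.path lists every place NLTK searches for nltk_data;
# find the tagger directory and list both pickle variants.
for base in nltk.data.path:
    tagger_dir = os.path.join(base, 'taggers', 'maxent_treebank_pos_tagger')
    if os.path.isdir(tagger_dir):
        print(tagger_dir, os.listdir(tagger_dir))           # expect english.pickle and a PY3 subdirectory
        print(os.listdir(os.path.join(tagger_dir, 'PY3')))  # expect the Python 3 english.pickle
        break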
In short, if you download the newest nltk_data but don't have the newest nltk, it may load the wrong pickle file, raising the UnicodeDecodeError exception.
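
Roughly, the mismatch can be seen outside NLTK as well: unpickling the Python 2 copy of english.pickle directly under Python 3 tends to fail with the same kind of error, because its byte strings cannot be decoded as ASCII. The snippet below is only an illustration, and the absolute path is a placeholder:

import pickle

# Placeholder path: point this at the Python 2 copy inside your own nltk_data.
py2_pickle = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'

with open(py2_pickle, 'rb') as f:
    tagger = pickle.load(f)  # under Python 3 this raises a UnicodeDecodeError like the one above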
Note: if you already have the newest nltk, you may still encounter a path error where "PY3" appears twice in the path of the pickle file. This can mean some developers were not aware of py3_data and handled the path redundantly. You can remove/revert the redundancy yourself; see this pull request for an example.
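
If you are not sure whether you are hitting that redundancy, a quick check is to look for a doubled PY3 segment in the resource path shown by the traceback (the helper below is only for illustration):

def has_redundant_py3(path):
    # True when the PY3 directory appears twice in a row in the pickle path.
    return 'PY3/PY3' in path.replace('\\', '/')

print(has_redundant_py3('nltk_data/taggers/maxent_treebank_pos_tagger/PY3/PY3/english.pickle'))  # True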