mel mel - 8 days ago

Encoding issue using NLTK

I'm crawling a far-right website for my research on hate and racism detection, so the content of my example may be offensive.

I'm trying to remove some stopwords and punctuation in Python using NLTK, but I've run into an encoding problem. I'm using Python 2.7, and the data come from a file that I filled with articles from the website I crawled:

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break


And the output looks like this (I've added ... to shorten the sample):

<type 'str'> A Negress Bernie ... they’re not going to take it anymore.

['a', 'negress', 'bernie', ... , 'they\u2019re', 'not', 'going', 'to', 'take', 'it', 'anymore', '.']


I don't understand why this '\u2019' is there; it shouldn't be. Can someone tell me how to get rid of it? I tried encoding to UTF-8, but I still get the same problem.

Answer
stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    # decode the crawled bytes first (assuming UTF-8), then drop
    # anything outside ASCII with the 'ignore' error handler
    value = value.decode('utf-8', 'ignore').encode('ascii', 'ignore')
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break
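Note that encoding with `'ignore'` silently deletes the character, so u'they\u2019re' becomes "theyre". If that matters for your tokenization, an alternative sketch (names and the punctuation map are my own, not from NLTK) is to translate common typographic punctuation back to its ASCII equivalent before tokenizing:

```python
# -*- coding: utf-8 -*-
# Sketch: map typographic punctuation to ASCII instead of deleting it,
# so contractions like u'they\u2019re' keep their apostrophe.
# PUNCT_MAP and to_ascii_punct are hypothetical helpers, not part of NLTK.

PUNCT_MAP = {
    u'\u2018': u"'",  # left single quotation mark
    u'\u2019': u"'",  # right single quotation mark / apostrophe
    u'\u201c': u'"',  # left double quotation mark
    u'\u201d': u'"',  # right double quotation mark
    u'\u2013': u'-',  # en dash
    u'\u2014': u'-',  # em dash
}

def to_ascii_punct(text):
    # text must be a unicode string (decode your bytes first)
    for fancy, plain in PUNCT_MAP.items():
        text = text.replace(fancy, plain)
    return text

sample = u'they\u2019re not going to take it anymore.'
cleaned = to_ascii_punct(sample)
```

After this, `nltk.word_tokenize(cleaned.lower())` would produce "they're" as a single recognizable token instead of "theyre".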