Jialun Liu - 2 months ago
Python Question

How to use the NLTK Snowball stemmer to stem a list of Spanish words in Python

I am trying to use the NLTK Snowball stemmer to stem Spanish text, and I ran into some encoding issues that I don't understand.

Here's an example sentence I am trying to operate on:


En diciembre, los precios de la energía subieron un 1,4 por ciento, los de la vivienda aumentaron un 0,1 por ciento y los precios de la vestimenta se mantuvieron sin cambios, mientras que los de los automóviles nuevos bajaron un 0,1 por ciento y los de los pasajes de avión cayeron el 0,7 por ciento.


First, I read the sentence from an XML file using this code:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")
sentence = ET.tostring(context, encoding='utf-8', method="text").lower()


Then, after tokenizing the sentence to get a list of words, I tried to stem each word:

stem = stemmer.stem(words[headIndex - index])


And the error is coming from this line:

Traceback (most recent call last):
  File "main.py", line 150, in <module>
    main()
  File "main.py", line 142, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 86, in englishXml
    stem = stemmer.stem(words[headIndex - index])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 3404, in stem
    r1, r2 = self._r1r2_standard(word, self.__vowels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 232, in _r1r2_standard
    if word[i] not in vowels and word[i-1] in vowels:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)


I also tried to read the sentence from the XML file without the "utf-8" encoding, but in that case the tostring() call itself fails, so ".lower()" never gets to run:

sentence = ET.tostring(context, method="text").lower()


And the error in this case becomes:

Traceback (most recent call last):
  File "main.py", line 154, in <module>
    main()
  File "main.py", line 146, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 63, in englishXml
    sentence = ET.tostring(context, method="text").lower()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 814, in write
    _serialize_text(write, self._root, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1006, in _serialize_text
    write(part.encode(encoding))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 18: ordinal not in range(128)


Thanks in advance!

Answer

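Try adding this before stemming. ET.tostring(..., encoding='utf-8') returns UTF-8 encoded bytes, while the stemmer compares each character against unicode vowels, so it needs a unicode string: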

sentence = sentence.decode('utf8')
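
To see the whole step in one place, here is a minimal sketch of reading the text out of an element, decoding it, and only then lowercasing and splitting it. The XML snippet, the `XML_BYTES` name, and the `element_to_words` helper are made up for illustration (they are not from the question's `main.py`), and plain whitespace splitting stands in for whatever tokenizer you actually use.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML file in the question.
XML_BYTES = u"<context>Los precios de la <head>energía</head> subieron.</context>".encode("utf-8")


def element_to_words(element):
    """Return the element's text as a list of lowercased unicode words."""
    # tostring(..., encoding="utf-8") gives UTF-8 encoded bytes, not text.
    raw = ET.tostring(element, encoding="utf-8", method="text")
    # Decode first, then lowercase, so accented letters are handled as characters.
    text = raw.decode("utf-8").lower()
    # Naive whitespace tokenization, only for this illustration.
    return text.split()


context = ET.fromstring(XML_BYTES)
words = element_to_words(context)
print(words)
```

Each item in `words` is now a unicode string, so passing it to `stemmer.stem(...)` with the `SnowballStemmer("spanish")` from the question should no longer trigger the implicit ASCII decode. Decoding before `.lower()` also matters on Python 2, where `.lower()` on a byte string only affects ASCII letters. If you are on Python 3, `ET.tostring(element, encoding="unicode", method="text")` should give you a text string directly, without the separate decode.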