Cody Reandeau - 2 months ago
Python Question

Eliminating stop words from a text, while NOT deleting duplicate regular words

I'm trying to create a list of the 50 most common words in a specific text file; however, I want to eliminate the stop words from that list. I have done that using this code:

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

carroll = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
carroll_list = FreqDist(carroll)  # word -> occurrence count
stops = set(stopwords.words("english"))
filtered_words = [word for word in carroll_list if word not in stops]


However, this is deleting the duplicates of the words I want. For example, when I do this:

fdist = FreqDist(filtered_words)
fdist.most_common(50)


I get the output:

[('right', 1), ('certain', 1), ('delighted', 1), ('adding', 1),
('work', 1), ('young', 1), ('Up', 1), ('soon', 1), ('use', 1),
('submitted', 1), ('remedies', 1), ('tis', 1), ('uncomfortable', 1), ...]


It says there is only one instance of each word, so it clearly eliminated the duplicates. I want to keep the duplicates so I can see which word is most common. Any help would be greatly appreciated.

Answer

As you have it written now, carroll_list is already a frequency distribution (a FreqDist) containing the words as keys and their occurrence counts as values:

>>> carroll_list
FreqDist({u',': 1993, u"'": 1731, u'the': 1527, u'and': 802, u'.': 764, u'to': 725, u'a': 615, u'I': 543, u'it': 527, u'she': 509, ...})
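
Iterating over a FreqDist, like iterating over any dict-style mapping, yields each distinct key exactly once. A tiny standalone illustration (hypothetical data, just to show the behaviour):

from nltk import FreqDist

fd = FreqDist(['a', 'b', 'a'])
print(fd)               # FreqDist({'a': 2, 'b': 1})
print([w for w in fd])  # ['a', 'b']: each key once, the counts are gone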

Your list comprehension therefore iterates over those keys, so each word appears only once in filtered_words. I believe you actually want to create filtered_words from the raw token sequence, like this:

filtered_words = [word for word in carroll if word not in stops]

Also, try to avoid variable names that match Python built-in functions (for example, list); shadowing a built-in can lead to confusing errors later.
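
For completeness, here is a minimal end-to-end sketch of the corrected pipeline. It assumes NLTK is installed and that the gutenberg and stopwords corpora have been downloaded (e.g. with nltk.download('gutenberg') and nltk.download('stopwords')):

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

# The raw token sequence: duplicates are preserved here
carroll = nltk.corpus.gutenberg.words('carroll-alice.txt')

stops = set(stopwords.words("english"))
# Filter the token stream itself, not the keys of a FreqDist
filtered_words = [word for word in carroll if word not in stops]

# The counts now survive the filtering step
fdist = FreqDist(filtered_words)
print(fdist.most_common(50))

Depending on what you want to count, two optional refinements: lowercase each token before the membership test (the NLTK stop-word list is all lowercase, so capitalized tokens like "The" slip through otherwise), and skip punctuation tokens with str.isalpha().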