natalie natalie - 3 days ago 4
Python Question

Filter words from one text file out of another text file?

I have a file, filterlist.txt, that is a list of words, one word on each line.
The other file, ttext.txt, is a giant string of text.

I want to find all the instances of the words from filterlist.txt in text.txt and delete them.

Here is what I have so far:

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()

for line in text:
    for word in filter_words:
        if word == filter_words:
            text.remove(word)

Answer

Store the filter words in a set, iterate over the words from the line in ttext.txt, and only keep the words that are not in the set of filter words.

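with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # One filter word per line; strip the trailing newline from each.
    st = set(map(str.rstrip, filter_words))
    # ttext.txt is assumed to be one long line, so next() reads all of it.
    txt = next(text).split()
    out = [word for word in txt if word not in st]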
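
As a rough, self-contained illustration of the same set-based idea, the sketch below works on in-memory data instead of files so it can be run as-is. The function name remove_filtered_words and the sample strings are made up for this example; they are not part of the original answer.

```python
def remove_filtered_words(text, filter_lines):
    """Return text with every word that appears in filter_lines removed."""
    # Build the set once; membership tests on a set are O(1) on average.
    blocked = {line.strip() for line in filter_lines if line.strip()}
    kept = [word for word in text.split() if word not in blocked]
    # Joining with a single space collapses any original runs of whitespace.
    return " ".join(kept)


# Hypothetical sample data standing in for ttext.txt and filterlist.txt.
sample_text = "the quick brown fox jumps over the lazy dog"
sample_filter = ["the\n", "lazy\n"]
print(remove_filtered_words(sample_text, sample_filter))
# quick brown fox jumps over dog
```

To apply this to the real files, you could pass open('ttext.txt').read() as the text and the open filterlist.txt file object as filter_lines, then write the returned string wherever you need it.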

If you want to ignore case and punctuation, you will need to call lower on each line and strip the punctuation from each word:

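from string import punctuation
with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # Lowercase each filter word and strip trailing punctuation and the newline.
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    # Compare with surrounding punctuation stripped so "word," still matches "word".
    out = [word for word in txt if word.strip(punctuation) not in st]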
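
If you would rather keep the original casing and punctuation of the words that survive, one possible variation is to normalize only for the comparison. This is a sketch under that assumption; normalize and remove_ignoring_case are hypothetical helper names, not something from the code above.

```python
from string import punctuation


def normalize(word):
    """Strip surrounding punctuation and lowercase, for comparison only."""
    return word.strip(punctuation).lower()


def remove_ignoring_case(text, filter_lines):
    """Return the words of text whose normalized form is not a filter word."""
    blocked = {normalize(line.strip()) for line in filter_lines}
    # Drop the empty string so blank filter lines do not match pure punctuation.
    blocked.discard("")
    return [word for word in text.split() if normalize(word) not in blocked]


# Hypothetical sample data.
print(remove_ignoring_case("Hello, world! Goodbye.", ["hello\n", "GOODBYE\n"]))
# ['world!']
```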

If you had multiple lines in ttext.txt, using the generator expression (word for line in text for word in line.split()) would be a more memory-efficient approach, since it reads one line at a time instead of the whole file.
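
A minimal sketch of that multi-line approach follows. It uses io.StringIO as a stand-in for the open file so it runs without any files on disk; iter_kept_words and the sample data are assumptions made for illustration.

```python
import io


def iter_kept_words(lines, blocked):
    """Lazily yield each word from lines that is not in the blocked set."""
    # Only one line is held in memory at a time while iterating.
    return (word for line in lines for word in line.split() if word not in blocked)


# Hypothetical multi-line stand-in for ttext.txt.
text_file = io.StringIO("one two three\nfour two five\n")
blocked = {"two"}
print(list(iter_kept_words(text_file, blocked)))
# ['one', 'three', 'four', 'five']
```

With a real file you would pass the file object from open('ttext.txt') directly, and consume the generator in a loop rather than with list() if the output is large.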
