IronBat - 1 year ago
Python Question

How to remove stop words using string.replace()

I have a text file for which I am counting the number of lines, characters, and words. How can I clean the data by removing stop words (such as "the", "for", "a") using string.replace()?

For example, if the text file contains the line:

"The only words to count are Apple and Grapes for this text"

It should output:

2 Apple
2 Grapes
1 words
1 only
1 text

And should not output words like:

  • the

  • to

  • are

  • for

  • this

Below is the code I have as of now.

# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()

num_chars = len(fname)

num_lines = fname.count('\n')

fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word has been seen before, increment its count
    if w in d:
        d[w] += 1
    # otherwise this is its first occurrence, so start its count at 1
    else:
        d[w] = 1

# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')

# print each word with its rank and the number of times it appears in the text file
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1


Answer

After opening and reading the file (fname = open('2013_honda_accord.txt', 'r').read()), you can place this code:

blacklist = ["the", "to", "are", "for", "this"]  # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(word, "")

# The above causes multiple spaces in the text (e.g. '  Apple    Grapes  Apple')
while "  " in fname:
    fname = fname.replace("  ", " ")  # Replace double spaces by one while double spaces are in text
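
One caveat with the snippet above: str.replace() matches substrings, not whole words, so it also cuts stop words out of longer words. The sketch below wraps the same two loops in a function to make that visible. The function name remove_substrings and the sample sentence are made up for illustration and are not part of the original code.

```python
def remove_substrings(text, blacklist):
    """Remove every occurrence of each blacklisted string, then collapse double spaces."""
    for word in blacklist:
        text = text.replace(word, "")
    # Collapse the runs of spaces left behind by the removals
    while "  " in text:
        text = text.replace("  ", " ")
    return text


blacklist = ["the", "to", "are", "for", "this"]
sample = "the other tomato is for this care package"  # made-up test sentence

# "other" loses "the", "tomato" loses both "to"s, "care" loses "are"
print(repr(remove_substrings(sample, blacklist)))  # prints ' or ma is c package'
```

Words such as "other", "tomato" and "care" are damaged here even though none of them is a stop word.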

Edit: To avoid removing stop words that appear inside other words (for example "the" inside "other"), match each stop word together with the spaces around it. This assumes the word sits in the middle of a sentence:

blacklist = ["the", "to", "are", "for", "this"]  # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(" " + word + " ", " ")
# A stop word may also be followed by punctuation (. ' ! ? etc.) instead of a space

A check for double spaces is not required here.
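
The space-delimited version still misses stop words at the very start or end of the text, or next to punctuation. A minimal sketch of one way to cover those cases while still using only str.replace() follows. It assumes that punctuation can simply be turned into spaces (so "don't" becomes "don t"), and the function name remove_stop_words is hypothetical.

```python
import string


def remove_stop_words(text, blacklist):
    """Remove whole-word stop words using only str.replace()."""
    # Assumption: punctuation carries no meaning for the word count,
    # so every punctuation character is replaced by a space.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # Lowercase, normalise whitespace, and pad both ends with a space so
    # that the first and last words are also surrounded by spaces.
    text = " " + " ".join(text.lower().split()) + " "
    for word in blacklist:
        target = " " + word + " "
        # Loop because adjacent repeats ("the the") share a space and
        # are not both matched in a single replace() pass.
        while target in text:
            text = text.replace(target, " ")
    return text.strip()


blacklist = ["the", "to", "are", "for", "this"]
line = "The only words to count are Apple and Grapes for this text."
print(remove_stop_words(line, blacklist))
# prints: only words count apple and grapes text
```

The returned string can be passed to split() and counted exactly as in the question's code.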

Hope this helps!
