JohnnyHunter - 28 days ago
Python Question

Python cannot read a file which contains a specific string

I've written a function to remove certain words and characters from a string. The string is read into the program from a file. The program works fine except when the file contains the following text anywhere in its body:

Security Update for Secure Boot (3177404) This security update
resolves a vulnerability in Microsoft Windows. The vulnerability could
allow Secure Boot security features to be bypassed if an attacker
installs an affected policy on a target device. An attacker must have
either administrative privileges or physical access to install a
policy and bypass Secure Boot.

I've never experienced such weird behavior. Anybody have any suggestions?

This is the function I've written.

def scrub(file_name):
    try:
        file = open(file_name,"r")
        # Assumed: the whole file is read into one string.
        unscrubbed_string = file.read()

        cms = open("common_misspellings.csv","r")
        for line in cms:
            replacement = line.strip('\n').split(',')
            while replacement[0] in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(replacement[0],replacement[1])

        special_chars = ['.',',',';',"'","\""]

        for char in special_chars:
            while char in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(char,"")

        unscrubbed_list = unscrubbed_string.split()

        noise = open("noise.txt","r")
        noise_list = []

        # Assumed loop bodies: collect the noise words, then remove each one.
        for word in noise:
            noise_list.append(word.strip('\n'))

        for noise in noise_list:
            while noise in unscrubbed_list:
                unscrubbed_list.remove(noise)
        return unscrubbed_list

    except IOError:  # assumed: the print below handles a failed open()
        print("""[*] File not found.""")

jez

Your code may be hanging because your .replace() call is inside a while loop. If, for any particular line of your .csv file, the replacement[0] string is a substring of its corresponding replacement[1], and if either of them appears in your critical text, then every replacement puts the target straight back into the string and the while loop will never finish. In fact, you don't need the while loop at all: a single .replace() call will replace all occurrences.
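
A minimal sketch of that failure mode, assuming a hypothetical CSV row ('boot', 'booth') that is not taken from the real common_misspellings.csv. The helper replace_with_while and its max_passes cap are illustrative only; the cap exists so the demonstration terminates instead of hanging.

```python
def replace_with_while(text, target, replacement, max_passes=5):
    """Mimic the question's while loop, but give up after max_passes.

    Returns (new_text, finished); finished is False when the loop
    would otherwise have kept running.
    """
    passes = 0
    while target in text:
        if passes == max_passes:
            return text, False
        text = text.replace(target, replacement)
        passes += 1
    return text, True


# Hypothetical row: the target 'boot' is a substring of 'booth'.
result, finished = replace_with_while("Secure boot", "boot", "booth")
print(finished)  # False: 'boot' is still present after every pass
print(result)    # the text has grown by one 'h' per pass

# A single call replaces every occurrence and always terminates.
single = "boot boot".replace("boot", "booth")
print(single)
```

Without the cap, the first call would loop forever, which matches the hang described in the question.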

But that's only one example of the problems you'll encounter with your current approach of using a blanket unscrubbed_string.replace(...). You'll need to either use regular-expression substitution (from the re module) or break your string down into words yourself and work word by word instead. Why? Well, here's a simple example: 'Teh' needs to be corrected to 'The', but what if the document contains a reference to 'Tehran'? Your "Secure Boot" text will contain an example analogous to this.
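
The word-by-word alternative could look roughly like the sketch below. The function name correct_words and the corrections dictionary are hypothetical; in the question the pairs would come from common_misspellings.csv. The sketch only handles trailing punctuation, and it collapses runs of whitespace because it splits and rejoins on spaces.

```python
import string

# Hypothetical mapping of misspelling -> correction.
corrections = {"Teh": "The", "teh": "the"}


def correct_words(text, corrections):
    """Correct whole words only, leaving longer words such as 'Tehran' alone."""
    fixed = []
    for word in text.split():
        # Split off trailing punctuation so 'Teh,' is looked up as 'Teh'.
        core = word.rstrip(string.punctuation)
        tail = word[len(core):]
        fixed.append(corrections.get(core, core) + tail)
    return " ".join(fixed)


print(correct_words("Teh capital of Iran is Tehran.", corrections))
```

Because each word is looked up as a whole, 'Tehran' never matches the 'Teh' entry.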

If you go the regular-expression route, the symbol \b solves this by matching word boundaries of any kind (start or end of string, spaces, punctuation). Here's a simplified example:

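import re

replacements = {
    'Teh': 'The',  # the misspelling discussed above
}
unscrubbed = 'Teh capital of Iran is Tehran. Teh capital of France is Paris.'

better = unscrubbed
naive = unscrubbed
for target, replacement in replacements.items():
    # Plain substring replacement also rewrites the 'Teh' inside 'Tehran'.
    naive = naive.replace(target, replacement)

    # \b restricts the match to whole words; re.escape keeps any regex
    # metacharacters in the target literal.
    pattern = r'\b' + re.escape(target) + r'\b'
    better = re.sub(pattern, replacement, better)

# Assumed: the three versions are printed for comparison.
print(unscrubbed, '(unscrubbed)')
print(naive, '(naive)')
print(better, '(better)')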


Output (the mistakes are 'Teh' in the unscrubbed line and 'Theran' in the naive line):

Teh capital of Iran is Tehran. Teh capital of France is Paris. (unscrubbed)

The capital of Iran is Theran. The capital of France is Paris. (naive)

The capital of Iran is Tehran. The capital of France is Paris. (better)
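
Putting the pieces together, the scrubbing steps from the question might be restructured along these lines. This is a sketch rather than a drop-in replacement: scrub_text and its parameters are hypothetical names, and it takes the text, the corrections, and the noise words as arguments instead of reading the three files.

```python
import re

# The special characters listed in the question.
SPECIAL_CHARS = '.,;\'"'


def scrub_text(text, corrections, noise_words):
    """Correct misspellings, strip punctuation, then drop noise words."""
    # Whole-word corrections; one re.sub call per pair, no while loop.
    for target, replacement in corrections.items():
        pattern = r'\b' + re.escape(target) + r'\b'
        # A function replacement avoids backslash handling in the new text.
        text = re.sub(pattern, lambda match: replacement, text)

    # Delete all special characters in a single pass.
    text = text.translate(str.maketrans('', '', SPECIAL_CHARS))

    # Filter out noise words; a set makes each membership test cheap.
    noise = set(noise_words)
    return [word for word in text.split() if word not in noise]


words = scrub_text(
    'Teh capital of Iran is Tehran.',
    {'Teh': 'The'},
    ['of', 'is'],
)
print(words)  # ['The', 'capital', 'Iran', 'Tehran']
```

None of the steps loops until a condition changes, so the function terminates no matter what the correction pairs contain.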