Robert Herrera - 2 months ago
Python Question

Python: Replacing character error with read in file

Goal: I just want to strip out the comma, as it is the only character that will break my (course-required) file parsing for Bayesian analysis, i.e. I want (word,2,4) instead of, say, (word,,2,4).

So I'm currently trying to read in an email, stored as a text file from the public Enron corpus online, in order to build a Bayesian spam filter.

I've noticed that reading in some of the files raises errors when I try to manipulate the strings they contain. I am fully aware that some of these files contain viruses, so the encoding of some of the characters might not be valid. However, I'm simply trying to replace a comma within a string, and I'm getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 1169: ordinal not in range(128)

I have tried everything this forum has to offer, and I've searched everywhere for a solution. For example, I tried:

with open(file+file_path_stings[i],'r') as filehandle:
    words = str(filehandle.read())
    words = words.replace(',','')
    words = words.split()


I've also tried many regex attempts... this is one of the versions:

with open(file+file_path_stings[i],'r') as filehandle:
    words = str(filehandle.read())
    words = re.sub(',','',words)
    words = words.split()


Now, I could simply write a regex that only lets A-Za-z through, but I'm noticing that doing so heavily affects spam accuracy, since a lot of the spam files contain such special characters.

Any suggestion would be most appreciated. Thanks.

-Robert

Answer

If you just want to remove the extra comma and, as you said, nothing else is working, you can use a simple split and join (assuming the comma is the only delimiter here):

','.join([s for s in 'word,,2,4'.split(',') if s])
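
To make the one-liner above easier to reuse, here is a minimal sketch that wraps it in a function. The names drop_empty_fields and words_without_commas are hypothetical and not from the question. The second helper is an assumption on my part about what may help with the traceback: it works on raw bytes (e.g. from a file opened in 'rb' mode), so no decoding step is involved and bytes such as 0xc1 pass through untouched.

```python
def drop_empty_fields(line, sep=','):
    """Collapse repeated separators by discarding empty fields."""
    return sep.join([field for field in line.split(sep) if field])


def words_without_commas(raw):
    """Remove commas from raw bytes and split on whitespace, without decoding."""
    return raw.replace(b',', b'').split()


print(drop_empty_fields('word,,2,4'))  # word,2,4

# Hypothetical sample bytes containing a non-ASCII byte (0xc1).
print(words_without_commas(b'caf\xc1, free,, offer'))
```

Note that the split and join also drops leading and trailing empty fields, so ',word,,2,' becomes 'word,2'. That is fine if a field is never meant to be empty, but it would shift columns otherwise.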