I have an encoding problem in Python (IPython notebook). Problems of this kind are very common and usually simple, but I still cannot fix this one.
I have a CSV file here; as you can see, it contains many '\xa0' and '\n' characters.
with io.open(train_fname) as f:
    for line in f:
        line = line.encode("ascii", "replace")
Imagine being able say, you know what, no sanctions, no forever hearings on IEAA regulations, no more hiding\xa0under\xa0the pretense of friendly nuclear energy. \xa0You have 2 days to; \xa0i.e. \xa0let in the inspectors, quit killing the civilians.
line.replace(u"\xa0", " ")
The \xa0 that you see is a sequence of 4 characters: \, x, a, 0. All these characters are plain ASCII, so there is no character set problem here.
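As a quick check of that claim, here is a small sketch (the variable names are made up for illustration) comparing the literal four-character text with a real non-breaking space character:

```python
# Hypothetical names, for illustration only.
literal = r"\xa0"   # backslash, x, a, 0: the text that appears in the file
nbsp = u"\xa0"      # one real non-breaking space character (U+00A0)

print(len(literal))  # 4
print(len(nbsp))     # 1

# Every character of the literal text is in the ASCII range.
print(all(ord(c) < 128 for c in literal))  # True

# Replacing the real NBSP leaves the literal text untouched.
print(literal.replace(nbsp, " ") == literal)  # True
```

This would explain why replacing u"\xa0" appears to have no effect on such a line.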
Apparently, you are supposed to interpret these escape sequences. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:
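# Either form matches the literal 4-character text \xa0.
# str.replace returns a new string, so assign the result.
line = line.replace("\\xa0", " ")
line = line.replace(r"\xa0", " ")  # raw string, same pattern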
The r in front of the string makes it a raw string: each character is taken literally, even a backslash.
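Putting this together, a minimal sketch of a cleanup helper might look like the following. The name `clean_line` and the sample text are assumptions for illustration, and it assumes the file really contains the literal text \xa0 and \n rather than the characters themselves. Note that `str.replace` returns a new string, so the result has to be assigned or returned.

```python
def clean_line(line):
    # Hypothetical helper: replace the literal escape text (\xa0, \n)
    # and any real non-breaking space with an ordinary space.
    for token in (r"\xa0", r"\n", u"\xa0"):
        line = line.replace(token, " ")
    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join(line.split())


sample = r"no more hiding\xa0under\xa0the pretense. \xa0You have 2 days"
print(clean_line(sample))
# no more hiding under the pretense. You have 2 days
```

Inside the loop from the question, this would be used as `line = clean_line(line)` before any further processing.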