Dexter Ju - 6 months ago
Python Question

Python: Got \xa0 instead of space in CSV and cannot remove or convert

I have an encoding-related problem in Python (IPython notebook). This kind of problem seems to be very common and simple, but I still cannot fix it.

I have a CSV file here. As you can see, it contains many '\xa0' and '\n' sequences.

I used

import io

with io.open(train_fname) as f:
    for line in f:
        line = line.encode("ascii", "replace")


But it does not work; I always get the following output:


Imagine being able say, you know what, no sanctions, no forever hearings on IEAA regulations, no more hiding\xa0under\xa0the pretense of friendly nuclear energy. \xa0You have 2 days to; \xa0i.e. \xa0let in the inspectors, quit killing the civilians.


I tried other methods like

line.replace(u"\xa0", " ")

That does not work either. I also tried opening this CSV file with all kinds of encodings in my text editor, Sublime Text.
I tried windows-1252, utf-8 and all the other encodings, but I always see \xa0 in my text editor when viewing this CSV file.

Does this mean that \xa0 is already written in the CSV file as literal input text, and that this is not a Python encoding problem? If so, why can't I simply use the replace method to replace this string? Which encoding does \xa0 indicate the file is in? Does it mean the file was written in utf-8 but I tried to open it as ascii or something else?

I searched many questions, but they don't seem to provide much help. Please ask if my question is not clear.
Thank you very much!


Answer

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All of these characters are plain ASCII, so there is no character set problem here.
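To make the distinction concrete, here is a small sketch (Python 3 syntax assumed; `describe` is a hypothetical helper, not something from the question) comparing the four-character text with a real non-breaking space, U+00A0:

```python
# Literal text: backslash, x, a, 0 (what the file appears to contain).
escaped = r"\xa0"
# A real non-breaking space, U+00A0: one single character.
nbsp = "\xa0"


def describe(s):
    """Return the length of s and whether every character is ASCII."""
    return len(s), all(ord(ch) < 128 for ch in s)


print(describe(escaped))  # (4, True)
print(describe(nbsp))     # (1, False)
```

If a line read from the file behaves like the first case, the escape sequence is stored in the file as ordinary text.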

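Apparently, you are supposed to interpret these escape sequences yourself. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. In your attempt, u"\xa0" is parsed as the single character U+00A0 (a non-breaking space), which does not occur in the file, so nothing gets replaced. To match a real backslash in an ordinary string literal, it has to be written \\. So try this: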

line.replace("\\xa0", " ")

or:

line.replace(r"\xa0", " ")

The r in front of the string makes it a raw string literal, in which a backslash is taken literally instead of starting an escape sequence.
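Putting this together, a minimal sketch might look like the following. It assumes Python 3 syntax and that the lines have already been read as text; `clean_line` and `sample` are hypothetical names used only for illustration. It also replaces a real U+00A0, in case the file turns out to contain both forms.

```python
def clean_line(line):
    """Replace the literal text \\xa0 and any real U+00A0 with a space."""
    # str.replace returns a new string, so the result must be kept.
    line = line.replace(r"\xa0", " ")  # backslash, x, a, 0 as plain ASCII
    line = line.replace("\xa0", " ")   # actual non-breaking space
    return line


# Excerpt modeled on the output shown in the question.
sample = r"hiding\xa0under\xa0the pretense. \xa0You have 2 days"
print(clean_line(sample))
```

Note that calling line.replace(...) on its own discards the result; inside the loop over the file, the return value has to be assigned back or collected.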
