Dexter Ju - 1 year ago
Python Question

Python: Got \xa0 instead of a space in CSV and cannot remove or convert it

I have an encoding problem in Python (IPython notebook). This kind of problem seems to be very common and simple, but I still cannot fix it.

I have a CSV file that contains many '\xa0' and other '\n' characters.

I used

with open(csv_path) as f:  # csv_path is a placeholder; the original open() call is missing
    for line in f:
        line = line.encode("ascii", "replace")

But it does not work; I always get the following output:

Imagine being able say, you know what, no sanctions, no forever hearings on IEAA regulations, no more hiding\xa0under\xa0the pretense of friendly nuclear energy. \xa0You have 2 days to; \xa0i.e. \xa0let in the inspectors, quit killing the civilians.

I tried other methods like

line.replace(u"\xa0", " ")

That does not work either. I also tried opening this CSV file in my text editor, Sublime Text, with all kinds of encodings.
I tried windows-1252, utf-8 and others, but I always see \xa0 in my text editor when viewing this CSV file.

Does this mean the \xa0 is already written in this CSV file as input text, and that it is not a Python encoding problem? If that is the case, why can't I use the replace method to simply replace this string? Which encoding does the \xa0 indicate the file is in? Does it mean the file was written in utf-8 but I tried to open it as ascii or something else?

I searched many questions, but they don't seem to provide much help. Please ask if my question is not clear.
Thank you very much!


Answer

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All these characters are plain ASCII, so no character set problem here.
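To make the distinction concrete, here is a small sketch (the variable names are only for illustration) comparing the single non-breaking space character that Python produces from the escape "\xa0" with the four-character text that is in the file. It also shows why a replace that looks for the single character, as in the question, leaves the text unchanged.

```python
# In Python source, "\xa0" is an escape sequence for ONE character (U+00A0).
nbsp = "\xa0"
# The file contains FOUR plain ASCII characters: backslash, x, a, 0.
literal = "\\xa0"

print(len(nbsp))     # length of the real non-breaking space
print(len(literal))  # length of the literal text found in the file
print(all(ord(c) < 128 for c in literal))  # every character is ASCII

# Searching for the single character finds nothing in the literal text,
# so the string comes back unchanged.
sample = "hiding\\xa0under"
unchanged = sample.replace("\xa0", " ")
print(unchanged == sample)
```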

Apparently, you are supposed to interpret these escape sequences. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:

line = line.replace("\\xa0", " ")

or

line = line.replace(r"\xa0", " ")

The r in front of the string means to interpret each character literally, even a backslash.
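As a rough, self-contained sketch of the whole fix: clean_line is a hypothetical helper name, and the sample string is shortened from the output in the question. Note that replace returns a new string, so the result has to be assigned or returned.

```python
# Hypothetical helper, not part of any library.
def clean_line(line):
    # r"\xa0" is the four characters backslash, x, a, 0.
    # str.replace returns a new string; the original is not modified.
    return line.replace(r"\xa0", " ")

# Both spellings denote the same four-character string.
assert r"\xa0" == "\\xa0"

# A raw string here stands in for a line read from the CSV file.
sample = r"no more hiding\xa0under\xa0the pretense. \xa0You have 2 days"
cleaned = clean_line(sample)
print(cleaned)
```

If the file also contains the two-character text \n rather than real newlines, the same approach with r"\n" should apply.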