user40037 - 24 days ago
Python Question

Unicode CSV Python

I can't get this right. I have a CSV file that contains characters which are already encoded (I made a smaller CSV file to test with, but the original is much longer):

Isten H\xe1ta M\xf6g\xf6tt

Sigur R\xf3s

\xd3lafur

I can't get these strings decoded. I tried simply reading each line and calling line.decode('latin1'), but that doesn't seem to work. When I looked at the raw string, I noticed that the characters are escaped with an extra backslash. So I tried applying unicode-escape to the raw string before decoding; this doesn't seem to work either. The string stays the way it is (although the extra backslash was removed from the raw string).

When I hard-code a manual list with the example items, then the decoding works and I get the right characters back.

So it only fails when I read the data from a CSV file. Does anybody have an idea where it goes wrong?

Answer

Characters have different representations in memory and in a file. A string can be encoded in several ways, including Latin-1 or UTF-8, but in this case, where we see a literal \xf6, what we have is a string that has been escaped: the file holds the four characters \, x, f and 6 rather than a single encoded byte. We can fix that by decoding the escapes:

>>> print open('data.csv').readline().decode('string_escape')
Isten H�ta M�g�tt

But that only gets us halfway: the result is still a Latin-1 encoded byte string. Now add a second decode:

>>> print open('data.csv').readline().decode('string_escape').decode('latin1')
Isten Háta Mögött
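
The `string_escape` codec and `str.decode` used above exist only in Python 2. For Python 3, a rough equivalent is sketched below. It assumes every field is pure ASCII containing literal `\xNN` escapes, as in the sample data; the helper name `unescape` and the in-memory `sample` standing in for the CSV file are made up for illustration. The `unicode_escape` codec maps `\xNN` straight to the code points U+0000 to U+00FF, which coincide with Latin-1, so a single decode covers both steps.

```python
import csv
import io

def unescape(field):
    # Hypothetical helper: turn literal \xNN escapes into real characters.
    # Assumes the field is pure ASCII; encode('ascii') raises otherwise.
    # Note that unicode_escape also interprets \n, \t, \uXXXX and so on.
    return field.encode('ascii').decode('unicode_escape')

# Stand-in for the CSV file: each backslash here is a literal character.
sample = io.StringIO('Isten H\\xe1ta M\\xf6g\\xf6tt\nSigur R\\xf3s\n\\xd3lafur\n')

# csv.reader leaves backslashes alone by default (no escapechar is set).
rows = [[unescape(field) for field in row] for row in csv.reader(sample)]
print(rows)
```

With a real file, `sample` would be replaced by something like `open('data.csv', newline='')`.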

Got it! The problem is in whatever wrote the file.
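
The question does not say what produced the file, so the following is only a guess at the kind of thing that can go wrong, written as a Python 3 sketch. It contrasts escaping a string (which yields the backslash sequences seen in the CSV) with encoding it (which yields one byte per accented character in Latin-1).

```python
text = 'Sigur R\xf3s'  # the real string, with a single o-acute character

# Escaping: the accented character becomes the four ASCII bytes \ x f 3.
escaped = text.encode('unicode_escape')

# Encoding: the accented character becomes the single byte 0xf3.
encoded = text.encode('latin1')

print(escaped, len(escaped))
print(encoded, len(encoded))

# An encoded string needs only one decode to round-trip.
assert encoded.decode('latin1') == text
```

A writer that encodes rather than escapes would leave the reader with a single decode to do.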