I can't get this right. I have a CSV file that already contains escaped characters (I made a smaller CSV file to test with; the original is much longer):
Isten H\xe1ta M\xf6g\xf6tt
I can't get these strings decoded. I tried simply reading the line and calling line.decode('latin1'), but that doesn't seem to work. When I looked at the raw string, I noticed that the characters are escaped with an extra backslash. So I tried applying unicode-escape to the raw string before decoding; this doesn't seem to work either. The string stays the way it is (although the extra backslash was removed from the raw string).
When I hard-code a manual list with the example items, then the decoding works and I get the right characters back.
So it only fails when I read the data in from a CSV file. Does anybody have an idea where it goes wrong?
Characters have different representations in memory and in a file. A string can be encoded in several ways, including latin-1 or utf-8, but in this case, where we see a literal \xf6, what we have is a string that has been escaped. We can fix that by decoding the escapes:
>>> print open('data.csv').readline().decode('string_escape')
Isten H�ta M�g�tt
But that only gets us halfway: the result is still a latin-1 encoded byte string. Now decode a second time:
>>> print open('data.csv').readline().decode('string_escape').decode('latin1')
Isten Háta Mögött
Got it! The problem is in whatever wrote the file.
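The snippets above are Python 2; the string_escape codec and str.decode are not available in Python 3. A minimal sketch of what I'd try on Python 3, assuming the file holds only ASCII text plus literal \xNN escapes (the helper name unescape_latin1 and the inline sample line are made up for illustration, in place of reading data.csv):

```python
import codecs

# Hypothetical sample standing in for one line of the CSV file. Each "\xe1"
# here is four literal characters (backslash, x, e, 1), not a single byte.
raw_line = r"Isten H\xe1ta M\xf6g\xf6tt"


def unescape_latin1(text):
    # Hypothetical helper, not part of any library.
    # unicode_escape turns \xNN into the code point U+00NN. For 0x00-0xFF
    # those code points coincide with latin-1, so a single decode covers
    # both the unescaping step and the latin-1 step.
    # Assumption: `text` contains no characters outside latin-1; otherwise
    # the encode() call raises UnicodeEncodeError.
    return codecs.decode(text.encode("latin-1"), "unicode_escape")


decoded = unescape_latin1(raw_line)
print(decoded)
```

With a real file you would call the helper on each line (or each CSV field) after reading it as text. If the escapes in your data actually stand for utf-8 bytes rather than latin-1 characters, this one-step shortcut would not be right and the two steps would have to be done separately.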