luisfego - 1 year ago
Python Question

Mixed encoding in csv file

I have a fairly large dataset (10,000+ records with about 120 variables each) in R. The problem is that about half of the columns in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, which turns non-ASCII (mainly Latin) characters into mojibake, e.g. "PÃºblica" instead of "Pública".

I cannot simply change the file encoding, because then the other half would be decoded with the wrong one. Furthermore, I have no way of knowing which columns were encoded correctly and which ones weren't, and all I have is the original .csv file, which I'm trying to fix.

So far I have found that misinterpreted characters (bad Unicode) in a UTF-8 plain text file can often be detected and repaired. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far haven't had success:

import ftfy

file = open("file.csv", "r", encoding="UTF8")
content = file.read()

content = ftfy.fix_text(content)

will show exactly the same text as before. I believe this has to do with the way ftfy infers the encoding of the content.

Nevertheless, if I run

ftfy.fix_text("PÃºblica que cotiza en MÃ©xico")

it will show the right response:

>> 'Pública que cotiza en México'
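For context, that kind of mojibake is exactly what appears when UTF-8 bytes are decoded as Windows-1252, and for this specific mixup it can even be reversed with the standard library alone (a minimal sketch, independent of ftfy):

```python
# UTF-8 bytes misread as Windows-1252 produce the mojibake seen above
text = "Pública que cotiza en México"
mojibake = text.encode("utf-8").decode("windows-1252")
print(mojibake)  # PÃºblica que cotiza en MÃ©xico

# Reversing the misdecoding recovers the original text
fixed = mojibake.encode("windows-1252").decode("utf-8")
print(fixed)     # Pública que cotiza en México
```

ftfy does essentially this, but guesses per string whether the round trip is warranted.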

I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file, try to fix each one with ftfy, and then import the file back into R, but that seems a little complicated.

Any suggestions?

Answer Source

In fact, there was mixed encoding in random cells in several places. Probably there was an issue when the data was exported from its original source.

The problem with ftfy is that it processes the file line by line, and if it encounters well-formed characters, it assumes that the whole line is encoded the same way and that the strange characters were intentional.

Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process the file cell by cell. Fortunately, Python's standard library provides the csv module, which makes working with csv files painless (especially because it escapes cells correctly).
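The cell-by-cell pattern can be sketched with the standard library alone. Here a hypothetical `fix_cell` helper stands in for `ftfy.fix_text`: it assumes the specific UTF-8-read-as-Windows-1252 mixup described above and leaves anything else untouched, and the `io.StringIO` buffers (with an invented `empresa` column) stand in for the real files:

```python
import csv
import io

def fix_cell(s):
    # Hypothetical stand-in for ftfy.fix_text: undo a UTF-8-as-Windows-1252
    # misdecoding; if the round trip fails, the cell was fine already.
    try:
        return s.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

# In-memory stand-ins for the real input/output files
src = io.StringIO("empresa\nPÃºblica que cotiza en MÃ©xico\n")
out = io.StringIO()

reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    writer.writerow({col: fix_cell(val) for col, val in row.items()})

print(out.getvalue())  # empresa / Pública que cotiza en México
```

Note that this blind round trip can mangle rare cells that legitimately contain sequences like "Ã©"; ftfy's heuristics handle that case more carefully, which is why the actual fix below sticks with it.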

This is the code I used to process the file:

import csv
import ftfy
import sys

def main(argv):
    # input file (UTF-8 with mojibake in random cells)
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)

    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()

    # clean values cell by cell and write them out
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)

    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)

And then, calling the script above (saved here as fix_encoding.py):

$ python fix_encoding.py data.csv out.csv

will output a csv file with the right encoding.