I have a fairly large database in R (10,000+ records with about 120 variables each). The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but were being decoded as UTF-8, which turned non-ASCII (mainly Latin) characters into mojibake like the 'PÃºblica' example below. A first attempt was to read the whole file and run it through ftfy, which repairs exactly this kind of damage:
import ftfy

file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
ftfy.fix_text("PÃºblica que cotiza en MÃ©xico")
>> 'Pública que cotiza en México'
In fact, the encodings were mixed in random cells throughout the file; most likely something went wrong when the data was exported from its original source.
The problem with ftfy is that it processes the file line by line: if it encounters well-formed characters on a line, it assumes the whole line is encoded the same way and that the strange characters were intentional.
Since these errors appeared at random throughout the file, I couldn't just transpose the whole table and process every line (column), so the answer was to process the file cell by cell. Fortunately, Python's standard library ships a csv module that makes this painless (especially because it escapes cells correctly).
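As a side note on why the csv module matters here: it handles quoting and escaping for you, so cells containing commas or quotes survive a cell-by-cell rewrite intact. A minimal sketch with made-up values:

import csv
import io

# A comma and a quote inside a field would break naive string splitting,
# but csv.writer quotes them correctly.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Empresa, S.A.", 'dice "hola"', "México"])
print(buffer.getvalue())
# "Empresa, S.A.","dice ""hola""",México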
This is the code I used to process the file:
import csv
import ftfy
import sys

def main(argv):
    # input file (the mis-encoded csv, read as UTF-8)
    csvfile = open(argv[1], "r", encoding = "UTF8")
    reader = csv.DictReader(csvfile)

    # output stream
    outfile = open(argv[2], "w", encoding = "Windows-1252") # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames = reader.fieldnames, lineterminator = "\n")

    # clean values cell by cell
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)

    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
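To double-check the result, you can re-open the output with the encoding the script wrote it in and spot-check a few rows; a minimal sketch (the file name comes from the call above):

import csv

# Re-read the cleaned file with the same encoding the script wrote it in
# and print the first few rows to verify that accented characters
# (ú, é, ...) now show up correctly.
with open("out.csv", "r", encoding = "Windows-1252") as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:
            break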