I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)
I open the CSV using:
15 ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
name=school_name.encode('utf-8'), street=row.encode('utf-8'), city=row.encode('utf-8'), state=row.encode('utf-8'), zip5=row, zip4=row,county=row.encode('utf-8'), lat=row, lng=row)
Traceback (most recent call last):
File "push_into_db.py", line 80, in <module>
File "push_into_db.py", line 74, in main
district_map = buildDistrictSchoolMap()
File "push_into_db.py", line 32, in buildDistrictSchoolMap
county=row.encode('utf-8'), lat=row, lng=row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
Unicode is not equal to UTF-8. The latter is just an encoding for the former.
You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded String into a unicode string.
So just replace
.decode, and it should work (if your .csv is UTF-8-encoded).
Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)
If your input data is not UTF-8 encoded, then you have to
.decode() with the appropriate encoding, of course. If nothing is given, python assumes ASCII, which obviously fails on non-ASCII-characters.