
Python encoding problems with PostgreSQL Database

I'm working on a program where I want to compare addresses I read from a CSV file against a PostgreSQL database. (It's a plugin for QGIS.)
I can successfully establish a connection and also read data from the database, as long as I send queries without my own parameters.

So what I do:
I read a CSV file and store it in a list.
Then I select an output file.
Next I click a button that, on click, should compare the entries in the CSV file to entries in my database.
If an entry from the CSV file (postal code, town, address) has the exact same properties in the database, I write it into a list "Successful Matches"; if one doesn't match, I write it into a list "Error List".

My problem occurs when I execute a statement with my own parameters.
The SQL error message I get back says:


invalid byte sequence for encoding "UTF8": 0xdf 0x65


I think the error is in the first list I fill from the CSV file. My addresses have special characters like öäüß...

Here is the code that is used:

This method writes the successfully matched addresses to a file and the failed ones to a lineEdit:

def write_output_file(self):
    compare_input_with_database()
    try:
        with open(self.outputfile, 'wb') as csvfile:
            writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
            for row in geocoded_list:
                writer.writerow(row)
        if len(error_list) > 0:
            self.writefailedaddresses()
            raiseInformation("Es konnten nicht alle Adressen geocodiert werden!")
        else:
            raiseInformation("Adressen erfolgreich geocodiert!")
    except csv.Error:
        raiseException("Fehler beim schreiben der Datei")



This method compares a row entry from the list/CSV file to the database:


def compare_input_with_database():
    dbcursor = database_connection.open_connection()
    for row in addressList:
        entry = str(row[0])
        addresssplit = entry.split(';')
        try:
            resultset = database_connection.select_specific_address(dbcursor, int(addresssplit[0]), addresssplit[1], addresssplit[2])
            geocoded_list.append(resultset)
        except psycopg2.DatabaseError, e:
            raiseException(e)
            error_list.append(addresssplit)
    database_connection.close_connection()

def select_specific_address(cursor, plz, town, address):
    cursor.execute("SELECT plz,ort,strasse,breitengrad,laengengrad from addresses where plz=%s AND ort=%s AND strasse=%s", (plz, town, address))
    resultset = cursor.fetchone()
    return resultset



This method reads a CSV file and loads its rows into a list:

def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            with open(addressfile, 'rb') as csvfile:
                filereader = csv.reader(csvfile, delimiter=';')
                for row in filereader:
                    addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1


Thanks!

EDIT:
When I display the address containing the special character, it shows as "Hauptstra\xdfe" instead of "Hauptstraße".
Sorry, I'm bad with encoding. Is this Unicode?
Does that mean it gets sent to the database like that, and I need to encode it differently?

EDIT 2:
I took a look at the workaround and tried to implement it:

def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            #with open(addressfile, 'rb') as csvfile:
            #filereader = csv.reader(csvfile, delimiter=';')
            reader = unicode_csv_reader(open(addressfile))
            for row in reader:
                addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]


But now I get the following error message when executing the code:

      for row in reader:
    File "C:/Users/Constantin/.qgis2/python/plugins\Geocoder\logic.py", line 46, in unicode_csv_reader
      yield [unicode(cell, 'utf-8') for cell in row]
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 19: invalid continuation byte

I just don't get why it can't decode it -.-

Answer

The syntax of your except implies you are using Python 2. But since you are using non-ASCII (Unicode) characters in your strings, Python 3 is a dramatically better choice. You seem to be working in German, so at least a few ü and ß are going to sneak in.

In Python 3, the reading is just:

rows = []
with open('test.csv', encoding='utf-8', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        rows.append(row)

And writing:

with open('testout.csv', mode='w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)

Note that the read/write is not binary, and encoding is handled as a matter of course. And the results are exactly what you'd expect. Something like:

Artikelname;Menge
Äpfel;3
Bäume;12

With all characters properly encoded. On disk, the data are UTF-8 encoded. In memory, full Unicode.
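To see the whole round trip in one self-contained place, here is a Python 3 sketch (it uses a temp file rather than your real paths, and the sample rows above):

```python
import csv
import os
import tempfile

rows = [['Artikelname', 'Menge'], ['Äpfel', '3'], ['Bäume', '12']]
path = os.path.join(tempfile.mkdtemp(), 'test.csv')

# Write: text in, UTF-8 bytes on disk.
with open(path, mode='w', encoding='utf-8', newline='') as f:
    csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Read: UTF-8 bytes on disk, full Unicode text back in memory.
with open(path, encoding='utf-8', newline='') as f:
    assert list(csv.reader(f, delimiter=';')) == rows
```

No manual encode/decode calls anywhere; the `encoding='utf-8'` argument to `open()` handles both directions.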

If you can't use Python 3, then you must take great care that Unicode characters are properly encoded and decoded, especially at I/O boundaries--e.g. when reading and writing data, or communicating with an external system like PostgreSQL. There are at least a few ways the code as it stands does not take care. The use of str() to cast what are not necessarily ASCII characters, for example, and the lack of UTF-8 encoding.
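The discipline at those boundaries looks like this (shown in Python 3 bytes/str terms for brevity; in Python 2 the corresponding types are str and unicode):

```python
# Decode bytes to text at the input boundary, work on text in the middle,
# and encode text back to bytes at the output boundary.
raw = b'Hauptstra\xc3\x9fe'   # UTF-8 bytes, e.g. straight from a file
text = raw.decode('utf-8')    # in-memory Unicode: 'Hauptstraße'
assert text == 'Hauptstraße'

out = text.encode('utf-8')    # back to bytes for the file or the database
assert out == raw
```

Everything between decode and encode operates on real text, so comparisons and database parameters behave correctly regardless of which special characters appear.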

Unfortunately, there is no trivial fix in Python 2. Python 2's CSV module is schrecklich kaput. From the docs: "The csv module doesn’t directly support reading and writing Unicode." In 2016?! There are, however, workarounds. One of the recipes is right there in the docs. This Stack Overflow answer restates it, and provides several other alternatives.

So, use Python 3 if you can. It will simplify this and many other non-ASCII I/O issues. Otherwise, deploy one of the CSV workarounds in that other SO answer.

Update

If you are having trouble using the workarounds, I don't like the standard answer either. Here is a non-canonical reader solution that works for me:

import csv

rows = []
with open('test.csv', mode='rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        urow = [unicode(cell, 'utf-8') for cell in row]
        rows.append(urow)

print rows

This is decidedly non-portable to Python 3, but it works and doesn't require importing any other modules. Note, if you're using IPython/Jupyter or the interactive Python console ("REPL"), you are going to see strings at a low level, such as:

[ [u'Artikelname', u'Menge'],
  [u'\xc4pfel', u'3'],
  [u'B\xe4ume', u'12'] ]

So instead of a nice, neat Ä, the string has \xc4. It's annoying, but it's not wrong. Entering u'\u00c4pfel' gives you the same thing. And it's easy to confirm that c4 is the correct Unicode code point. Python 2 is just doing a poor job of dealing with non-ASCII characters. Just one of 4,094 reasons to use Python 3 when you can.
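You can confirm the equivalence yourself; these assertions hold in Python 3 too, which still accepts the u'' prefix:

```python
# '\xc4' and '\u00c4' are two spellings of the same code point, U+00C4,
# which is the character Ä.
assert u'\xc4' == u'\u00c4' == u'Ä'
assert ord(u'Ä') == 0xC4
assert u'\xc4pfel' == u'Äpfel'
```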

The manual equivalent for output, btw:

with open('testout.csv', mode='wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        urow = [cell.encode('utf-8') for cell in row]
        writer.writerow(urow)

All of this depends utterly on your input files truly being in UTF-8 encoding. If they're in anything else, that opens another horrible kettle of rotting fish.
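For what it's worth, the 0xdf in your error messages is a strong hint about which kettle you're in: 0xdf is ß in Latin-1 (and cp1252), and 0xdf followed by a plain ASCII letter is never valid UTF-8. A quick diagnostic sketch, in Python 3 syntax:

```python
raw = b'Hauptstra\xdfe'   # the byte sequence from the error messages above

# Decoding as UTF-8 fails exactly the way the traceback shows:
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    pass  # 0xdf would need a continuation byte, but 0x65 ('e') follows

# Decoding as Latin-1 succeeds, because 0xdf is 'ß' there:
assert raw.decode('latin-1') == 'Hauptstraße'
```

If that matches your data, the fix is to read the source file with the encoding it actually has (latin-1/cp1252) and encode to UTF-8 only when talking to the database.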