Mathias - 4 months ago
Python Question

UTF-8 Python issues with Google Datastore

I've been roaming around these forums asking questions about issues related to Python and UTF-8 encoding/decoding.

This time around I've stumbled upon something which initially seemed an easy problem.

In my previous question I asked how to ensure proper addition of UTF-8 strings to variables:

Messages.append(ChatMessage(chatter, msg))

The solution was something along those lines:

Messages.append(ChatMessage(chatter.encode( "utf-8" ), msg.encode( "utf-8" )))

Pretty simple.

However, now I am faced with the challenge of sending the data to the Google App Engine Datastore. The code from the book I was using (Code in the Cloud) looked as follows (I skipped the redundant parts):

#START: ChatMessage
class ChatMessage(db.Model):
    user = db.StringProperty(required=True)
    timestamp = db.DateTimeProperty(auto_now_add=True)
    message = db.TextProperty(required=True)

    def __str__(self):
        return "%s (%s): %s" % (self.user, self.timestamp, self.message)
#END: ChatMessage

# START: PostHandler
class ChatRoomPoster(webapp.RequestHandler):
    def post(self):
        chatter = self.request.get("name")
        msgtext = self.request.get("message")
        msg = ChatMessage(user=chatter, message=msgtext)
        msg.put() #<callout id="co.put"/>
# END: PostHandler

I thought that swapping part of the PostHandler for the following bit:

msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))

... would do the trick. Unfortunately, that did not happen. I still keep getting

File "/base/data/home/apps/s~markcc-chatroom-one-pl/1.353054484690143927/", line 147, in post
msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

Naturally, I declared the (# -*- coding: utf-8 -*-) statement and put:

self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'

in the file. It does nothing to alleviate the issue.

As you can see, I am not very well-versed in Python, and encoding/decoding problems are a bit of a novelty for me. I would appreciate your assistance. Could anyone explain where I went wrong in this case, and what practices I should use to avoid similar quandaries in the future? Thank you in advance.


encode turns unicode into bytes, and decode turns bytes into unicode. You have to be careful not to mix the two. Your error means either:

  1. chatter or msgtext is already bytes, and you are trying to encode it. One of the worst 'features' of Python 2 is that it lets you do this: it first tries to decode the bytes using ascii (the most limited encoding), and then re-encodes them with whatever you've asked for. Any non-ASCII byte makes that implicit decode fail with a UnicodeDecodeError like yours. This is fixed in Python 3, but you can't use that on App Engine.

  2. App Engine expects to store unicode (it does). So you need to pass it a unicode string without encoding it. In fact, if your data is already in a bytestring, you would need to decode it before you can store it.
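To make the implicit step described in the list above visible, here is a minimal sketch. The helper py2_style_encode is hypothetical (it is not part of Python or App Engine); it simply writes out the ascii decode that Python 2 performs silently when .encode is called on a byte string, so the snippet behaves the same on Python 2 and Python 3. The sample bytes are an assumption, chosen because they start with the 0xc4 byte from the traceback.

```python
# Sketch only: py2_style_encode is a hypothetical helper for illustration.

# Assumed sample input: the UTF-8 bytes for U+0105 ("a with ogonek").
raw = b"\xc4\x85"


def py2_style_encode(data, encoding="utf-8"):
    """Spell out Python 2's hidden step: ascii-decode first, then encode."""
    return data.decode("ascii").encode(encoding)


try:
    py2_style_encode(raw)
    error_message = None
except UnicodeDecodeError as exc:
    # The ascii codec rejects any byte >= 0x80.
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in works,
# and the resulting text encodes back to the same bytes.
text = raw.decode("utf-8")
round_tripped = text.encode("utf-8")

print(error_message)
print(round_tripped == raw)
```

The failure comes from the ascii decode, not from the utf-8 encode that was requested, which is why the error message names the 'ascii' codec.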

In short, the first thing to try is simply not calling .encode before you store the data.
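One way to follow that advice defensively is to normalise values to unicode before storing them, as in the sketch below. The to_text helper is hypothetical, and the sample values are assumed stand-ins for what self.request.get might return; in the handler, the results would be passed to ChatMessage(user=..., message=...) as they are, with no .encode call.

```python
def to_text(value, encoding="utf-8"):
    """Return text: decode byte strings, pass unicode through untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Assumed sample inputs standing in for request parameters.
chatter = to_text(b"Micha\xc5\x82")      # byte string: gets decoded
msgtext = to_text(u"cze\u015b\u0107")    # already unicode: returned as is

print(repr(chatter))
print(repr(msgtext))
```

Decoding at the point where data enters the program and keeping unicode everywhere else means nothing downstream has to guess which kind of string it was handed.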

(I may have pointed you to it before, but if not, please take the time to read this article about unicode)