I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
WHITE MEDIUM SMALL SQUARE (U+25FD)
So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re try: # UCS-4 highpoints = re.compile(u'[\U00010000-\U0010ffff]') except re.error: # UCS-2 highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') # mytext = u'<some string containing 4-byte chars>' mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the
WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
For those unfamiliar with UCS, like me, this is a system for Unicode conversion and a given build of Python will include support for either the UCS-2 or UCS-4 variant, each of which has a different upper bound on character support.
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!