user984003 - 5 months ago
Python Question

Python, Remove characters, such as emoji, that cannot be handled by UTF8 MySQL DB

How can I replace characters, such as emojis, that cannot be handled by a UTF8 MySQL database?

Answer

MySQL's utf8 character set stores at most three bytes per character, which covers precisely the Basic Multilingual Plane (BMP). Rather than targeting emoji specifically, you need to exclude all code points from the supplementary planes, since in MySQL these require utf8mb4.
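To make the BMP boundary concrete, here is a minimal sketch, assuming Python 3 (where each element of a str is a full code point). The helper name fits_mysql_utf8 is hypothetical and only for illustration; it checks the code point range and does not talk to MySQL.

```python
def fits_mysql_utf8(text):
    """Return True if every code point in text lies in the BMP (U+0000..U+FFFF).

    Hypothetical helper: it only inspects code points, it does not query MySQL.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


heart = "\u2665"      # BLACK HEART SUIT, inside the BMP
grin = "\U0001F600"   # GRINNING FACE, in a supplementary plane

# UTF-8 needs at most 3 bytes for a BMP code point and 4 bytes above it.
print(len(heart.encode("utf-8")), fits_mysql_utf8(heart))  # 3 True
print(len(grin.encode("utf-8")), fits_mysql_utf8(grin))    # 4 False
```

Any string for which this returns False contains at least one character that needs four bytes in UTF-8.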

Since you appear to be matching against 16-bit rather than 32-bit wide strings (as on a narrow Python build), a code point outside the BMP is represented as a surrogate pair: a so-called "high surrogate" in the range 0xD800..0xDBFF, followed by a "low surrogate" in the range 0xDC00..0xDFFF. The corresponding regex therefore is:

u'[\ud800-\udbff][\udc00-\udfff]'

♥ will not match this since it is u'\u2665'. I think strictly speaking it's only an emoji if followed by the variation selector U+FE0F, but either way it's safely in the BMP.
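The surrogate-pair pattern applies to 16-bit wide strings. As a hedged sketch for the other case, assuming Python 3 (where a supplementary-plane character is a single code point rather than a surrogate pair), the same idea can be written as one character class covering U+10000..U+10FFFF. The names NON_BMP and strip_non_bmp are hypothetical and chosen for illustration.

```python
import re

# Assumption: Python 3, where each str element is a full code point,
# so supplementary-plane characters are matched directly rather than
# as high/low surrogate pairs.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text, replacement=""):
    """Replace every code point above U+FFFF with `replacement`."""
    return NON_BMP.sub(replacement, text)


sample = "I \u2665 Python \U0001F600"
cleaned = strip_non_bmp(sample)
print(cleaned == "I \u2665 Python ")  # True: the heart stays, the emoji is removed
```

Passing a replacement such as "\ufffd" (the Unicode replacement character, itself in the BMP) keeps a visible marker where a character was dropped.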
