pir pir - 1 year ago 112
Python Question

Checking if string contains unicode using standard Python

I have some strings of roughly 100 characters and I need to detect if each string contains an unicode character. The final purpose is to check if some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.

I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

Answer Source

There is no point is testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).

If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:


where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.