Jared Smith - 9 days ago
Python Question

Python alphanumeric unicode regex not working as expected

I'm trying to write a Python regex that validates a string composed of:


  • Any unicode alphanumeric character (including combining characters)

  • Any number of space characters

  • Any number of underscores

  • Any number of dashes

  • Any number of periods



My test strings:

9 Melodía.de_la-montaña
9 Melodía.de_la-montaña


or as string literals produced with ascii():

str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'


These look identical but aren't: the first is normalized (it uses precomposed characters), while the second uses combining characters for the accents.
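As a small sketch of that difference, using only the standard-library unicodedata module, the two literals can be compared directly. The variable names are the ones from the question; nothing else is assumed.

```python
import unicodedata

str1 = '9 Melod\xeda.de_la-monta\xf1a'          # precomposed characters
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'    # base letters + combining marks

# The decomposed string is two code points longer (one per accent).
print(len(str1), len(str2))                         # 23 25
print(str1 == str2)                                 # False

# Normalizing in either direction makes the two strings compare equal.
print(unicodedata.normalize('NFC', str2) == str1)   # True
print(unicodedata.normalize('NFD', str1) == str2)   # True
```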

Here's my first stab:

import re

reg = re.compile(r"^[\w\.\- ]+$", re.IGNORECASE)
re.search(reg, str1) # None
re.search(reg, str2) # None


If I remove the anchors (^ and $) and use findall instead of search, I get lists like ['9 Melodi', 'a.de_la-montan', 'a'] or ['9 Melod', 'a.de_la-monta', 'a'].

I've even tried re.compile(r"^[\w\.\- ]+$", re.IGNORECASE | re.UNICODE), although that should be unnecessary in Python 3, right?

In searching for an answer I've found several related questions, but they are all old, deal with Python 2, and seem to suggest that the regex I wrote should work. The Python 3.5 regex docs mention that \w should match Unicode but offer no actual examples involving non-ASCII text.

How do I match the desired strings?

Answer

Your first sample, str1, matches just fine; \w includes all Unicode word characters, including Latin characters with accents.
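A per-character check makes the behaviour of \w concrete. This is only an illustrative sketch: the helper name describe is made up here and is not part of re or unicodedata.

```python
import re
import unicodedata

WORD = re.compile(r"\w")

def describe(ch):
    # Hypothetical helper: return the Unicode general category of one
    # character and whether the \w class matches it.
    return unicodedata.category(ch), WORD.fullmatch(ch) is not None

for ch in ('\xed', '\xf1', 'i', '\u0301', '\u0303'):
    print(ascii(ch), *describe(ch))
# '\xed', '\xf1' and 'i' are lowercase letters (Ll) and match \w;
# '\u0301' and '\u0303' are nonspacing marks (Mn) and do not.
```

The precomposed letters count as word characters, while the standalone combining marks fall outside \w.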

The second sample, str2, fails because the combining marks U+0301 and U+0303 are not word characters, so \w does not match them. You can normalise your strings to the composed form with unicodedata.normalize(), using the NFC form:

>>> import re
>>> import unicodedata
>>> str1 = '9 Melod\xeda.de_la-monta\xf1a'
>>> str2 = '9 Melodi\u0301a.de_la-montan\u0303a'
>>> reg = re.compile(r"^[\w\.\- ]+$")
>>> reg.search(str1)
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>
>>> reg.search(str2) is None
True
>>> reg.search(unicodedata.normalize('NFC', str2))
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>

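Note that the re.IGNORECASE flag is not needed; \w matches letters regardless of case. The re.UNICODE flag is also redundant in Python 3, where Unicode matching is the default for str patterns.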
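Putting the pieces together, one possible way to package the check is a small validator that normalizes before matching. This is a sketch under assumptions: the function name is_valid_name and the constant _ALLOWED are invented for illustration, and re.fullmatch is used in place of the ^...$ anchors.

```python
import re
import unicodedata

# Word characters, periods, dashes and spaces; at least one character.
_ALLOWED = re.compile(r"[\w.\- ]+")

def is_valid_name(text):
    # Hypothetical helper: compose combining sequences first (NFC),
    # then require the whole string to consist of allowed characters.
    normalized = unicodedata.normalize("NFC", text)
    return _ALLOWED.fullmatch(normalized) is not None

print(is_valid_name('9 Melod\xeda.de_la-monta\xf1a'))         # True
print(is_valid_name('9 Melodi\u0301a.de_la-montan\u0303a'))   # True
print(is_valid_name('no/slashes'))                            # False
print(is_valid_name('x\u0301'))                               # False
```

Using fullmatch also rejects a trailing newline, which a pattern ending in $ would accept. The last call shows a limit of this approach: NFC can only compose sequences that have a precomposed code point, so a base letter with a mark that has no composed form still contains a bare combining character and is rejected.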