Jared Smith - 3 months ago
Python alphanumeric unicode regex not working as expected

I'm trying to write a Python regex that will validate a string composed of:

  • Any unicode alphanumeric character (including combining characters)

  • Any number of space characters

  • Any number of underscores

  • Any number of dashes

  • Any number of periods

My test strings:

9 Melodía.de_la-montaña
9 Melodía.de_la-montaña

or as string literals produced with

str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'

These look identical but aren't: one is normalized and the other uses combining characters for the accents.

Here's my first stab:

import re

reg = re.compile("^[\w\.\- ]+$", re.IGNORECASE)
re.search(reg, str1) # None
re.search(reg, str2) # None

If I remove the positional qualifiers (the ^ and $ anchors) and collect all matches rather than only the first, I get lists like this:
['9 Melodi', 'a.de_la-montan', 'a']
['9 Melod', 'a.de_la-monta', 'a']

I've even tried
re.compile("^[\w\.\- ]+$", re.IGNORECASE | re.UNICODE)
although that should be unnecessary in Python 3, right?

While searching for an answer I found four related questions, but they are all old, deal with Python 2, and seem to suggest that the regex I wrote should work. The Python 3.5 regex docs mention that
should match unicode but offer no actual examples involving non-ASCII text.

How do I match the desired strings?


Your first sample, str1, matches just fine; \w includes all Unicode word characters, including Latin characters with accents.
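
To see which characters \w accepts, it can help to inspect the individual code points. The following is only a quick sketch using the standard library; the helper name describe is made up for illustration.

```python
import re
import unicodedata

def describe(ch):
    """Return (Unicode name, general category, matched-by-word-class) for one character."""
    return (
        unicodedata.name(ch, "UNKNOWN"),
        unicodedata.category(ch),
        re.fullmatch(r"\w", ch) is not None,
    )

# Precomposed letter: a single code point in category Ll (lowercase letter).
print(describe("\xed"))
# Combining acute accent: category Mn (nonspacing mark), not a word character.
print(describe("\u0301"))
```

The precomposed letter is reported as a word character, while the combining mark is not, which is why a string written with combining marks is rejected by the character class.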

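Your second sample, str2, fails because the combining marks U+0301 and U+0303 are not word characters, so \w does not match them. You can normalise your strings to the composed form with unicodedata.normalize(), using the NFC form: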

>>> import re
>>> import unicodedata
>>> str1 = '9 Melod\xeda.de_la-monta\xf1a'
>>> str2 = '9 Melodi\u0301a.de_la-montan\u0303a'
>>> reg = re.compile(r"^[\w\.\- ]+$")
>>> reg.search(str1)
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>
>>> reg.search(str2) is None
True
>>> reg.search(unicodedata.normalize('NFC', str2))
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>
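
If you want to wrap this up for reuse, one possible shape is a small validation function. This is a minimal sketch, not the only way to do it: the names ALLOWED and is_valid are hypothetical, and it assumes that normalising to NFC before matching is acceptable for your input.

```python
import re
import unicodedata

# Word characters, period, dash and space; a raw string avoids escape warnings.
ALLOWED = re.compile(r"[\w.\- ]+")

def is_valid(text):
    """Hypothetical helper: normalise to NFC, then require the whole string to match."""
    normalized = unicodedata.normalize("NFC", text)
    return ALLOWED.fullmatch(normalized) is not None

print(is_valid("9 Melod\xeda.de_la-monta\xf1a"))        # precomposed characters
print(is_valid("9 Melodi\u0301a.de_la-montan\u0303a"))  # combining marks
print(is_valid("bad/name"))                             # slash is not in the class
```

The sketch uses fullmatch() instead of ^...$ with search() because $ also matches just before a trailing newline. Be aware that NFC only helps when a precomposed form exists; a base letter followed by a combining mark that has no composed equivalent would still be rejected.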

Note that the re.IGNORECASE flag is not needed; \w matches both uppercase and lowercase letters.
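
A quick way to convince yourself of this is to compile the pattern with and without the flag and compare the results. The variable names below are illustrative only.

```python
import re

plain = re.compile(r"^[\w.\- ]+$")
flagged = re.compile(r"^[\w.\- ]+$", re.IGNORECASE)

# Upper-case, lower-case and mixed-case samples, written with escapes.
samples = ("MELOD\xcdA", "melod\xeda", "Melod\xeda")

for sample in samples:
    # Both patterns agree on every sample, so the flag adds nothing here.
    print(sample, bool(plain.search(sample)), bool(flagged.search(sample)))
```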