mus_siluanus - 1 year ago
Python Question

Regex: exception to negative character class

Using Python with Matthew Barnett's regex module.

I have this string:

The well known *H*rry P*tter*.

I'm using this regex to process the asterisks and obtain:
<em>H*rry P*tter</em>

REG = re.compile(r"""
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
""", re.VERBOSE)


The problem is that this regex doesn't match this kind of string unless I first protect the intraword asterisks (I convert them to decimal entities), which is awfully expensive in documents with lots of asterisks.


Is it possible to tell the negated character class to stop at internal asterisks only if they are not surrounded by word characters?

I tried these patterns in vain:

  • ([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)

  • ([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)

Answer Source

I suggest a single regex replacement for cases like the one you mentioned above:

re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)

See the regex demo


  • \B\*\b - a * that is preceded by a non-word boundary and followed by a word boundary
  • ([^*]*(?:\b\*\b[^*]*)*) - Group 1 capturing:
    • [^*]* - 0+ chars other than *
    • (?:\b\*\b[^*]*)* - zero or more sequences of:
      • \b\*\b - a * enclosed by word boundaries, i.e. an intraword asterisk
      • [^*]* - 0+ chars other than *
  • \b\*\B - a * that is preceded by a word boundary and followed by a non-word boundary
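
To check the breakdown above against the string from the question, here is a minimal sketch. It uses the standard library re module rather than the regex module from the question, on the assumption that both treat this pattern the same way, since it relies only on \b, \B and a plain negated class. The names EM_PATTERN and emphasize are hypothetical, chosen only for this illustration.

```python
import re

# The pattern from the answer, compiled once so it can be reused.
# EM_PATTERN and emphasize are illustrative names, not part of any library.
EM_PATTERN = re.compile(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B')


def emphasize(text):
    """Wrap *...* spans in <em> tags, leaving intraword asterisks untouched."""
    return EM_PATTERN.sub(r'<em>\1</em>', text)


# The outer asterisks become tags; the ones inside words are kept.
print(emphasize("The well known *H*rry P*tter*."))
# The well known <em>H*rry P*tter</em>.

# A lone intraword asterisk cannot open or close a span, so nothing changes.
print(emphasize("a*b stays as it is"))
# a*b stays as it is
```

The intraword asterisks are consumed by the \b\*\b alternative inside Group 1, so they never need to be converted to entities beforehand.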

