mus_siluanus - 1 month ago
Python Question

Regex: exception to negative character class

Using Python with Matthew Barnett's regex module.

I have this string:

The well known *H*rry P*tter*.


I'm using this regex to process the asterisks to obtain <em>H*rry P*tter</em>:

import regex as re  # assumed import: \p{L} and \p{N} need the regex module, not the stdlib re

REG = re.compile(r"""
(?<!\p{L}|\p{N}|\\)
\*
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
\*
(?!\p{L}|\p{N})
""", re.VERBOSE)


PROBLEM



The problem is that this regex doesn't match strings of this kind unless I first protect the intraword asterisks (by converting them to decimal entities), which is awfully expensive in documents with lots of asterisks.
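
To make the failure concrete, here is a minimal reproduction. It is only a sketch: it uses the standard-library re module, so [^\W_] stands in for \p{L}|\p{N} (an assumption, the original pattern needs the regex module), and the names QUESTION_RE, simple and intraword are made up for illustration.

```python
import re

# Stdlib approximation of the question's pattern:
# [^\W_] (a word character other than "_") stands in for \p{L}|\p{N}.
QUESTION_RE = re.compile(r"""
    (?<![^\W_])(?<!\\)   # not preceded by a letter, a digit, or a backslash
    \*
    ([^*]*?)             # anything except an asterisk
    \*
    (?![^\W_])           # not followed by a letter or a digit
""", re.VERBOSE)

simple = "The *well* known Harry Potter."
intraword = "The well known *H*rry P*tter*."

simple_match = QUESTION_RE.search(simple)
intraword_match = QUESTION_RE.search(intraword)

print(simple_match.group(1))  # well
print(intraword_match)        # None
```

The second search fails because [^*]*? can only reach the asterisk after "H", where the trailing lookahead sees "r" and rejects the match; every later asterisk is preceded by a letter, so the lookbehind rejects those as starting points too.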

QUESTION



Is it possible to tell the negated character class to stop at internal asterisks only when they are not surrounded by word characters?

I tried these patterns in vain:


  • ([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)

  • ([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)
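
A likely reason both attempts fail is that group and lookaround syntax has no special meaning inside [...]: the brackets just collect individual characters. The sketch below shows this with the standard-library re module and a deliberately simplified class (not one of the patterns above); the regex module is assumed to treat such a class the same way in its default mode.

```python
import re

# Inside a character class, "(?:" and ")" are ordinary characters.
# This class therefore means: any character except ( ? : a b )
pattern = re.compile(r"[^(?:ab)]")

kept = pattern.findall("a(b)c?:d")
print(kept)  # ['c', 'd']
```

So a condition such as "an asterisk not surrounded by word characters" has to live outside the class, as an alternation or a repeated group.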


Answer

I suggest a single regex replacement for cases like the one you mentioned above:

re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)
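
As a quick check, the replacement can be wrapped in a small function and run on the string from the question. This is a sketch using the standard-library re module on Python 3 str input; emphasize and EM_RE are hypothetical names, not part of the original code.

```python
import re

# The pattern from the answer, compiled once for reuse.
EM_RE = re.compile(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B')

def emphasize(s):
    """Hypothetical helper: wrap *...* spans in <em> tags, keeping intraword asterisks."""
    return EM_RE.sub(r'<em>\1</em>', s)

print(emphasize("The well known *H*rry P*tter*."))
# The well known <em>H*rry P*tter</em>.
print(emphasize("a *b* c"))
# a <em>b</em> c
print(emphasize("H*rry"))
# H*rry
```

Note that \b counts the underscore as a word character and knows nothing about backslash-escaped asterisks, so this is not an exact equivalent of the \p{L}|\p{N} and \\ lookarounds in the question.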


Details:

  • \B\*\b - a * that is preceded by a non-word boundary and followed by a word boundary
  • ([^*]*(?:\b\*\b[^*]*)*) - Group 1, capturing:
    • [^*]* - 0+ chars other than *
    • (?:\b\*\b[^*]*)* - zero or more sequences of:
      • \b\*\b - a * enclosed by word boundaries
      • [^*]* - 0+ chars other than *
  • \b\*\B - a * that is preceded by a word boundary and followed by a non-word boundary
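
Since the question already uses re.VERBOSE, the same breakdown can be kept next to the pattern itself. This is only a sketch with the standard-library re module; EM_VERBOSE is a hypothetical name.

```python
import re

EM_VERBOSE = re.compile(r"""
    \B\*\b          # opening asterisk: no word char before it, a word char after it
    (               # group 1: the emphasized text
      [^*]*         #   a run of characters other than an asterisk
      (?:
        \b\*\b      #   an intraword asterisk (word chars on both sides)
        [^*]*       #   then more characters other than an asterisk
      )*
    )
    \b\*\B          # closing asterisk: a word char before it, no word char after it
""", re.VERBOSE)

result = EM_VERBOSE.sub(r"<em>\1</em>", "The well known *H*rry P*tter*.")
print(result)  # The well known <em>H*rry P*tter</em>.
```

The inner group is an unrolled loop: [^*]* consumes the plain text, and each intraword asterisk is admitted explicitly, so the engine never has to back off character by character.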

