Stenskjaer Stenskjaer - 2 months ago 7
Python Question

Python re.escape and r'\b' gives unexpected results

Say I want to match the presence of the phrase

Sortes\index[persons]{Sortes}
in the phrase
test Sortes\index[persons]{Sortes} text
.

Using python
re
I could do this:

>>> search = re.escape('Sortes\index[persons]{Sortes}')
>>> match = 'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>


This works, but I want to avoid the search pattern
Sortes
to give a positive result on the phrase
test Sortes\index[persons]{Sortes} text
.

>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>


So I use the
\b
pattern, like this:

search = r'\b' + re.escape('Sortes\index[persons]{Sortes}') + r'\b'
match = 'test Sortes\index[persons]{Sortes} text'
re.search(search, match)


Now, I don't get a match.

If the search pattern does not contain any of the characters
[]{}
, it works. E.g.:

>>> re.search(r'\b' + re.escape('Sortes\index') + r'\b', 'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>


Also, if I remove the final
r'\b'
, it also works:

re.search(r'\b' + re.escape('Sortes\index[persons]{Sortes}'), 'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>


Furthermore, the documentation says about
\b



Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.


So I tried replacing the final
\b
with
(\W|$)
:

>>> re.search(r'\b' + re.escape('Sortes\index[persons]{Sortes}') + '(\W|$)', 'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>


Lo and behold, it works!
What is going on here? What am I missing?

Answer Source

See what a word boundary matches:

A word boundary can occur in one of three positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

In your pattern }\b only matches if there is a word char after } (a letter, digit or _).

When you use (\W|$) you require a non-word or end of string explicitly.

I always recommend unambiguous word boundaries based on negative lookarounds in these cases:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape('Sortes\index[persons]{Sortes}')), 'test Sortes\index[persons]{Sortes} test')

Here, (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.

Actually, it is easy to customize these lookaround patterns further (say, to only fail the match if there are letters around the pattern, use [^\W\d_] instead of \w, or if you only allow matches around whitespaces, use (?<!\S) / (?!\S) lookaround boundaries).