Ross Rogers - 7 months ago
Python Question

Why isn't my optional group greedy? /(5)?.*/

I thought that even though a group was optional (?), it would still be greedy and consume characters, if it could, before going to the next part of the regex.

When I specify the simplified regex (5)?.* versus (5).* (group 1 not optional), I see different behavior in Python 2.7.6, even though I would expect the same behavior using the exact same string:

>>> import re
>>> s = 'before [5.5s] after'
>>> r = re.compile(r'(5)?.*')
>>> print r.search(s).groups()
(None,)

>>> r2 = re.compile(r'(5).*')
>>> print r2.search(s).groups()
('5',)


What am I not getting? Why is the first regex, r, not sucking up a 5?

Note: I need the theory of why, as any attempt at solving this particular regex won't help me. This is an SSCCE. I have a more complex regex and I really wish to fill in the gap of my knowledge as to why the optional group isn't being as greedy as I would have thought and would like.

Answer

First example:

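  • Your regular expression is first tried against the string s starting at its very first character, because search() scans from left to right and stops at the first position where the whole pattern succeeds.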
  • Therefore, the first character of s, which is a "b", is matched against (5)?, which doesn't result in a match. That's not a problem, however, because (5)? is an optional part of the pattern, so the regex engine matches it zero times and keeps advancing the current position in the pattern.
  • The rest of the string matches the rest of the pattern, so the entire string is a match. The group (5) itself, however, didn't match anything, so you're seeing the None in your first example.
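The first example can be checked directly by looking at where the match starts. This is a minimal sketch using the string from the question; the second string, '5.5s] after', is made up here purely to show that the optional group does take a "5" when one sits at the position being tried. It is written with print() calls so it runs on Python 3 as well as 2.7.

```python
import re

s = 'before [5.5s] after'

# search() tries index 0 first. There, (5)? matches zero times and .*
# consumes the rest of the string, so the very first attempt succeeds.
m = re.search(r'(5)?.*', s)
print(m.span())     # (0, 19): the match begins at the first character
print(m.group(1))   # None: the group did not participate in the match

# Hypothetical input that begins with "5": the optional group is greedy
# and captures it, because a "5" is available at the current position.
m2 = re.search(r'(5)?.*', '5.5s] after')
print(m2.group(1))  # 5
```

So the group is greedy, but greediness only applies at the position the engine is currently trying; it never causes the engine to skip ahead in search of a better starting point.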

Second example:

  • The 5 is no longer optional, so the first character of a potentially matching string has to be a "5". Therefore, a potential match starts at the "5" after "before [".
  • In order to be a match, the remaining string has to match the remaining pattern .*, which it does.
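The same inspection for the second example shows the match starting later in the string. Again a small sketch on the question's string; the re.match() call is added here only as a contrast, since it allows a match at index 0 and nowhere else.

```python
import re

s = 'before [5.5s] after'

# With a mandatory (5), every start position before the first "5" fails,
# so search() keeps advancing until it reaches index 8.
m = re.search(r'(5).*', s)
print(m.start())    # 8
print(m.group(0))   # 5.5s] after
print(m.group(1))   # 5

# match() only tries index 0, where the mandatory group cannot match.
anchored = re.match(r'(5).*', s)
print(anchored)     # None
```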

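Note that, in general, a greedy .* is rarely what you want: it consumes as much of the string as it can and gives characters back only when the rest of the pattern would otherwise fail.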
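To illustrate the caution about greedy .*, here is a small sketch on a made-up string (not from the question) that compares the greedy quantifier with its lazy counterpart .*? when extracting bracketed text.

```python
import re

text = 'a [x] b [y]'  # hypothetical sample input

# Greedy: .* runs to the end of the string, then backtracks to the LAST "]".
greedy = re.search(r'\[(.*)\]', text).group(1)
print(greedy)  # x] b [y

# Lazy: .*? expands one character at a time and stops at the FIRST "]".
lazy = re.search(r'\[(.*?)\]', text).group(1)
print(lazy)    # x
```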