nonsensei nonsensei -4 years ago 90
Python Question

Why does this python regexp hangs?

regexp:

^(\w+,?\s?)+(?=:): hi hey\?$


input:

Aaaaaaaaa, bbbbbbbb, cccccccc, dddddddddd, eeeeeeeee: hi?


output: hangs

code

reg = re.compile('^(\w+,?\s?)+(?=:): hi hey\?$')
print reg.search('Aaaaaaaaa, bbbbbbbb, cccccccc, dddddddddd, eeeeeeeee: hi?')


Desired behaviour: find strings with the following pattern

[comma_and_(optionally)space_separated_values][colon][question]


Example of inputs that shouldn't match:


  • : qqq?
    (no values)

  • aa: qqq?
    (only one value)

  • aa, bb: qqq
    ( no question point)

  • aa, : qqq?
    (bad values format)

  • aa, bb, cc:?
    (no question)

  • , bb, cc: qqq?
    (bad values format)



Example of inputs that should match:


  • aa, bb: qq?

  • aa, b, c,d,e,f, g, h: qq?

  • aa, bb, cc: qq ee ff gggg hhhh?


Answer Source

It hangs because of this (\w+)+ scenario.
I.e. too complex on failure.
Works fine if it matches, blows up on failure.

This (\w,?\s?)+ is identical to (\w+,?\s?)+ but won't hang.

So, change it to this ^(\w,?\s?)+(?=:): hi hey\?$ and problem solved.

As a bonus, this ^(\w,?\s?)+: hi hey\?$ is identical.

Also, you can substitute .*?\?$ in place of your literal hi hey\?$
if expected to be variable literal.


Error: Target Operation .. 

The complexity of matching the regular 
expression exceeded predefined bounds.  
Try refactoring the regular expression 
to make each choice made by the state 
machine unambiguous.  This exception is 
thrown to prevent "eternal" matches that
take an indefinite period time to 
locate.

edit:

Note that there is always a potential problem with nested quantifiers.
I.e. those that are greedy and open ended, like (b+)*.

This can almost be cured by removing an inner nest (like b+ in the example).
By making it un-quantified, we can call that a pseudo anchor.

That is, it should be first in the group and is a un-quantified, required character.

This forces the engine on backtrack to go to that character again to check it.
If it is not quantified, it gives up immediately and will not even look at
the rest of the expression.
Thus it goes past that position in the string to find the next literal b.

That is basically what this backtracking cure is all about.

Given the backtrack pitfalls, we can make a solution to get the desired match.

^\s*(\w+\s*(?:[,\s]\s*\w+)+)\s*:\s*([^:]*?\w[^:]*?)\s*\?\s*$

Formatted

 ^                             # BOS
 \s*                           # Wsp trim
 (                             # (1 start), Values - minimum of 2 required
      \w+ \s*                       # First word
      (?: [,\s] \s* \w+ )+          # One or more space or comma seperated
                                    # word value's
 )                             # (1 end)
 \s*                           # Wsp trim
 :                             # Colon
 \s*                           # Wsp trim
 (                             # (2 start), Question -
      [^:]*?                        # Not a colon
      \w                            # At least a word char
      [^:]*?                        # Not a colon
 )                             # (2 end)
 \s*                           # Wsp trim
 \?                            # '?'
 \s*                           # Wsp trim
 $                             # EOS
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download