alvas alvas - 1 month ago 10
Python Question

Regex match (\w+) to capture single words delimited by ||| - Python

I am trying to match if there's singe word followed by

\s|||\s
and then another single word followed by
\s|||\s
so I'm using this regex:

single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'


And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of "deep loop")

>>> import re
>>> import time
>>> single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)
>>> re.match(single_word_regex, y, re.U)
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...


Why is it taking that long?

Is there a better/simpler regex to capture this condition
len(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())
?

Answer

Note that by itself, a r'(\w+)+' pattern will not cause the catastrophic backtracking, it will only be "evil" inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.

The current regex can be written as

r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)'

See the regex demo where there will be a match if you remove space and . in the second field.

Details:

  • ^ - start of string (not necessary with re.match)
  • (\w+) - (Group 1) 1+ letters/digits/underscores
  • \s - a whitespace
  • \|{3} - 3 pipe symbols
  • \s(\w+)\s\|{3}\s - see above ((\w+) creates Group 2)
  • (.*) - (Group 3) any 0+ characters other than linebreak characters.