lidl lidl - 16 days ago 5
Python Question

re.search becomes unresponsive

When i run this code it doesn't print neither

'checked'
nor
'not matching'
. It stops responding completely.

url='http://hoswifi.bblink.cn/v3/2-fd1cc0657845832e5e1248e6539a50fa/topic/55-13950.html?from=home'

m=re.search(r'/\d-(B|(\w+){10,64})/index.html',url)
if m:
print('checked')
else:
print('not matching')

Answer

Suppose we have the following script:

s = '1234567890'    
m = re.search(r'(\w+)*z', s)

Our string contains 10 digits, and does not contain 'z'. This is intentional so that it forces re.search to check all possible combinations, otherwise it will stop on first match.

I can't calculate the number of possible combinations, since math involved is rather tricky, but here is a small demonstration on what happens when s gets more digits:

enter image description here

Time goes from roughly 1μs for a single digit s to 100 seconds for a 30 digit s, that is, 108 more time.


My guess is that something similar happens when you use (\w+){10,64}. Instead you should use \w{10,64}.


Code used for the demo:

import timeit
import matplotlib.pyplot as plt

setup = """
import re
"""    
_base_stmt = "m = re.search(r'(\w+)*z','{}')"

# (searched string becomes '1', '11', '111'...)
statements = {}
for i in range(1, 18):
    statements.update({i: _base_stmt.format('1'*i)})

# Creates x, y values
x = []
y = []
for i in sorted(statements):
    x.append(i)
    y.append(timeit.timeit(statements[i], setup, number=1))

# Plot
plt.plot(x, y)
plt.xlabel('string length')
plt.ylabel('time(sec)')
plt.show()