user2940666 user2940666 - 5 months ago 17
Python Question

Complex non-greedy matching with regular expressions

I'm trying to use regular expressions in Python to extract the rows of an HTML table whose cells contain specific values. My aim in this (contrived) example is to get the rows with "cow".

import re

response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)

for m in r.finditer(response):
    # Print each full match followed by a blank line
    print(m.group(0), "\n")


My output is

<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>


<tr class="someClass"><td></td><td>cow</td></tr>


<tr class="someClass"><td></td><td>cow</td></tr>


While my aim is to get

<tr class="someClass"><td></td><td>cow</td></tr>


<tr class="someClass"><td></td><td>cow</td></tr>


<tr class="someClass"><td></td><td>cow</td></tr>


I understand that the non-greedy ? doesn't work in this case because of how backtracking works. I fiddled around with negative lookbehinds and lookaheads but couldn't get them to work.

Does anybody have suggestions?

I'm aware of solutions like Beautiful Soup, etc. but the question is about understanding regular expressions, not the problem per se.

To address the concerns about using regular expressions on HTML: the general problem I want to solve, using regular expressions ONLY, is to get from

response = '''0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'''


the output

0randomstuffB3randomstuff1

0randomstuffB4randomstuff1

0randomstuffB5randomstuff1


where randomstuff stands for arbitrary strings that contain neither 0 nor 1.

Answer

Your problem isn't related to greediness but to the fact that the regex engine tries to succeed at each position in the string, from left to right. That's why you always obtain the leftmost result: a non-greedy quantifier changes where a match ends, not where it starts!

If you write something like <tr.*?cow.*?tr> or 0.*?B.*?1 (for your second example), the pattern is first tried at the very start of the string:

  <tr class="someClass"><td></td><td>chicken</td></tr>...
# ^-----here

# or

  0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3ra...
# ^-----here

From there, the first .*? consumes characters until it reaches "cow" or "B". As a result, the first match is:

<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>

for your first example, and:

0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff1

for the second.
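
This can be checked quickly in Python. The following is a minimal sketch using the second example string from the question; the variable names lazy and greedy are only illustrative. It compares the lazy pattern with a greedy variant and shows that both matches start at index 0, so the quantifier type only moves the end of the match.

```python
import re

response = '0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'

# Same pattern, once with lazy and once with greedy quantifiers
lazy = re.search(r'0.*?B.*?1', response)
greedy = re.search(r'0.*B.*1', response)

# Both matches begin at the leftmost position where a match is possible
print(lazy.start(), greedy.start())

# Laziness only shortens the match on the right-hand side
print(lazy.group(0))
print(len(lazy.group(0)), len(greedy.group(0)))
```

In both cases the unwanted "A" items are swallowed, because nothing in the pattern prevents the engine from succeeding at the first 0.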


To obtain what you want, you need to make the pattern fail at the unwanted positions in the string. For that, .*? is useless because it is too permissive: it accepts any character, including the ones that end a row or an item.

You can, for instance, forbid a </tr> or a 1 from occurring before "cow" or "B".

# easy to write but not very efficient (with DOTALL)
<tr\b(?:(?!</tr>).)*?cow.*?</tr>

# more efficient
<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>

# easier to write when boundaries are single characters
0[^01B]*B[^01]*1
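
As a sanity check, the three patterns above can be run against the sample data from the question. This is only a sketch: the names tempered, unrolled and single are made up for the illustration, and re.DOTALL is passed to both HTML patterns, although only the first one is marked as needing it.

```python
import re

html = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

flat = '0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'

# Lookahead checked before every character: no </tr> may occur before "cow"
tempered = re.compile(r'<tr\b(?:(?!</tr>).)*?cow.*?</tr>', re.DOTALL)

# Same idea, but the lookaheads are only tested at "<" and "c"
unrolled = re.compile(
    r'<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>', re.DOTALL)

# Single-character boundaries: negated character classes are enough
single = re.compile(r'0[^01B]*B[^01]*1')

# No capture groups are used, so findall returns the full matches
rows_tempered = tempered.findall(html)
rows_unrolled = unrolled.findall(html)
items = single.findall(flat)

for row in rows_tempered:
    print(row)
for item in items:
    print(item)
```

Each pattern now fails when started at a "chicken" row or an "A" item, so the engine moves on and the first successful match begins at the first "cow" row or "B" item.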