Umair Umair - 1 month ago 11
Python Question

Python re.sub not working as expected

I have this HTML

<b>Source: </b> <a href=\'http: //website.com/ml/datasets/Iris\'>text here</a><br><p class="normal">Creator: R.A. Fisher
<br><br>Donor: Namehere <b>\'@\'</b> website.com</u>)</p>


I want to remove the repeated <br> tags from this using a regex.

I am using this:
_str = re.sub('<br>\s*','<br>',_str)


But it returns the string exactly as it was, with no change at all.

If I use the same regex but specify a different replacement string, it works. For example:
_str = re.sub('<br>\s*','',_str)

Answer

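You're only stripping off whitespace following each <br> with that: every match is replaced by <br> again, so when nothing but another <br> follows, the string comes back unchanged. You can instead use a positive lookahead to remove every <br> that has another <br> immediately following it: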

re.sub(r'<br>(?=<br>)', '', _str)

To also handle whitespace between consecutive <br> tags, use:

re.sub(r'<br>(?=\s*<br>)', '', _str)
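
To see the difference between the three patterns, here is a small sketch you can run. The helper name collapse_br and the two sample strings are made up for illustration (the first is a shortened piece of the HTML from the question); only re.sub and the patterns come from the posts above.

```python
import re

def collapse_br(html, allow_whitespace=True):
    """Drop every <br> that is followed by another <br>.

    Hypothetical helper for illustration. With allow_whitespace=True,
    whitespace between the two tags is tolerated by the lookahead.
    """
    pattern = r'<br>(?=\s*<br>)' if allow_whitespace else r'<br>(?=<br>)'
    return re.sub(pattern, '', html)

# Shortened fragment of the question's HTML: two adjacent <br> tags.
sample = "Creator: R.A. Fisher\n<br><br>Donor: Namehere"
# Assumed extra case: whitespace between the first two tags.
spaced = "one<br> \n<br><br>two"

# The original attempt replaces each <br> with <br>, so nothing changes here.
print(re.sub(r'<br>\s*', '<br>', sample) == sample)

# Strict lookahead: only directly adjacent tags are collapsed.
print(repr(collapse_br(sample, allow_whitespace=False)))

# Whitespace-tolerant lookahead: tags separated by whitespace collapse too.
print(repr(collapse_br(spaced)))
```

Note that the lookahead does not consume anything, so the whitespace that sat between two tags stays in the result; only the earlier <br> is removed. If you also want that whitespace gone, you would need to match it explicitly rather than only look ahead past it.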