Geoff - 2 months ago
HTML Question

Regex (or alternative method) to remove the content of a specific tag in an HTML document

I'm trying to build a regex for a find-and-replace in Sublime Text or Notepad++ that removes strikethrough text from an HTML page. In general, the strikethrough is formatted as follows:

<span style="color: rgb(255,0,0);"><s>Some text here</s></span>


So far, I've come up with this:

<span.*<s>.*<\/s><\/span>


But it doesn't stop at the first </span>; it continues on, so I get a huge slab of text selected. I've had a look at the regex wiki (and several other resources), and I'm sure this is a "greedy matches" issue, but I can't get my head around what the fix should look like.

Edit: I'm not set on regex, by the way. If anyone has a better way to achieve what I'm after, I'm all ears.

Answer

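The best way to limit a greedy match is to make it stop at a specific character. [abc] is a character class meaning any one of a, b, or c, while [^abc] means any character except a, b, or c. So [^<] means any character except <, and likewise [^>] means any character except >. In the pattern below, [^>]* consumes the rest of the opening span tag without running past its closing >, and [^<]* consumes the struck text without running into the next tag.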

<span[^>]*><s>[^<]*</s></span>
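
If you'd rather script the cleanup than do it in an editor, here is a rough Python sketch comparing the greedy pattern from the question with the negated-class pattern above. The sample HTML string and the variable names are made up for illustration, and it assumes the struck text itself contains no nested tags.

```python
import re

# Hypothetical sample: two struck-through spans with ordinary text between them.
html = (
    '<p>Keep this. '
    '<span style="color: rgb(255,0,0);"><s>Some text here</s></span>'
    ' And this. '
    '<span style="color: rgb(255,0,0);"><s>More struck text</s></span>'
    ' The end.</p>'
)

# Pattern from the question: both .* are greedy, so the match runs from the
# first <span to the last </s></span> on the line.
greedy = re.compile(r'<span.*<s>.*</s></span>')

# Pattern from the answer: each repetition stops at the next > or <.
bounded = re.compile(r'<span[^>]*><s>[^<]*</s></span>')

greedy_matches = greedy.findall(html)
bounded_matches = bounded.findall(html)

# Replace every bounded match with nothing, as the editor's "replace all" would.
cleaned = bounded.sub('', html)

print(len(greedy_matches))   # 1 (one oversized match that swallows " And this. ")
print(len(bounded_matches))  # 2 (one match per struck-through span)
print(cleaned)
```

In Sublime Text or Notepad++ the equivalent is to paste the pattern into the Find field with regular-expression mode switched on and leave the Replace field empty.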

The other (typically slower) way is to make the * or + operator lazy, so that it returns the shortest match instead of the longest. In Perl-compatible regex, you do this with *? or +?.
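
For comparison, here is a small sketch of the lazy version, again in Python (its re module accepts the same *? syntax). The test strings are invented for illustration. One caveat worth knowing: a lazy .*? still starts at the first <span it finds, so if an ordinary span comes before a struck-through one, the lazy pattern can swallow both, while the negated-class pattern does not.

```python
import re

lazy = re.compile(r'<span.*?<s>.*?</s></span>')
bounded = re.compile(r'<span[^>]*><s>[^<]*</s></span>')

# Hypothetical sample 1: every span on the line is a struck-through one.
html = (
    '<span style="color: rgb(255,0,0);"><s>Some text here</s></span>'
    ' kept '
    '<span style="color: rgb(255,0,0);"><s>More struck text</s></span>'
)

# Here the lazy pattern behaves as hoped: one match per span.
lazy_matches = lazy.findall(html)
print(len(lazy_matches))  # 2

# Hypothetical sample 2: an ordinary span precedes the struck-through one.
mixed = '<span>plain</span> <span><s>gone</s></span>'

# The lazy match begins at the first <span and stretches until it finds <s>,
# so the ordinary span is removed as well.
lazy_cleaned = lazy.sub('', mixed)

# The negated-class pattern fails at the first span and matches only the second.
bounded_cleaned = bounded.sub('', mixed)

print(repr(lazy_cleaned))     # ''
print(repr(bounded_cleaned))  # '<span>plain</span> '
```

So if the page mixes ordinary spans with struck-through ones, the negated-class pattern is the safer of the two.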