Invisible Invisible - 1 year ago 37
HTML Question

A simple Regex issue

I am trying to create a regex for the following string:

<td colspan=2>
CITY Head:
<span >
<span >##CITY##</span>
<td colspan=1>

I want to match the whole TD block that contains CITY Head. I came up with the following regex:

<td(.*)[\s](.*)[\s]+CITY Head+(.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s]+<\/td>

Basically, I had to write (.*)[\s] for each of the lines above and below CITY Head. But the number of lines can differ from case to case.

Therefore, I am looking for a general way to combine all the (.*)[\s] groups into something independent of the number of lines.


[\s\S]*? will match the smallest possible number (* = 0 or more, ? = ungreedy) of whitespace (\s) or non-whitespace (\S) characters, i.e. any characters at all, including line breaks.

<td((?!<\/?td)[\s\S])*?CITY Head[\s\S]*?<\/td>

The negative lookahead (?!<\/?td) makes sure the section before CITY Head doesn't span more than one table cell: the match cannot run past another <td or </td before it reaches CITY Head.
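
To see what the lookahead buys you, here is a minimal Python sketch. The pattern is the one above, unchanged. The SAMPLE string and the NAIVE pattern are made up for illustration: SAMPLE assumes every cell has a closing </td>, and NAIVE is the same pattern with the lookahead dropped.

```python
import re

# The pattern from this answer. [\s\S] already matches newlines,
# so no re.DOTALL flag is needed.
TD_WITH_CITY_HEAD = re.compile(r"<td((?!<\/?td)[\s\S])*?CITY Head[\s\S]*?<\/td>")

# For comparison only: the same pattern without the lookahead.
NAIVE = re.compile(r"<td[\s\S]*?CITY Head[\s\S]*?<\/td>")

# Hypothetical input, assuming well-formed cells with closing tags.
SAMPLE = """<tr>
<td colspan=1>
Other cell
</td>
<td colspan=2>
CITY Head:
<span >
<span >##CITY##</span>
</span>
</td>
<td colspan=1>
Another cell
</td>
</tr>"""

match = TD_WITH_CITY_HEAD.search(SAMPLE)
city_block = match.group(0) if match else None

naive_match = NAIVE.search(SAMPLE)
naive_block = naive_match.group(0) if naive_match else None

print(city_block)   # only the cell that contains CITY Head
print(naive_block)  # starts at the first <td and swallows "Other cell" too
```

Without the lookahead, the match starts at the first <td in the input and runs through the preceding cell until it finds CITY Head, which is probably not what you want.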

But a regex isn't a reliable way to parse HTML. In particular, this regex might pull out the wrong result if the HTML contains a syntax error: for example, if the cell's closing </td> is missing, the match runs on to the next </td> in the document.
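
If you can use a parser instead, something along these lines may work. This is only a sketch built on Python's standard html.parser module. The CellFinder class, its attributes, and the SAMPLE input are hypothetical names invented for this example. Note that html.parser is a tokenizer and does not repair broken markup the way a browser does, so a cell with no closing </td> would still not be reported here.

```python
from html.parser import HTMLParser


class CellFinder(HTMLParser):
    """Collect the markup of each <td> cell whose content contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.open_cells = []  # one list of markup pieces per open <td>
        self.matches = []     # rebuilt markup of the matching cells

    def _emit(self, piece):
        # Append to every open cell so nested tables are handled as well.
        for pieces in self.open_cells:
            pieces.append(piece)

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "td" and self.open_cells:
            cell = "".join(self.open_cells.pop())
            if self.needle in cell:
                self.matches.append(cell)

    def handle_data(self, data):
        self._emit(data)


# Hypothetical input, assuming well-formed cells with closing tags.
SAMPLE = """<tr>
<td colspan=1>
Other cell
</td>
<td colspan=2>
CITY Head:
<span >
<span >##CITY##</span>
</span>
</td>
</tr>"""

finder = CellFinder("CITY Head")
finder.feed(SAMPLE)
finder.close()
cells = finder.matches
print(cells)
```

The needle is checked against the rebuilt markup, so it would also match text inside an attribute value. If that matters, collect only the handle_data pieces for the comparison.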