Invisible Invisible - 5 months ago 7
HTML Question

A simple Regex issue

I am trying to create a regex for the following string:

<tr>
<td colspan=2>
<p><b>
CITY Head:
<span >
<span >##CITY##</span>
<o:p></o:p>
</span>
</b>
</p>
</td>
<td colspan=1>


I want to match the whole TD block that contains CITY Head. I came up with the following regex:

<td(.*)[\s](.*)[\s]+CITY Head+(.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s]+<\/td>


Basically I had to write
(.*)[\s]
once for every line above and below CITY Head. But the number of lines can differ from case to case.

Therefore, I am looking for a general way to combine all the
(.*)[\s]
into something independent of the number of lines.

Answer

[\s\S]*? matches the smallest possible number of characters of any kind, including newlines: [\s\S] is whitespace (\s) or non-whitespace (\S), i.e. any character, * means zero or more, and ? makes the quantifier lazy (non-greedy).

<td((?!<\/?td)[\s\S])*?CITY Head[\s\S]*?<\/td>

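The negative lookahead (?!<\/?td) is checked before every character consumed ahead of CITY Head, so that part of the match cannot cross an opening or closing td tag and therefore cannot span more than one table cell.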
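
As a quick check, here is a minimal Python sketch that applies the regex above to the HTML from the question. The extra "Other cell" row is an assumption added only to show that the match does not start in an earlier cell, and the variable names are illustrative.

```python
import re

# Sample input modelled on the question's HTML. The first cell ("Other cell")
# is a made-up addition to show the match does not begin in an earlier cell.
html = """<tr>
<td colspan=1>Other cell</td>
<td colspan=2>
<p><b>
CITY Head:
<span >
<span >##CITY##</span>
<o:p></o:p>
</span>
</b>
</p>
</td>
<td colspan=1>
"""

# The regex from the answer, unchanged.
pattern = re.compile(r"<td((?!<\/?td)[\s\S])*?CITY Head[\s\S]*?<\/td>")

match = pattern.search(html)
# block holds the full matched cell, or None if nothing matched.
block = match.group(0) if match else None
print(block)
```

Because both quantifiers are lazy, the match starts at the td that contains CITY Head and stops at the first closing td tag after it, regardless of how many lines sit in between.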

But using a regex isn't a reliable way of parsing HTML. In particular, this regex might pull out the wrong result if the HTML contains a syntax error.
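
For comparison, here is a rough sketch of the parser-based route using Python's standard html.parser module. The CellFinder class and its attribute names are hypothetical, made up for this illustration. It collects the text of each matching cell rather than the raw markup, and it assumes td elements are not nested.

```python
from html.parser import HTMLParser


class CellFinder(HTMLParser):
    """Collects the text of every <td> whose text contains `needle`.

    Hypothetical helper for illustration; assumes cells are not nested.
    """

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.in_cell = False
        self.text_parts = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell starts: reset the text buffer.
            self.in_cell = True
            self.text_parts = []

    def handle_data(self, data):
        if self.in_cell:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            text = "".join(self.text_parts)
            if self.needle in text:
                # Collapse runs of whitespace so the result is one line.
                self.matches.append(" ".join(text.split()))
            self.in_cell = False


html = """<tr>
<td colspan=2>
<p><b>
CITY Head:
<span >
<span >##CITY##</span>
<o:p></o:p>
</span>
</b>
</p>
</td>
<td colspan=1>
"""

finder = CellFinder("CITY Head")
finder.feed(html)
finder.close()
print(finder.matches)
```

The parser tracks tags itself, so the number of lines or nested inline elements inside the cell does not matter.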