zeddex - 23 days ago
HTML Question

Regex that makes sure a match starts with a string

I am running a regex on some HTML and need to extract some image title tags.

The image title tags look like this:

title="Image Title Here"


And this works for the task:

(?<=title=").*?(?=")


However, the problem is that it also grabs unwanted title tags. I noticed, though, that in the HTML I run the regex on, the images are inside h3 tags.

How can I update my regex to make sure it only gets matches from HTML starting with '<h3'?

My current regex is:

(?<=<h3).*(?<=title=").*?(?=")

Answer

Using DOMDocument with XPath should be less error-prone than a regex:

$html = <<<DATA
<body>
<h1>Text 1<img title="Not this"></h1>
<h2>Text 2<img title="Not this"></h2>
<h3>Text 3<img title="This"></h3>
</body>
DATA;

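// Parse the fragment without adding implied <html>/<body> wrappers or a doctype
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Select img elements that are direct children of h3 and have a title attribute
$xpath = new DOMXPath($dom);
$imgs = $xpath->query('//h3/img[@title]');

// Collect the title value of each matched image
$res = array();
foreach ($imgs as $img) {
    $res[] = $img->getAttribute('title');
}

print_r($res);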


The '//h3/img[@title]' XPath expression selects every img element that is a direct child of an h3 element and has a title attribute, and $img->getAttribute('title') returns the value of that attribute.
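One detail worth checking against the real markup: the single slash in 'h3/img' only matches images that sit directly under the h3. If the images are wrapped in another element, such as a link, the descendant axis ('//h3//img[@title]') may be what is needed. The following is a minimal sketch under that assumption; the sample HTML and the collectTitles helper are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical sample markup: one image is a direct child of <h3>,
// one is wrapped in a link inside <h3>, and one sits inside <h2>.
$html = <<<DATA
<body>
<h2><img title="Not this"></h2>
<h3><img title="Direct child"></h3>
<h3><a href="#"><img title="Nested in a link"></a></h3>
</body>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Illustrative helper: run an XPath query and return the title values it finds.
function collectTitles(DOMXPath $xpath, string $query): array
{
    $titles = [];
    foreach ($xpath->query($query) as $img) {
        $titles[] = $img->getAttribute('title');
    }
    return $titles;
}

// Child axis: only images directly under an h3.
$direct = collectTitles($xpath, '//h3/img[@title]');

// Descendant axis: images anywhere inside an h3.
$nested = collectTitles($xpath, '//h3//img[@title]');

print_r($direct); // contains only "Direct child"
print_r($nested); // contains "Direct child" and "Nested in a link"
```

Which form to use depends on how the images are nested in the page being scraped; the descendant form is more forgiving, while the child form is stricter.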