Nicolas Nicolas - 26 days ago 14
HTML Question

PHP - Parse HTML to retrieve the href from an "a" tag that is nested inside another "a" tag

I've been searching for hours (there shouldn't be any duplicate) and have tried many different approaches using both regex (regular expressions) and DOMDocument, without success.

What the non-standard HTML looks like:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
<a href="SOME_URL_3">SOME TEXT</a>
</a>


The problem is that I'm trying to get the URL "SOME_URL_3", but whether I parse with regex or DOMDocument, parsing stops as soon as it encounters the first href. Since the second "a" tag is nested inside the first one, the parser sees them as a single tag.

I observed that browsers seem to separate the tags automatically when parsing, as follows:

Before:

<a href="SOME_URL">
<a href="SOME_URL_2">
</a>
</a>


After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>


I haven't been able to replicate this browser behavior in PHP.

The closest I have come to a working attempt:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) {
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}


As you might expect, it unfortunately doesn't work; otherwise I wouldn't be here asking for help.

Please don't hesitate to share your knowledge, any help or suggestion will be greatly appreciated!




Update



I forgot to specify in my initial question that the answer should still capture standard hrefs. My goal is to extend my existing HTML parser so that it retrieves the URL from every href. My initial code used only regex, and it could not capture the second href from nested "a" tags. A perfect answer would capture both nested and standard hrefs. Brandon White's solution works well for nested hrefs only, but running two different regexes (nested and standard) over the entire HTML content would be resource-consuming. An ideal solution would be a single regex that captures both at the same time, if that is possible.
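To illustrate the single-pass idea, here is a minimal sketch, not a definitive parser. It assumes href values are always quoted and that every opening "a" tag should be matched on its own, so nesting does not matter to the pattern. The sample markup, the variable names and the optional "javascript:" filter are assumptions made for the example.

```php
<?php
// Hypothetical sample mixing a nested pair and a standard link.
$html = <<<HTML
<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" id="SOME_ID">
<a href="SOME_URL_3">SOME TEXT</a>
</a>
<a href="SOME_URL_5">plain link</a>
HTML;

// One pass: match each opening <a ...> tag and capture its quoted href value.
// Group 1 is the quote character, group 2 is the href value.
preg_match_all('/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
$hrefs = $matches[2];

// Optional (assumption): drop javascript: pseudo-URLs such as the outer tag's href.
$urls = array_values(array_filter($hrefs, function ($h) {
    return stripos($h, 'javascript:') !== 0;
}));

print_r($hrefs);
print_r($urls);
```

Because preg_match_all keeps scanning after the first match, the outer and the inner href are both found in the same pass. Regex remains fragile on real pages (unquoted attributes, commented-out markup), so treat this as a starting point only.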

Answer

I've been able to achieve my goal using the solution below:

$result = <<<HTML
<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

<a href="SOME_URL3">
    <a href="SOME_URL_4">
    </a>
</a>

<a href="SOME_URL_5">
</a>
<a href="SOME_URL_6">
</a>

HTML;

$dom = new DOMDocument();
@$dom->loadHTML($result);


$output = array(); //Collected href values

foreach($dom->getElementsByTagName('a') as $link) {

    $tag_html = $dom->saveHTML($link); //Get tag outer html (the tag itself plus its content)

    if (substr_count($tag_html, "href") > 1) { //If tag html contains more than one href attribute
        preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
        $output[] = $link_output[1][1]; //Second match, first capture group: the nested href
    } else { //Not nested tag
        $output[] = $link->getAttribute('href'); //Output first href
    }
}

echo "<pre>".print_r($output, true)."</pre>"; //Pass true so print_r returns the string instead of printing it

Output:

array
(
    [0] => SOME_URL_2
    [1] => SOME_URL_4
    [2] => SOME_URL_5
    [3] => SOME_URL_6
)

This solution works with entire HTML pages containing mixed and/or nested content. It captures as many nested hrefs as needed while still capturing the hrefs of standard "a" tags.
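If every href is wanted, including the outer one of a nested pair, a DOM-only variant may be enough. This is a sketch under that assumption: getElementsByTagName('a') returns every "a" element in document order, whether the parser kept the inner tag nested or moved it out, so reading getAttribute('href') on each one needs no regex. The sample markup and variable names are hypothetical.

```php
<?php
// Hypothetical sample: one nested pair followed by one standard link.
$html = '<a href="SOME_URL"><a href="SOME_URL_2"></a></a><a href="SOME_URL_5"></a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of silencing them with @
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    // Anchors without an href (named anchors, for example) are skipped.
    if ($link->hasAttribute('href')) {
        $hrefs[] = $link->getAttribute('href');
    }
}

print_r($hrefs);
```

Unlike the solution above, this keeps the outer href of each nested pair, so filter the result afterwards if only the inner URLs are needed.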