Bruce Bruce - 20 days ago 10
PHP Question

testing contents of loadHTML($str)

The code below tests an HTML document to see whether the text of any h1 or h2 tag exactly matches the string $title (ignoring case). It works as expected.

$s1='random text';
$a1='random anchor text';
$href1='http://www.someurl.com';
$document = new DOMDocument();
$libxml_previous_state = libxml_use_internal_errors(true);
$document->loadHTML($str);
libxml_use_internal_errors($libxml_previous_state);

$tags = array ('h1', 'h2');
$texts = array ();
foreach($tags as $tag)
{
    $elementList = $document->getElementsByTagName($tag);
    foreach($elementList as $element)
    {
        // Append rather than key by tag name, so a second h1/h2 does not overwrite the first
        $texts[] = strtolower($element->textContent);
    }
}

if(in_array(strtolower($title),$texts)) {
echo '<div class="success"><i class="fa fa-check-square-o" style="color:green"></i> This article used the correct title tag.</div>';
} else {
echo '<div class="error"><i class="fa fa-times-circle-o" style="color:red"></i> This article did not use the correct title tag.</div>';
}


I need to run three more tests. First, I need to scan the document for the existence of $s1, but I can't figure this out. The working code looks for an exact match inside the h1 or h2 tags. With $s1, however, I am not looking for an exact match, just for that text anywhere in the document, whether surrounded by other text or not.

Second, I need another exact-match test that looks for $a1 in the text of the "a" tags. Third, I need to test whether $href1 appears as an href.

I am not sure how to do these tests. I could probably manage the $a1 test, as it's just another exact match, but I don't know how to do the href test, nor how to scan for a string that may be surrounded by other text.

Hope this all makes sense.

Update

I need a solution that lets me echo a single "yes the string exists" or "no it doesn't" per test, the same way the current test echoes only once rather than once per loop iteration.

Example results would look like:

yes $s1 is in the document
no $s1 is not in the document
yes $href1 is an href in the document
no $href1 is not an href in the document
yes $a1 is an anchor text in the document
no $a1 is not an anchor text in the document


I also believe I should be using substr() but I am not sure exactly how.

Hoping for some working examples and detailed explanations.

Answer

Here is code that extracts (1) anchor hrefs, (2) anchor text, (3) h1 text, (4) h2 text and (5) the text of every text node, and stores each in its own array. It then searches those arrays for exact or partial matches of the strings in question.
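To make the exact/partial distinction concrete, here is a minimal sketch using the same built-ins the code relies on (strcasecmp() and stripos()). The $haystack value is made up for illustration. It also shows why substr(), mentioned in the question, is not the right tool: it extracts by position and does not search.

```php
<?php
// Sample value for illustration only (not from the original post).
$haystack = 'Some Random Text here';

// Exact, case-insensitive comparison: strcasecmp() returns 0 when the strings are equal.
$exact_full = (strcasecmp($haystack, 'some random text here') === 0);
$exact_part = (strcasecmp($haystack, 'random text') === 0);

// Partial, case-insensitive search: stripos() returns the offset of the match, or false.
// Always compare with !== false, because a match at offset 0 is falsy.
$partial       = (stripos($haystack, 'random text') !== false);
$partial_at_0  = (stripos($haystack, 'some') !== false);

// substr() only cuts out a piece by position and length; it does not search.
$piece = substr($haystack, 5, 11);

var_dump($exact_full, $exact_part, $partial, $partial_at_0, $piece);
```

So an "is this string anywhere in there" test is a stripos() check, while an "is this the whole string" test is a strcasecmp() check.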

We did it with XPath because it seems easier to extract text from leaf nodes that way.
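As a smaller illustration of the XPath idea on its own, the sketch below pulls out only the nodes of interest with targeted expressions (//a/@href and //text()) instead of walking every node. The sample markup and variable names are assumptions made for this example.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = "<h1>Title</h1><p>some random text here</p>"
      . "<a href='http://www.someurl.com'>random anchor text</a>";

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // suppress warnings about imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Every href attribute that belongs to an <a> element.
$hrefs = array();
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

// The content of every text node in the document.
$texts = array();
foreach ($xpath->query('//text()') as $node) {
    $texts[] = $node->nodeValue;
}

print_r($hrefs);
print_r($texts);
```

The resulting arrays can be passed to matching helpers like the ones in the full code.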

Code:

<?php
    /* returns true if an exact match for $str is found in items of $arr array */
    function find_exact($str, array $arr) {
      foreach ($arr as $i) {if (!strcasecmp($i,$str)) {return(true);}}
      return(false);
    }

    /* returns true if a partial/exact match for $str is found in items of $arr array */
    function find_within($str, array $arr) {
      foreach ($arr as $i) {if (stripos($i,$str)!==false) {return(true);}}
      return(false);
    }

    $s1='random text';
    $a1='random anchor text';
    $href1='http://www.someurl.com';
    $document = new DOMDocument();
    $libxml_previous_state = libxml_use_internal_errors(true);

    /* Sample document. Just for testing */
    $str=<<<END_OF_DOC
<h1>abc h1title def</h1>
<h2>h2title</h2>
<div>some random text here</div>
<div>two</div>three
<a href='http://www.someurl.com'>some random anchor text here</a>
<span>four</span>five<span>six<b>boldscript</b></span>
END_OF_DOC;

    $document->loadHTML($str);
    libxml_use_internal_errors($libxml_previous_state);

    /* We extract the texts into these arrays, for matching later */
    $a_texts=array(); $a_hrefs=array(); $h1_texts=array(); $h2_texts=array(); $all_texts=array();

    /* We use XPath because it seems easier for extracting text nodes */
    $xp = new DOMXPath($document); $eList=$xp->query("//node()");
    foreach ($eList as $e) {
      //print "Node {".$e->nodeName."} {".$e->nodeType."} {".$e->nodeValue."} {".$e->textContent."}<br/>";
      if (!strcasecmp($e->nodeName,"a")) { array_push($a_texts,$e->textContent);array_push($a_hrefs,$e->getAttribute("href")); }
      if (!strcasecmp($e->nodeName,"h1")) {array_push($h1_texts,$e->textContent);}
      if (!strcasecmp($e->nodeName,"h2")) {array_push($h2_texts,$e->textContent);}
      if ($e->nodeType === XML_TEXT_NODE) {array_push($all_texts,$e->textContent);}
    }

    //var_dump($a_texts); print("<br/>"); var_dump($a_hrefs); print("<br/>"); var_dump($h1_texts); print("<br/>");
    //var_dump($h2_texts);print("<br/>");var_dump($all_texts);print("<br/>");

    if (find_within($s1,$all_texts)) { print "yes $s1 is in the document<br/>"; }
    else { print "no $s1 is not in the document<br/>"; }

    if (find_exact($href1,$a_hrefs)) { print "yes $href1 is an href in the document<br/>"; }
    else { print "no $href1 is not an href in the document<br/>"; }

    if (find_within($a1,$a_texts)) { print "yes $a1 is an anchor text in the document<br/>"; }
    else { print "no $a1 is not an anchor text in the document<br/>"; }
?>

Result:

yes random text is in the document
yes http://www.someurl.com is an href in the document
yes random anchor text is an anchor text in the document
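
Two caveats about the code above. It checks each text node separately, so a phrase broken up by inline markup (for example "random <b>text</b>") would not be found. And the anchor-text test is a partial match, whereas the question asked for an exact one. The sketch below is one possible way to handle both; the helper names doc_contains_text() and has_anchor_text() and the sample markup are hypothetical, not part of the code above.

```php
<?php
/* Hypothetical helpers, for illustration only. */

/* True if $needle occurs anywhere in the document's combined text (case-insensitive). */
function doc_contains_text(DOMDocument $doc, $needle) {
    // textContent of the root element concatenates all descendant text nodes,
    // so a phrase split across inline tags is still found.
    $root = $doc->documentElement;
    return $root !== null && stripos($root->textContent, $needle) !== false;
}

/* True if the trimmed text of some <a> element equals $text (case-insensitive). */
function has_anchor_text(DOMDocument $doc, $text) {
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (strcasecmp(trim($a->textContent), $text) === 0) {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML("<p>some random <b>text</b> here</p>"
             . "<a href='http://www.someurl.com'> Random Anchor Text </a>");
libxml_clear_errors();
libxml_use_internal_errors($previous);

$found_phrase  = doc_contains_text($doc, 'random text');       // phrase spans the <b> tag
$exact_anchor  = has_anchor_text($doc, 'random anchor text');  // whole anchor text
$partial_only  = has_anchor_text($doc, 'anchor text');         // only part of the anchor text

print ($found_phrase ? "yes" : "no") . "\n";
print ($exact_anchor ? "yes" : "no") . "\n";
print ($partial_only ? "yes" : "no") . "\n";
```

Note the trade-off: the combined text also includes the contents of title, script and style elements, and it joins adjacent elements without inserting spaces, so a phrase can appear to match across an element boundary. Which behaviour is preferable depends on the documents being tested.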