Stephen Miller - 4 months ago
HTML Question

How to get both element and content structure using PHP's DOMDocument?

Say I wanted to implement automatic font request optimization based on the element and content structure of a page. How would I get the required info using PHP's DOMDocument?

The problem in a nutshell can be illustrated with two structure examples:

Example 1

<p><em>All italic paragraph text</em></p>


Example 2

<p>Normal paragraph text <em>and some italic text</em></p>


The element structure is the same in the two examples, i.e. a paragraph element with an <em> child element. However, the content structure differs: all text is italic in example 1, but there is both normal and italic text in example 2.

My current approach for getting the element structure is something like this:

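$dom = new DOMDocument;
$dom->loadHTML($html); // $html is assumed to hold the page markup
$elms = [];
foreach ($dom->getElementsByTagName('p') as $elm) {
    $elms[] = $dom->saveHTML($elm); // serialized markup of each <p>
}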


I would then iterate through the elements and use the same approach for finding nested elements such as <em> and <strong>.

But I need a good approach for the content structure. I guess I could split the text on <em> and </em> and check whether the first and last items in the resulting list are non-empty, but that reminds me of custom HTML searching using regex, which seems to be the least recommended approach here.

But what are my alternatives in this case?

Answer

You can use DOMXPath to find the individual text nodes:

$html = "<p>Normal paragraph text <em>and some italic text</em></p>";

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = $xpath->query("//text()");
$elms = [];
foreach ($textNodes as $elm) {
    $elms[] = array(
        "parent" => $elm->parentNode->tagName,
        "path"   => $elm->parentNode->getNodePath(),
        "text"   => $elm->textContent
    );
}

$elms will contain:

array (
  array (
    'parent' => 'p',
    'path' => '/html/body/p',
    'text' => 'Normal paragraph text ',
  ),
  array (
    'parent' => 'em',
    'path' => '/html/body/p/em',
    'text' => 'and some italic text',
  ),
)
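
Building on the text-node query above, here is a minimal sketch of how the two examples from the question could be told apart. It assumes that "italic" simply means "has an <em> ancestor" and that whitespace-only text nodes can be ignored. The function name classifyParagraphs and the labels it returns are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): label each <p>
// according to how much of its visible text sits inside an <em> element.
function classifyParagraphs(string $html): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $kinds = [];
    foreach ($xpath->query('//p') as $p) {
        // Count text nodes under this <p> that contain more than whitespace.
        $all = $xpath->query('.//text()[normalize-space()]', $p)->length;
        // Count the subset of those nodes that have an <em> ancestor.
        // Assumption: any <em> ancestor counts, even one outside the <p>.
        $italic = $xpath->query('.//text()[normalize-space()][ancestor::em]', $p)->length;

        if ($all === 0) {
            $kinds[] = 'empty';
        } elseif ($italic === $all) {
            $kinds[] = 'all-italic';
        } elseif ($italic === 0) {
            $kinds[] = 'normal-only';
        } else {
            $kinds[] = 'mixed';
        }
    }
    return $kinds;
}

// The two examples from the question, in order.
$html = '<p><em>All italic paragraph text</em></p>'
      . '<p>Normal paragraph text <em>and some italic text</em></p>';
var_export(classifyParagraphs($html));
```

For the two example paragraphs this should yield 'all-italic' followed by 'mixed'. The same pattern could presumably be extended to <strong> by adding a second count with [ancestor::strong], though how to combine the labels depends on which font variants you actually need to request.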