Stephen Miller - 3 months ago
HTML Question

How to get both element and content structure using PHP's DOMDocument?

Say I wanted to implement automatic font request optimization based on the element and content structure of a page. How would I get the required info using PHP's DOMDocument?

The problem in a nutshell can be illustrated with two structure examples:

Example 1

<p><em>All italic paragraph text</em></p>

Example 2

<p>Normal paragraph text <em>and some italic text</em></p>

The element structure is the same in the two examples, i.e. a paragraph element with an <em> child element. However, the content structure differs: all text is italic in example 1, but there is both normal and italic text in example 2.

My current approach for getting the element structure is something like this:

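$dom = new DOMDocument;
$dom->loadHTML($html); // $html is assumed to hold the page markup
$elms = [];
foreach ($dom->getElementsByTagName('p') as $elm) {
    $elms[] = $dom->saveHTML($elm); // serialized markup of each paragraph
}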

I would then iterate through the elements and use the same approach to find nested elements such as <em>.

But I need a good approach for the content structure. I guess I could split the text on <em> and check whether the first and last elements of the resulting list are non-empty, but that reminds me of custom HTML searching using regex, which seems to be the least recommended approach here.

But what are my alternatives in this case?


You can use DOMXPath to find the individual text nodes:

$html = "<p>Normal paragraph text <em>and some italic text</em></p>";

$dom = new DOMDocument;
$dom->loadHTML($html); // parse the markup; libxml wraps it in <html><body>
$xpath = new DOMXPath($dom);
$textNodes = $xpath->query("//text()"); // every text node in the document
$elms = [];
foreach ($textNodes as $elm) {
    $elms[] = array(
        "parent" => $elm->parentNode->tagName,
        "path"   => $elm->parentNode->getNodePath(),
        "text"   => $elm->textContent
    );
}

$elms will contain:

array (
  array (
    'parent' => 'p',
    'path' => '/html/body/p',
    'text' => 'Normal paragraph text ',
  ),
  array (
    'parent' => 'em',
    'path' => '/html/body/p/em',
    'text' => 'and some italic text',
  ),
)
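
To go from the list of text nodes to the "content structure" asked about in the question, one option is to group the text nodes by paragraph and record which styles they fall under. The following is only a rough sketch under some assumptions: the function name classifyParagraphs is made up for illustration, italic is taken to mean "has an <em> or <i> ancestor", and whitespace-only text nodes are ignored.

```php
<?php
// Hypothetical helper (not part of the original answer): for each <p>,
// report which text styles ("normal", "italic") its text nodes carry.
function classifyParagraphs(string $html): array
{
    $dom = new DOMDocument;
    // Suppress parser warnings for imperfect real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $result = [];
    foreach ($xpath->query('//p') as $p) {
        $styles = [];
        // All non-whitespace text nodes anywhere below this paragraph.
        foreach ($xpath->query('.//text()[normalize-space()]', $p) as $text) {
            // Assumption: a text node is italic if any ancestor is <em> or <i>.
            $italic = $xpath->query('ancestor::em | ancestor::i', $text)->length > 0;
            $styles[$italic ? 'italic' : 'normal'] = true;
        }
        // Key each paragraph by its node path so it can be located again.
        $result[$p->getNodePath()] = array_keys($styles);
    }
    return $result;
}

$report = classifyParagraphs(
    '<p><em>All italic paragraph text</em></p>' .
    '<p>Normal paragraph text <em>and some italic text</em></p>'
);
print_r($report);
```

For the two examples from the question, the first paragraph should come out with only "italic" and the second with both "normal" and "italic", so a paragraph that needs just the italic font can be told apart from one that needs both. The same idea could be extended to <strong>/<b> or other elements that affect which font files are requested.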