user81993 - 1 month ago
PHP Question

Getting the text content of a specific DOMElement

After a little hair-pulling, I discovered that DOMElement->textContent also includes the combined text of all of that element's descendants.
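To illustrate the behaviour, here is a minimal sketch. The `<p>`/`<b>` fragment is made up for the example and is not taken from my real document:

```php
<?php
// Hypothetical fragment: <p> has its own text plus a nested <b> element.
$doc = new DOMDocument();
$doc->loadXML('<p>own text <b>nested text</b></p>');

$p = $doc->documentElement;

// textContent concatenates the text of every descendant node.
$combined = $p->textContent;

// The first child here is the text node that belongs to <p> itself.
$own = $p->firstChild->nodeValue;

var_dump($combined); // string(20) "own text nested text"
var_dump($own);      // string(9) "own text "
```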

Looking around a bit, I saw people suggesting DOMElement->firstChild->textContent, but that is no good for me. I'm walking through the document by following the hierarchy and cues from element attributes, and the data is just as likely to be on a branch as on a leaf, so I would get multiple hits even though only one of them is correct.

Is there an actual way to get the text content of this one specific element and not that of its children?

EDIT: never mind, I found a way to make sure the first child is actually a text node:

function get_text($el) {
    // Use the first child only if it is a text node; an element child is ignored.
    if ($el->firstChild instanceof DOMText) return $el->firstChild->textContent;
    return "";
}

Answer

Simply iterate over the child nodes and check whether each one is a text node. You might want to skip nodes that consist only of whitespace, though:

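function getNodeText(DOMNode $node) {
  // A text node is its own text content.
  if ($node->nodeType === XML_TEXT_NODE)
    return $node->textContent;

  // Walk the direct children and return the first text node that is not blank.
  $node = $node->firstChild;
  while ($node) {
    // Compare against '' explicitly so that a text of "0" is not treated as empty.
    if ($node->nodeType === XML_TEXT_NODE &&
      ($text = trim($node->textContent)) !== '')
    {
      return $text;
    }
    $node = $node->nextSibling;
  }
  return '';
}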

$xml = <<<'EOXML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child>
    <x>x text</x>
    child text
  </child>
  root text
</root>
EOXML;


$doc = new DOMDocument();
$doc->loadXML($xml);

var_dump(getNodeText($doc->getElementsByTagName('x')[0]));
var_dump(getNodeText($doc->getElementsByTagName('root')[0]));
var_dump(getNodeText($doc->getElementsByTagName('child')[0]));

Sample output

string(6) "x text"
string(9) "root text"
string(10) "child text"
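Note that the function above stops at the first non-blank text node. If an element's own text is split around a child element, you may want all of the pieces. The following is only a sketch under that assumption: `getOwnText` is a hypothetical helper name, the `<item>` fragment is made up, joining the pieces with a single space is an arbitrary choice, and the return type declaration assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: joins every direct text child of an element
// instead of returning only the first one.
function getOwnText(DOMElement $el): string
{
    $parts = [];
    foreach ($el->childNodes as $child) {
        // Text and CDATA nodes are the direct children that carry character data.
        if ($child->nodeType === XML_TEXT_NODE ||
            $child->nodeType === XML_CDATA_SECTION_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }
    // Joining with a single space is an assumption; pick whatever separator fits your data.
    return implode(' ', $parts);
}

$doc = new DOMDocument();
$doc->loadXML('<item>before <b>bold</b> after</item>');

var_dump(getOwnText($doc->documentElement)); // string(12) "before after"
```

The text inside `<b>` is skipped because only the direct children of `<item>` are inspected.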