RandomCoder RandomCoder - 5 months ago 78
PHP Question

DOMDocument remove script tags from HTML source

I used @Alex's approach here to remove script tags from a HTML document using the built in DOMDocument. The problem is if I have a script tag with Javascript content and then another script tag that links to an external Javascript source file, not all script tags are removed from the HTML.

Codepad demo

$result = '
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
alert("hello");
</script>
</head>
<body>hey</body>
</html>
';

$dom = new DOMDocument();
if($dom->loadHTML($result))
{
$script_tags = $dom->getElementsByTagName('script');

$length = $script_tags->length;

for ($i = 0; $i < $length; $i++) {
if(is_object($script_tags->item($i)->parentNode)) {
$script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
}
}

echo $dom->saveHTML();
}


The above code outputs:



<html>
<head>
<meta charset="utf-8">
<title>hey</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>


As you can see from the output, only the external script tag was removed. Is there anything I can do to ensure all script tags are removed?

Answer

Your error is actually trivial. A DOMNode object (and all its descendants - DOMElement, DOMNodeList and a few others!) is automatically updated when its parent element changes, most notably when its number of children change. This is written on a couple of lines in the PHP doc, but is mostly swept under the carpet.

If you loop using ($k instanceof DOMNode)->length, and subsequently remove elements from the nodes, you'll notice that the length property actually changes! I had to write my own library to counteract this and a few other quirks.

The solution:

if($dom->loadHTML($result))
{
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
            $r->item(0)->parentNode->removeChild($r->item(0));
    }
echo $dom->saveHTML();

I'm not actually looping - just popping the first element one at a time. The result: http://sebrenauld.co.uk/domremovescript.php