Björn - 11 months ago 44
HTML Question

How to prevent the PHP DomDocument from "fixing" your HTML string

I have been trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.

However, I have run into a bit of a problem. For testing purposes I've written a small HTML page containing the following incorrect HTML:

<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>


As you can see, the title is outside the head tag, which is the error I am trying to detect.

Now comes the problem: when I use cURL to fetch the response string from this page and then pass it to DOMDocument to load it as HTML, it actually fixes this by adding another pair of <head> tags around the title.

<head>
<meta name="description" content="randomdesciption">


I have checked the cURL response data and that is in fact not the problem; somehow PHP's DOMDocument fixes the HTML syntax while executing the loadHTML() method.

I have also tried turning off the DOMDocument recover, substituteEntities and validateOnParse properties by setting them to false, without success.
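
For reference, that attempt might look roughly like the sketch below. The variable names and the inline test string are assumptions made for illustration. The three properties are regular DOMDocument properties, but as described above, switching them off did not stop loadHTML() from repairing the markup.

```php
<?php
// Hypothetical test input: the <title> sits outside <head> on purpose.
$html = '<head><meta name="description" content="randomdesciption"></head>'
      . '<title>sometitle</title>';

// Collect parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;

// The three properties mentioned above, all switched off.
$dom->recover = false;
$dom->substituteEntities = false;
$dom->validateOnParse = false;

$loaded = $dom->loadHTML($html);
$out = $dom->saveHTML();
libxml_clear_errors();

// Inspect the serialized markup to see what the parser did to the input.
echo $out;
```

Comparing the echoed markup with the input string shows whether the parser moved or wrapped the title.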

I have been searching Google but have been unable to find any answers so far. I guess it is a bit rare for someone to actually want broken HTML left unfixed.

Does anyone know how to prevent DOMDocument from fixing my broken HTML?

UPDATE: as of PHP 5.4 (built against libxml 2.7.7 or later) you can pass LIBXML_HTML_NOIMPLIED, which sets libxml's HTML_PARSE_NOIMPLIED flag:

$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
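
A slightly fuller sketch of that call follows. The sample markup and the defined() guard are illustrative assumptions, not part of the original answer. The flag turns off the automatic adding of implied html and body elements, so it is worth inspecting the output rather than assuming the original structure survives untouched.

```php
<?php
// Hypothetical test input with the <title> outside <head>.
$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

// Collect parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;

// Only pass the option when this PHP/libxml build actually provides it.
if (defined('LIBXML_HTML_NOIMPLIED')) {
    $loaded = $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
} else {
    $loaded = $dom->loadHTML($html);
}

$out = $dom->saveHTML();
libxml_clear_errors();

// Print the result so the handling of the stray <title> can be checked by eye.
echo $out;
```

The guard keeps the script from failing on builds where the constant is missing; in that case the parser falls back to its default repairing behavior.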


Before PHP 5.4 you can't. libxml does have a flag, HTML_PARSE_NOIMPLIED, that prevents it from adding implied markup, but it is not accessible from PHP in those versions.

On a side note, this particular behavior seems to depend on the LIBXML_VERSION used.

Running this snippet:

<?php
$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
echo $dom->saveHTML(), LIBXML_VERSION;


on my machine will give

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>