Konstantin Knyazev - 6 months ago
HTML Question

MSHTML parses the ARTICLE tag incorrectly

I'm trying to parse HTML with the MSHTML parser in Delphi 10 Seattle. It works fine, but the ARTICLE tag confuses it: the parsed ARTICLE element has no innerHTML and no children, although both are present in the source.

program Project1;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils,
  Variants,
  ActiveX,
  MSHTML;

procedure DoParse;
var
  idoc: IHTMLDocument2;
  iCollection: IHTMLElementCollection;
  iElement: IHTMLElement;
  V: OleVariant;
  HTML: String;
  i: Integer;
begin
  // Test document containing the HTML5 ARTICLE element.
  Html :=
    '<html>'#10+
    '<head>'#10+
    ' <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    ' <article>'#10+
    ' <p>This is my Article</p>'#10+
    ' </article>'#10+
    '</body>'#10+
    '</html>';

  // Two-element variant array; only element 0 is assigned.
  v := VarArrayCreate( [0,1], varVariant);
  v[0]:= Html;

  // Create a standalone document and feed it the markup.
  idoc := CoHTMLDocument.Create as IHTMLDocument2;
  idoc.designMode := 'on';
  idoc.write(PSafeArray(System.TVarData(v).VArray));
  idoc.close;

  // Dump every element the parser produced.
  iCollection := idoc.all as IHTMLElementCollection;
  for i := 0 to iCollection.length-1 do
  begin
    iElement := iCollection.item( i, 0) as IHTMLElement;
    if assigned(ielement) then
      WriteLN(iElement.tagName + ': ' + iElement.outerHTML);
  end;
end;

begin
  try
    DoParse;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  ReadLN;
end.


The output of the program is:

HTML: <HTML><HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY></HTML>
HEAD: <HEAD><TITLE>Articles</TITLE>
<META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
TITLE: <TITLE>Articles</TITLE>
META:
<META name=GENERATOR content="MSHTML 11.00.9600.18283">
BODY:
<BODY><ARTICLE>
<P>This is my Article</P></ARTICLE>undefined</BODY>
ARTICLE: <ARTICLE>
P:
<P>This is my Article</P>
/ARTICLE: </ARTICLE>


As you can see, the ARTICLE tag is handled incorrectly: the ARTICLE element has no content, and /ARTICLE is reported as a separate tag.

Can someone help me to understand this issue? Thanks!

Best regards, Konstantin Knyazev

Answer

See the docs: custom element | custom object.

The Windows Internet Explorer support for custom tags on an HTML page requires that a namespace be defined for the tag. Otherwise, the custom tag is treated as an unknown tag when the document is parsed. Although navigating to a page with an unknown tag in Internet Explorer does not result in an error, unknown tags have the disadvantage of not being able to contain other tags, nor can they have behaviors applied to them.

In your case ARTICLE is an unknown tag. To make it a custom tag that can contain other tags, you need to give it a namespace, e.g. <MY:ARTICLE>, and declare that namespace with <html XMLNS:MY>. (If you do not declare the namespace, the DOM parser will add it automatically.)
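
As a rough illustration of the namespace approach, the question's test program could be adapted along these lines. This is only a sketch and has not been run here: the MY prefix, the program name and the variable names are placeholders chosen for the example, and printing innerHTML instead of outerHTML is simply a way to check whether the namespaced element now reports its content.

```pascal
program NamespacedArticle;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Variants,
  Winapi.ActiveX,
  Winapi.MSHTML;

const
  // MY is a hypothetical prefix picked for this sketch; it is declared on
  // the HTML element and used on the custom ARTICLE tag.
  Html =
    '<html xmlns:my>'#10+
    '<head>'#10+
    ' <title>Articles</title>'#10+
    '</head>'#10+
    '<body>'#10+
    ' <my:article>'#10+
    ' <p>This is my Article</p>'#10+
    ' </my:article>'#10+
    '</body>'#10+
    '</html>';

procedure DoParse;
var
  Doc: IHTMLDocument2;
  Elements: IHTMLElementCollection;
  Element: IHTMLElement;
  V: OleVariant;
  I: Integer;
begin
  // One-element variant array holding the markup to write.
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Html;

  Doc := CoHTMLDocument.Create as IHTMLDocument2;
  Doc.write(PSafeArray(TVarData(V).VArray));
  Doc.close;

  // List each parsed element together with its inner markup.
  Elements := Doc.all;
  for I := 0 to Elements.length - 1 do
  begin
    Element := Elements.item(I, 0) as IHTMLElement;
    if Assigned(Element) then
      Writeln(Element.tagName, ': ', Element.innerHTML);
  end;
end;

begin
  // Initialize COM explicitly for the console application.
  CoInitialize(nil);
  try
    try
      DoParse;
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    CoUninitialize;
  end;
end.
```

If the namespace is picked up as the quoted documentation describes, the ARTICLE entry should list the P element in its inner markup and no separate /ARTICLE entry should appear.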

See also: Using Custom Tags in Internet Explorer


In your comment you mentioned that you are trying to parse a live HTML5 page (you did not mention that in the question).
Your program runs in IE7 compatibility mode by default. So either add <!DOCTYPE html> as the first line of the HTML and <meta http-equiv="X-UA-Compatible" content="IE=edge"> as the first line of the HEAD section, or add the FEATURE_BROWSER_EMULATION registry key (see: How to have Delphi TWebbrowser component running in IE9 mode?). I'm not sure whether this will affect the standalone IHTMLDocument2; I leave the test up to you.
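
For the registry route, a minimal sketch might look like the following. It assumes the per-user FeatureControl key and the value 11000 (IE11 mode), and it presumably has to run before the document object is created. The program name, the constants and the EnableBrowserEmulation helper are hypothetical names for this example, and, as said above, whether the setting affects a standalone IHTMLDocument2 is untested.

```pascal
program EnableEmulation;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Win.Registry,
  Winapi.Windows;

const
  // Per-user feature control key read by the IE engine.
  EmulationKey =
    'Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION';
  // Assumed value for this sketch: 11000 requests IE11 mode.
  IE11Mode = 11000;

// Hypothetical helper: registers the current executable under the
// emulation key. Returns False if the key could not be opened or created.
function EnableBrowserEmulation(Mode: Integer): Boolean;
var
  Reg: TRegistry;
begin
  Result := False;
  Reg := TRegistry.Create;
  try
    Reg.RootKey := HKEY_CURRENT_USER;
    if Reg.OpenKey(EmulationKey, True) then
    try
      // The value name is the executable's file name, e.g. Project1.exe.
      Reg.WriteInteger(ExtractFileName(ParamStr(0)), Mode);
      Result := True;
    finally
      Reg.CloseKey;
    end;
  finally
    Reg.Free;
  end;
end;

begin
  try
    if EnableBrowserEmulation(IE11Mode) then
      Writeln('Emulation value written for ', ExtractFileName(ParamStr(0)))
    else
      Writeln('Could not open the FEATURE_BROWSER_EMULATION key');
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
```

Writing under HKEY_CURRENT_USER avoids the need for administrator rights; in a real program the helper would be called at startup, before CoHTMLDocument.Create.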

P.S.: The line idoc.designMode := 'on'; is not needed.