Ziv Ziv - 21 days ago 7
C# Question

How to read HTML as XML?

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good fit for my case.

My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML to a string using:

public static string readHTML(string url)
{
    // Requires: using System.IO; using System.Net;
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

    // Dispose the response and the reader even if reading throws.
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(res.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}


When I try to load that string using LoadXml(string xml), I get this exception:

'--' is an unexpected token. The expected token is '>'


How should I go about reading the HTML file into parsable XML?

Answer

HTML simply isn’t the same as XML (unless the HTML happens to be conforming XHTML or HTML5 in XML mode). The best approach is to use an HTML parser to read the HTML. Afterwards you can transform the result to LINQ to XML, or process it directly.
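
To make the XHTML exception concrete, here is a minimal sketch that uses only the standard library. The names LinkExtractor and TryGetLinks are made up for illustration. It extracts links with LINQ to XML when the markup happens to be well-formed XML, and reports failure otherwise. That failure is the point at which a dedicated HTML parser (a third-party library such as Html Agility Pack is one option, not something the question mentions) would have to take over.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class LinkExtractor
{
    // Hypothetical helper: returns true and fills 'links' with the href
    // values of all <a> elements when 'markup' is well-formed XML
    // (for example conforming XHTML). Returns false for markup that an
    // XML parser rejects, which is the usual case for real-world HTML.
    public static bool TryGetLinks(string markup, out List<string> links)
    {
        links = new List<string>();

        XDocument doc;
        try
        {
            doc = XDocument.Parse(markup);
        }
        catch (XmlException)
        {
            // Not XML: this text needs an HTML parser instead.
            return false;
        }

        // Match on the local name so the XHTML namespace does not matter.
        links = doc.Descendants()
                   .Where(e => e.Name.LocalName == "a")
                   .Select(e => (string)e.Attribute("href"))
                   .Where(href => href != null)
                   .ToList();
        return true;
    }
}
```

With a string such as the one returned by readHTML, a false result means the page is not XML and should not be passed to LoadXml or XDocument.Parse at all. A comment containing "--", an unquoted attribute value, or an unclosed <br> is enough to cause that.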
