I need to parse a chunk of HTML, which I obtain from a page, into XML. Most of the tags convert fine when I load them into an XmlDocument, except self-closing tags that are not closed (XmlDocument does not accept those). Unfortunately I cannot fix them in the page itself, since it is generated by a third-party engine, so I have to add the closing slashes myself. I am not that great at regex, so I need some help on how to add the missing "/" to these tags.
Appreciate any input.
I would recommend using the HTML Agility Pack to parse it. The pack has the ability to write to XML and will take care of all of the closing of tags for you (as well as CDATA wrapping and other tricky problems you may run into). For example, this is how you can convert HTML to XML:
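using System.Diagnostics;
using System.IO;
using System.Xml;

// Parse the malformed HTML; note the unclosed <img> and the missing </body>.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string HTML = "<HTML><body><a href ='something'> <img src='a.jpg'></a></HTML>";
doc.LoadHtml(HTML);

// Write the document as XML into an in-memory stream.
MemoryStream ms = new MemoryStream();
XmlWriter xml = XmlWriter.Create(ms);
doc.OptionOutputAsXml = true;
doc.Save(xml);

// Rewind the stream and print the generated XML.
ms.Position = 0;
StreamReader sr = new StreamReader(ms);
Debug.WriteLine(sr.ReadToEnd());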
This produces the following output:
<?xml version="1.0" encoding="iso-8859-1"?><html><body><a href="something"> <img src="a.jpg" /></a></body></html>
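Since the goal in the question is an XmlDocument, the XML output can be fed straight into one instead of being printed. The following is only a minimal sketch, assuming the HtmlAgilityPack package is referenced; the class name `HtmlToXmlSketch` and the helper `ToXmlDocument` are hypothetical names made up for illustration and are not part of the library.

```csharp
using System;
using System.IO;
using System.Xml;
using HtmlAgilityPack;

public static class HtmlToXmlSketch
{
    // Hypothetical helper (not a library API): converts an HTML string
    // into an XmlDocument by letting the Agility Pack emit XML first.
    public static XmlDocument ToXmlDocument(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // same option as in the answer above
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            // Serialize the parsed document as XML text.
            doc.Save(writer);

            // LoadXml throws XmlException if the text is still not well-formed.
            var xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(writer.ToString());
            return xmlDoc;
        }
    }

    public static void Main()
    {
        XmlDocument xml = ToXmlDocument(
            "<HTML><body><a href ='something'> <img src='a.jpg'></a></HTML>");

        // Query the result with XPath, e.g. the src attribute of the image.
        Console.WriteLine(xml.SelectSingleNode("//img/@src")?.Value);
    }
}
```

Writing to a StringWriter rather than a MemoryStream avoids the rewind-and-reread step, which seems simpler when the XML is only needed in memory. Wrapping the `LoadXml` call in a try/catch is probably worthwhile for real pages, since unusual input may still yield output that XmlDocument rejects.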