Ninita Ninita - 15 days ago 4
C# Question

Using C# to remove custom xml tags from html

I have a string containing some HTML, and I need to parse that HTML into an XDocument.

string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);


But sometimes my htmlString contains tags like <o:p></o:p>, and in that case XDocument.Parse() throws this exception:


The ':' character, hexadecimal value 0x3A, cannot be included in a
name. Line 1, position 650.


How can I remove those tags, or at least replace the ':' in the tag name?

Before parsing, I try to remove the tags that contain ':' with a regex, but it isn't working:

try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString = regex.Replace(htmlString, "");
    }
}
catch { }


HTML example

<p>Some text</p>

<p class="MsoNormal" style="TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; LINE-HEIGHT: 150%">
<?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office"?>
<o:p> </o:p>
</p>

<p>More text</p>


UPDATE

I'm using HtmlAgilityPack, but it doesn't remove these tags.

My code

ConfigureHtmlDocument();

var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(htmlString);

var htmlError = htmlDoc.ParseErrors.SafeAny();

if (!htmlError)
    htmlString = htmlDoc.DocumentNode.InnerHtml;

try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString = regex.Replace(htmlString, "");
    }
}
catch { }

string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);

//more code


ConfigureHtmlDocument()

if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("ul"))
    HtmlNode.ElementsFlags.Add("ul", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["ul"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("li"))
    HtmlNode.ElementsFlags.Add("li", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["li"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("ol"))
    HtmlNode.ElementsFlags.Add("ol", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["ol"] = HtmlElementFlag.Closed;

//more similar code

Answer

Solved! The regular expression was wrong. I replaced it with these:

// remove XML declarations and processing instructions such as <?xml:namespace ... ?>
htmlString = Regex.Replace(htmlString, @"<\?xml.*\?>", "");

// remove custom tags like <o:p> and </o:p>
htmlString = Regex.Replace(htmlString, @"<(?:[\S]\:[\S])[^>]*>", "");
htmlString = Regex.Replace(htmlString, @"</(?:[\S]\:[\S])[^>]*>", "");

And now it works!
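
One caveat with the patterns above: `[\S]\:[\S]` only matches a one-character prefix directly after `<`, so `<o:p>` is removed but a tag such as `<st1:place>` would not be, and the greedy `.*` in the first pattern can swallow everything between the first `<?xml` and the last `?>` on a line. Below is a minimal sketch of a slightly more general version. The class name `OfficeMarkupCleaner` and the method `Strip` are hypothetical names chosen for illustration, and the sample input is a shortened form of the HTML from the question. It only strips the tags themselves (any text inside a prefixed element is kept), and like any regex approach it can be fooled by a `>` inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class OfficeMarkupCleaner
{
    // <?xml ... ?> declarations and processing instructions.
    // Lazy match so two of them on one line are removed separately;
    // Singleline lets '.' cross line breaks.
    private static readonly Regex ProcessingInstruction =
        new Regex(@"<\?xml.*?\?>", RegexOptions.Singleline);

    // Opening, closing, or self-closing tags whose name has a prefix of any
    // length, e.g. <o:p>, </o:p>, <st1:place />. No whitespace is allowed
    // before the colon, so ordinary tags with "a: b" inside an attribute
    // value are left alone.
    private static readonly Regex PrefixedTag =
        new Regex(@"</?[A-Za-z][\w.-]*:[\w.-]+[^>]*>");

    public static string Strip(string html)
    {
        string withoutPi = ProcessingInstruction.Replace(html, "");
        return PrefixedTag.Replace(withoutPi, "");
    }

    public static void Main()
    {
        string htmlString =
            "<p>Some text</p>" +
            "<p class=\"MsoNormal\" style=\"TEXT-ALIGN: justify\">" +
            "<?xml:namespace prefix=\"o\" ns=\"urn:schemas-microsoft-com:office:office\"?>" +
            "<o:p> </o:p></p>" +
            "<p>More text</p>";

        string cleaned = Strip(htmlString);

        // Same wrapping as in the question: add a single root element.
        XDocument doc = XDocument.Parse("<root>" + cleaned + "</root>");
        Console.WriteLine(doc.Root.ToString(SaveOptions.DisableFormatting));
    }
}
```

Each replacement here feeds the result of the previous one into the next call, so all of them take effect on the same string before it reaches XDocument.Parse().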
