bearaman bearaman - 1 day ago 4
C# Question

Regular Expression to clear attributes from an HTML tag

I have a pretty simple regex question. My HTML tag looks like the following:

<body lang=EN-US link=blue vlink=purple>


I want to clear all attributes and just return
<body>


There are a number of other HTML tags whose attributes I'd like to clear, so I hope to reuse the solution. How can I do this with a regular expression?
Thanks,
B.

Answer

Use HtmlAgilityPack like this:

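    public string RemoveAllAttributesFromEveryNode(string html)
    {
        var htmlDocument = new HtmlAgilityPack.HtmlDocument();
        htmlDocument.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // e.g. when the input contains no elements at all.
        var allNodes = htmlDocument.DocumentNode.SelectNodes("//*");
        if (allNodes == null)
            return html;

        foreach (var eachNode in allNodes)
            eachNode.Attributes.RemoveAll();

        return htmlDocument.DocumentNode.OuterHtml;
    }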

Call this method with the HTML from which you want to remove all attributes.
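Since the question mentions clearing attributes on a number of specific tags, here is a minimal sketch of a narrower variant that only touches the tags you name. It assumes the HtmlAgilityPack NuGet package is referenced; `AttributeStripper` and `RemoveAttributesFromTags` are hypothetical names introduced for illustration, not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeStripper
{
    // Hypothetical helper: clears attributes only on elements with the given tag names.
    public static string RemoveAttributesFromTags(string html, params string[] tagNames)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        foreach (var tagName in tagNames)
        {
            // Descendants(name) yields an empty sequence when nothing matches,
            // so no null check is needed here. ToList() takes a snapshot
            // before the nodes are modified.
            foreach (var node in htmlDocument.DocumentNode.Descendants(tagName).ToList())
                node.Attributes.RemoveAll();
        }

        return htmlDocument.DocumentNode.OuterHtml;
    }
}
```

It could be called as `AttributeStripper.RemoveAttributesFromTags(html, "body", "p")`, passing lowercase tag names. Attributes on all other elements are left untouched.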

HtmlAgilityPack will help you a lot with this.

Don't use a regex on HTML files that may contain scripts: in JavaScript, the characters < and > are operators, not tag delimiters. A regex will probably match these operators as if they were tags, which will completely mess up the document.
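To make that warning concrete, here is a small sketch using only `System.Text.RegularExpressions`. The pattern is a deliberately naive one chosen for illustration (it is an assumption, not a pattern anyone in this thread proposed), and `NaiveRegexDemo` and `StripAttributes` are hypothetical names. It handles the simple tag from the question, but it also rewrites a JavaScript comparison inside a script block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveRegexDemo
{
    // Naive illustrative pattern: '<', a tag name, then anything up to the next '>'.
    private static readonly Regex OpeningTag = new Regex(@"<(\w+)[^>]*>");

    // Replaces every match with just the bare tag name in angle brackets.
    public static string StripAttributes(string html)
    {
        return OpeningTag.Replace(html, "<$1>");
    }

    public static void Main()
    {
        // Works for the tag in the question.
        Console.WriteLine(StripAttributes("<body lang=EN-US link=blue vlink=purple>"));
        // prints: <body>

        // Corrupts the script: "<b && c>" is mistaken for a tag named "b".
        Console.WriteLine(StripAttributes("<script>if (a<b && c>d) {}</script>"));
        // prints: <script>if (a<b>d) {}</script>
    }
}
```

The second call silently changes the condition `a<b && c>d` into `a<b>d`, which is the kind of damage an HTML parser avoids because it treats script content as text rather than markup.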
