Evgeny Evgeny - 2 months ago 16
C# Question

Remove all HTML tags and format text with returns, spaces, etc. using .NET

I need to strip HTML from a string and display the result as custom-formatted text.

For example:

asdas<br/>asdas


Here the tag should be replaced by a line break. I also need to replace margins with spaces and tabs, and to remove all remaining tags. Are there any examples or ready-made solutions that produce reasonably formatted text once the HTML tags are removed?

Current solution (I'm looking for a better, ready-made one):

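    using System;
    using System.Text.RegularExpressions;
    using System.Web; // HttpUtility

    /// <summary>
    /// Methods to remove HTML from strings.
    /// </summary>
    public static class HtmlRemoval
    {
        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripAllTagsRegex(string source)
        {
            if (string.IsNullOrEmpty(source))
                return source;

            // Strip the tags first, then decode entities such as &amp;.
            // Encoding first would turn every "<" into "&lt;", so the regex would never match.
            string stripped = _htmlRegex.Replace(source, string.Empty);
            return HttpUtility.HtmlDecode(stripped);
        }

        public static string ChangeTagsToTextFormat(string source)
        {
            if (string.IsNullOrEmpty(source))
                return source;

            // Work on the raw markup: after HtmlEncode the literal "<br/>" no longer exists.
            // These replacements are exact matches, so "<br>", "<br />" and "<BR/>" are not covered.
            return source.Replace("<br/>", Environment.NewLine)
                         .Replace("</div>", Environment.NewLine)
                         .Replace("</p>", Environment.NewLine);
        }
    }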

Answer

I believe HTML Agility Pack is the simplest solution here, especially since you're removing (possibly malformed) HTML tags. The idea behind the following code is to select the nodes and append each one's InnerText followed by a line break. Because SelectNodes gives you a collection to work with, you can apply whatever other formatting you want at that point:

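    // Requires: using System.Text; using HtmlAgilityPack;
    private string stripTags(string html)
    {
        var output = new StringBuilder();
        var doc = new HtmlDocument();

        doc.LoadHtml(html);

        // Select text nodes only: with "//*" the text of a nested element
        // would be repeated once for every ancestor element.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (textNodes == null)
            return string.Empty;

        foreach (HtmlNode node in textNodes)
        {
            // Skip nodes that hold only whitespace between tags.
            if (string.IsNullOrWhiteSpace(node.InnerText))
                continue;

            output.AppendLine(HtmlEntity.DeEntitize(node.InnerText));
        }

        return output.ToString();
    }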

To get more specific formatting results, use different XPath expressions with the SelectNodes method. (The code presented here has not actually been tested, and you'll probably want something a little more precise.)
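As an alternative to tuning a single XPath query, the line-break behaviour the question asks for could be approximated by walking the parsed tree and deciding per element what to emit. The following is only a minimal sketch under assumptions: the class name HtmlToText and its Convert and Walk methods are hypothetical helpers (not part of HTML Agility Pack), the choice to break after br, p and div is illustrative, and the exact result for unusual or malformed markup may depend on the parser version.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class HtmlToText
{
    public static string Convert(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var output = new StringBuilder();
        Walk(doc.DocumentNode, output);

        // Drop leading and trailing whitespace, including a final line break.
        return output.ToString().Trim();
    }

    private static void Walk(HtmlNode node, StringBuilder output)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Text:
                // Decode entities such as &amp; into plain characters.
                output.Append(HtmlEntity.DeEntitize(node.InnerText));
                return;
            case HtmlNodeType.Comment:
                return; // comments are not visible text
        }

        // Element names are exposed in lower case by the parser.
        if (node.Name == "script" || node.Name == "style")
            return; // not visible text either

        if (node.Name == "br")
        {
            output.Append(Environment.NewLine);
            return;
        }

        foreach (HtmlNode child in node.ChildNodes)
            Walk(child, output);

        // Assumption: block-level elements end with a line break.
        if (node.Name == "p" || node.Name == "div")
            output.Append(Environment.NewLine);
    }
}
```

With the input from the question, HtmlToText.Convert("asdas<br/>asdas") should give the two words on separate lines. Tabs or spaces for margins could be added in the same place where the line breaks are appended.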