media media - 1 month ago 13
C# Question

Removing HTML content from a web request using C#

I have the following C# code, which gets the content of a web page and stores it in a string variable.


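// Requires: using System; using System.IO; using System.Net;
WebRequest request = WebRequest.Create("http://www.arsenal.com");
string html = String.Empty;
// Dispose the response and its stream as well as the reader.
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}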


The code works perfectly, but I need to store the content of the page without the HTML tags and JavaScript. Is there any way to do so?

I have found some ways of removing HTML tags, but the JavaScript and CSS styles are still left in the result. I should also mention that the approach I found for removing the HTML tags does not work well either.

Answer

As this question suggests, parsing HTML is a tricky process, and the best approach is to use a library.

I've used the HTML Agility Pack before with some success, though this question lists some other options.
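
To make the library suggestion concrete, here is a minimal sketch of how this could look with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced. The class name HtmlTextExtractor and the method ExtractText are hypothetical names made up for this example, not part of the library. The idea is to remove the script, style and comment nodes first and only then read the remaining text.

```csharp
// Sketch only: assumes the HtmlAgilityPack NuGet package is installed.
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper class, named for this example only.
public static class HtmlTextExtractor
{
    // Returns the text of an HTML string without tags, scripts, styles or comments.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes; decode entities such as &amp;.
        string text = WebUtility.HtmlDecode(doc.DocumentNode.InnerText);

        // Collapse the runs of whitespace left behind by the removed markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the html variable from the question, the call would be string text = HtmlTextExtractor.ExtractText(html);. Content that the page builds at runtime with JavaScript will not appear in the result, because the downloaded markup is only parsed, never executed.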