BCS BCS - 1 year ago 100
HTML Question

How to extract text from reasonably sane HTML?

My question is sort of like this question but I have more constraints:


  • I know the documents are reasonably sane

  • they are very regular (they all came from the same source)

  • I want about 99% of the visible text

  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML)

  • I don't care about formatting or even paragraph breaks.



Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.

Answer Source

You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
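
As a rough illustration of that approach, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and that the text you want lives under the body element. The file name input.html and the class name ExtractText are placeholders, not anything from the question, and the whitespace collapsing is an extra step that only makes sense because formatting and paragraph breaks don't matter here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

class ExtractText
{
    static void Main(string[] args)
    {
        // Placeholder path; pass the real file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.html";

        var doc = new HtmlDocument();
        doc.Load(path);

        // Find the element to extract from; fall back to the whole
        // document if there is no <body>.
        HtmlNode root = doc.DocumentNode.Descendants("body").FirstOrDefault()
                        ?? doc.DocumentNode;

        // Drop comments, scripts and styles so their contents cannot
        // leak into the extracted text. ToList() takes a snapshot so the
        // tree is not modified while it is being enumerated.
        var unwanted = root.Descendants()
                           .Where(n => n.NodeType == HtmlNodeType.Comment
                                    || n.Name == "script"
                                    || n.Name == "style")
                           .ToList();
        foreach (HtmlNode node in unwanted)
        {
            node.Remove();
        }

        // InnerText concatenates the text of all descendants;
        // DeEntitize turns entities such as &amp; back into characters.
        string text = HtmlEntity.DeEntitize(root.InnerText);

        // Formatting is not needed, so collapse all whitespace runs.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        Console.WriteLine(text);
    }
}
```

Since the documents all come from the same source, it may be worth running this on a handful of files and checking the output by eye before batch processing the rest; if the converter wraps the content in one known element, swapping "body" for that element's name would narrow the extraction.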
