BCS BCS - 1 year ago 56
HTML Question

How to extract text from reasonably sane HTML?

My question is sort of like this question but I have more constraints:

  • I know the documents are reasonably sane

  • they are very regular (they all came from the same source)

  • I want about 99% of the visible text

  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML)

  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.


You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
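
Something along these lines might work as a starting point. It is only a sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name HtmlTextExtractor, the method ExtractText, and the sample markup are made up for illustration. It removes script and style elements first (their contents can otherwise show up in InnerText), takes the body element via Descendants, decodes entities, and collapses whitespace since formatting doesn't matter here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HtmlTextExtractor // hypothetical name, for illustration only
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect <script> and <style> nodes first, then remove them,
        // so the tree is not modified while it is being enumerated.
        var hidden = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in hidden)
        {
            node.Remove();
        }

        // Prefer <body>; fall back to the whole document for fragments.
        var root = doc.DocumentNode.Descendants("body").FirstOrDefault()
                   ?? doc.DocumentNode;

        // InnerText keeps entities such as &amp; encoded, so decode them,
        // then collapse runs of whitespace into single spaces.
        var text = HtmlEntity.DeEntitize(root.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        // Made-up sample input; real documents would come from disk via doc.Load(path).
        const string sample =
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><p>Hello <b>world</b></p>\n<p>Fish &amp; chips</p></body></html>";
        Console.WriteLine(ExtractText(sample));
    }
}
```

One thing to watch for: InnerText just concatenates text nodes, so if the source has no whitespace between adjacent block elements (for example `</p><p>`), the last word of one paragraph can run into the first word of the next. If that matters for your documents, you could walk the text nodes yourself and join them with a space instead.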