My question is sort of like this question but I have more constraints:
- I know the documents are reasonably sane
- they are very regular (they all came from the same source)
- I want about 99% of the visible text
- about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
- I don't care about formatting or even paragraph breaks.
Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?
I'm open to command line or batch processing tools as well as C/C#/D libraries.
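
To make the goal concrete, here is a rough sketch of the kind of extraction described above. It is only an illustration, and it assumes Python's standard `html.parser` module rather than any of the languages named in the question. The names `TextExtractor` and `html_to_text` are made up for this example. It drops all tags, skips the contents of `script` and `style`, decodes character references, and collapses whitespace, since formatting and paragraph breaks don't matter here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and anything inside script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse every whitespace run.
        # Caveat: a word split by inline markup (e.g. <b>Hel</b>lo) gains a space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of an HTML string as one whitespace-normalized line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><p>Hello &amp; <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

Joining the chunks with a space is a deliberate trade-off in this sketch: it keeps adjacent paragraphs from running together, at the cost of occasionally splitting a word that has inline markup in the middle of it.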