VB.NET Question

Getting the text off a webpage (not the HTML source)

How would I put the contents of a webpage into a string?

It would be the same as pressing Ctrl+A, then copying and pasting the result.

Is there a way to do this programmatically, without SendKeys?

I do not want to look at the HTML source at all; I just want to copy the text on the site.

Answer

I've done a fair bit of screen scraping for applications and have found SgmlReader invaluable: https://github.com/MindTouch/SGMLReader

There is some sample code on that page, but I've added a little extra here so that it returns exactly what you want:

Imports System.Xml
Imports System.IO
Imports System.Net
Imports System.Text

Function FromHtml(ByVal reader As TextReader) As XmlDocument
    ' Set up SgmlReader to parse the input as HTML
    Dim sgmlReader As Sgml.SgmlReader = New Sgml.SgmlReader()
    sgmlReader.DocType = "HTML"
    sgmlReader.WhitespaceHandling = WhitespaceHandling.None
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
    sgmlReader.InputStream = reader
    ' Create the document and load the parsed HTML into it
    Dim doc As XmlDocument = New XmlDocument()
    doc.PreserveWhitespace = True
    doc.XmlResolver = Nothing
    doc.Load(sgmlReader)
    Return doc
End Function

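Function LoadWebText(ByVal URL As String) As String
    ' Download the raw page; the bytes are assumed to be UTF-8 encoded
    Dim html As String
    Using objWebClient As New WebClient()
        html = Encoding.UTF8.GetString(objWebClient.DownloadData(URL))
    End Using

    ' Parse the HTML into an XML document and return only its text content
    Dim xml As XmlDocument = FromHtml(New StringReader(html))
    Return xml.InnerText
End Function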
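
One thing to watch for: InnerText concatenates every text node in the document, so the contents of script and style elements may end up in the result too, which is not what Ctrl+A would copy. Here is a rough sketch of one way around that, assuming the page has already been parsed into an XmlDocument (for example by FromHtml). VisibleText is a hypothetical helper name of mine, not part of SgmlReader.

```vbnet
Imports System.Collections.Generic
Imports System.Xml

Module VisibleTextSketch
    ' Hypothetical helper: removes <script> and <style> elements from the
    ' document, then returns the text that is left.
    ' Note: this modifies the document that is passed in.
    Function VisibleText(ByVal doc As XmlDocument) As String
        ' local-name() keeps the match working even if the page declares
        ' an XHTML namespace on its root element.
        Dim hidden As XmlNodeList =
            doc.SelectNodes("//*[local-name()='script' or local-name()='style']")

        ' Copy the matches into a list first, so the document is not
        ' changed while the query result is still being enumerated.
        Dim toRemove As New List(Of XmlNode)
        For Each node As XmlNode In hidden
            toRemove.Add(node)
        Next

        For Each node As XmlNode In toRemove
            If node.ParentNode IsNot Nothing Then
                node.ParentNode.RemoveChild(node)
            End If
        Next

        Return doc.InnerText
    End Function
End Module
```

If that suits your pages, LoadWebText could return VisibleText(xml) instead of xml.InnerText.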