H.Fadlallah - 1 month ago
Vb.net Question

Parsing multiple HTML tables with different structures into a DataSet

I used the following code (with Html Agility Pack) to parse multiple HTML tables from a saved web page into a DataTable. It only works when all the tables have the same structure:

Imports System.IO
Imports System.Net

Public Sub ParseHtmlTable(ByVal HtmlFilePath As String)

    Dim webStream As Stream
    Dim webResponse = ""
    Dim req As FileWebRequest
    Dim res As FileWebResponse

    req = WebRequest.Create("file:///" & HtmlFilePath)
    req.Method = "GET" ' Method of sending the request (GET/POST)
    res = req.GetResponse ' Send the request
    webStream = res.GetResponseStream() ' Get the response

    Dim webStreamReader As New StreamReader(webStream)

    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(webStreamReader.ReadToEnd())

    ' Selects the rows of ALL tables in the document at once
    Dim nodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//table/tr")

    Dim dtTable As New DataTable("Table1")

    ' The columns come from the header row of the first table only
    Dim Headers As List(Of String) = nodes(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList

    For Each Hr In Headers
        dtTable.Columns.Add(Hr)
    Next

    For Each node As HtmlAgilityPack.HtmlNode In nodes
        Dim Row = node.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray
        dtTable.Rows.Add(Row)
    Next

    dtTable.WriteXml("G:\1.xml", XmlWriteMode.WriteSchema)

End Sub


However, I cannot parse multiple HTML tables with different structures into a DataSet (one DataTable per HTML table), like the ones on this page. Any suggestions?

Answer

The main problem with your code is that it processes the tr elements of all tables in a single pass. Those tr elements belong to different tables with different column counts, so the rows of each table should be parsed in a separate pass.
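To illustrate why the single pass breaks, here is a minimal sketch. The module name and the literal values are made up for illustration; they mirror a two-column table followed by a row from a three-column table. The DataTable gets its columns from the first header row, so a wider row from another table no longer fits:

```vbnet
Imports System
Imports System.Data

Module SinglePassProblem
    Sub Main()
        ' Columns taken from the header row of the first table (two columns).
        Dim dt As New DataTable("Table1")
        dt.Columns.Add("Column1")
        dt.Columns.Add("Column2")

        ' A row that belongs to the first table fits.
        dt.Rows.Add("1", "11")

        Try
            ' A row from a second, three-column table has more values than columns.
            dt.Rows.Add("a", "aa", "aaa")
        Catch ex As ArgumentException
            Console.WriteLine("Row rejected: " & ex.Message)
        End Try
    End Sub
End Module
```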

There are several ways to solve the problem, but in all of them you should process the rows of each table separately.

For example, you can group the rows by their parent table and then process the rows of each group separately:

Public Function GetDataSet(html As String) As DataSet
    Dim ds As DataSet = New DataSet
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html)
    ' Group the rows by parent node, so each group holds the rows of one table.
    ' ToList() materializes the groups once instead of re-running the query per index.
    Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
                                     .GroupBy(Function(x) x.ParentNode) _
                                     .ToList()
    For i As Integer = 0 To tables.Count - 1
        Dim rows = tables(i).ToList()
        ds.Tables.Add(String.Format("Table {0}", i))
        ' The first row of each table is expected to contain the th header cells.
        Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()
        For Each Hr In headers
            ds.Tables(i).Columns.Add(Hr)
        Next
        ' The remaining rows (index 1 and up) are data rows.
        For j As Integer = 1 To rows.Count - 1
            Dim row = rows(j)
            Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
            ds.Tables(i).Rows.Add(dr)
        Next
    Next
    Return ds
End Function

And here is the usage:

Dim html = System.IO.File.ReadAllText("D:\file.html")
Dim ds = GetDataSet(html)
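Another way to keep the tables apart is to loop over the table elements themselves and select the rows of each one. The following is only a sketch under some assumptions: the names HtmlTableParser and GetDataSetPerTable are hypothetical, header cells are th elements, header names are unique within a table, and the "Unnamed N" names used for padding do not collide with real headers. It also shows the kind of guards the grouped version leaves out:

```vbnet
Imports System.Data
Imports System.Linq

Public Module HtmlTableParser
    ' Hypothetical helper: builds one DataTable per <table> element.
    Public Function GetDataSetPerTable(html As String) As DataSet
        Dim ds As New DataSet()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches.
        Dim tableNodes = doc.DocumentNode.SelectNodes("//table")
        If tableNodes Is Nothing Then Return ds

        For Each tableNode As HtmlAgilityPack.HtmlNode In tableNodes
            ' ".//tr" also finds rows wrapped in thead or tbody.
            ' It finds rows of nested tables too, which this sketch does not handle.
            Dim rows = tableNode.SelectNodes(".//tr")
            If rows Is Nothing Then Continue For

            Dim dt = ds.Tables.Add(String.Format("Table {0}", ds.Tables.Count))

            For Each row As HtmlAgilityPack.HtmlNode In rows
                Dim headerCells = row.Elements("th").Select(Function(x) x.InnerText.Trim()).ToList()
                Dim dataCells = row.Elements("td").Select(Function(x) x.InnerText.Trim()).ToArray()

                If dt.Columns.Count = 0 AndAlso headerCells.Count > 0 Then
                    ' First header row defines the columns (duplicate names would throw).
                    For Each name In headerCells
                        dt.Columns.Add(name)
                    Next
                ElseIf dataCells.Length > 0 Then
                    ' Pad with extra columns so a wider row (or a table without headers) still fits.
                    While dt.Columns.Count < dataCells.Length
                        dt.Columns.Add(String.Format("Unnamed {0}", dt.Columns.Count))
                    End While
                    dt.Rows.Add(dataCells)
                End If
            Next
        Next
        Return ds
    End Function
End Module
```

It is called the same way as GetDataSet, for example Dim ds = GetDataSetPerTable(html). Rows with fewer cells than columns are accepted as they are; the missing values stay empty.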

Note

  • The code above is just an example; in a real-world application you need null checking and exception handling.
  • I also used HTML Agility Pack to parse the HTML. Since you also used a web request and web response, be aware that in some web scraping tasks you need to work with the DOM after the page's scripts have executed, and the response you receive through a web request will not be useful. In these cases you can use a web browser control such as WebBrowser, query the DOM for tables, and extract the rows of each table.
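For the WebBrowser route mentioned in the last point, a rough sketch could look like the following. It assumes a Windows Forms WebBrowser control whose document has finished loading (for example, called from its DocumentCompleted handler). The names BrowserTableReader, ReadTables and SafeText are hypothetical, and nested tables are not handled:

```vbnet
Imports System.Data
Imports System.Windows.Forms

Public Class BrowserTableReader
    ' Hypothetical helper: call it once the WebBrowser has finished loading,
    ' for example from its DocumentCompleted event handler.
    Public Shared Function ReadTables(browser As WebBrowser) As DataSet
        Dim ds As New DataSet()
        If browser.Document Is Nothing Then Return ds

        For Each tableElement As HtmlElement In browser.Document.GetElementsByTagName("table")
            Dim dt = ds.Tables.Add(String.Format("Table {0}", ds.Tables.Count))

            For Each rowElement As HtmlElement In tableElement.GetElementsByTagName("tr")
                Dim headerCells = rowElement.GetElementsByTagName("th")
                Dim dataCells = rowElement.GetElementsByTagName("td")

                If dt.Columns.Count = 0 AndAlso headerCells.Count > 0 Then
                    ' First header row defines the columns (duplicate names would throw).
                    For Each cell As HtmlElement In headerCells
                        dt.Columns.Add(SafeText(cell))
                    Next
                ElseIf dataCells.Count > 0 AndAlso dataCells.Count <= dt.Columns.Count Then
                    ' Simplification: rows wider than the header row are skipped.
                    Dim values(dataCells.Count - 1) As Object
                    For k As Integer = 0 To dataCells.Count - 1
                        values(k) = SafeText(dataCells(k))
                    Next
                    dt.Rows.Add(values)
                End If
            Next
        Next
        Return ds
    End Function

    ' InnerText can be Nothing for empty cells, so fall back to an empty string.
    Private Shared Function SafeText(element As HtmlElement) As String
        Return If(element.InnerText, String.Empty).Trim()
    End Function
End Class
```

Because the browser control works on the rendered DOM, the rows it returns reflect any changes made by the page's scripts, which is the reason to prefer it over a plain web request in those cases.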

Sample Input File

And here is the sample input file which I used to test:

<html>
<head><title>Test</title></head>
<body>
    <div>Contents:</div>
    <table>
        <tr>
            <th>Column1</th> <th>Column2</th>
        </tr>
        <tr>
            <td>1</td> <td>11</td>
        </tr>
        <tr>
            <td>2</td> <td>22</td>
        </tr>
    </table>
    <table>
        <tr>
            <th>Column1</th> <th>Column2</th> <th>Column3</th>
        </tr>
        <tr>
            <td>a</td> <td>aa</td> <td>aaa</td>
        </tr>
        <tr>
            <td>b</td> <td>bb</td> <td>bbb</td>
        </tr>
    </table>
</body>
</html>