H.Fadlallah H.Fadlallah - 1 year ago 61
HTML Question

Error while parsing html table into dataset

I am trying to parse tables from html files into dataset. I created a windows application using vb.net and i am using the following function:

Public Function GetDataSet(ByVal strWebFilePath As String) As DataSet

Dim html As String = System.IO.File.ReadAllText(strWebFilePath)

html = WebUtility.HtmlDecode(html)

Dim dsHtml As New DataSet
Dim htmldoc As New HtmlAgilityPack.HtmlDocument

Dim tables = htmldoc.DocumentNode.SelectNodes("//table//tr") _
.GroupBy(Function(x) x.Ancestors("table").First())

For i As Integer = 0 To tables.Count - 1
Dim rows = tables(i).ToList()
dsHtml.Tables.Add(String.Format("Table {0}", i))

Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()

If headers.Count > 0 Then

For Each Hr In headers

For j As Integer = 1 To rows.Count - 1
Dim row = rows(j)
Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()


headers = rows(0).Elements("td").Select(Function(x) x.InnerText.Trim).ToList()

For ColumnIndex As Integer = 0 To headers.Count - 1
dsHtml.Tables(i).Columns.Add("F" & ColumnIndex.ToString)

For j As Integer = 0 To rows.Count - 1
Dim row = rows(j)
Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()

End If


Return dsHtml
End Function

Everything is working fine until i am parsing html table that it's first row contains a column having
(first row is considered a header even if it not containing
). So it throws this exception:

An unhandled exception of type 'System.ArgumentException' occurred in System.Data.dll
Additional information: Input array is longer than the number of columns in this table.

Consider this table example:

<tr><td colspan=2>Links</td></tr>

Is there a way to split the first row into 2 columns :

  • the first containing

  • The second containing auto-generated value (i.e.:

Answer Source

You need some conventions about how to resolve column spans, row spans, missing cells, extra cells, inconsistent values in cells and so on. In general It's hard to parse an html table to a DataTable without knowing the structure of data table.

In this answer I'll focus on column span on header.

What's the problem?

You are relying on <th> counts in first <tr> and add columns to data table based on found <th>. So if you have a table like this:

        <th colspan="2">A</th>

Then you suppose you have 2 columns and when adding rows, since you have 3 elements in each row, you receive the exception.

How can I solve the problem?

Here I decided to convert <th colspan="n">C</th> to n columns with names C1, C2, ... , Cn.

Dim headers = rows(0).Elements("th").Select(Function(x) _
    New With
        .Name = x.InnerText.Trim,
        .Count = If(x.Attributes("colspan") Is Nothing, _
            1, Integer.Parse(x.Attributes("colspan").Value))

Then when adding columns to data table:

For Each Hr In headers
    For index = 1 To Hr.Count
        Dim postFix = If(Hr.Count > 1, index.ToString(), "")
        dsHtml.Tables(i).Columns.Add(Hr.Name & postFix)
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download