H.Fadlallah H.Fadlallah - 9 days ago 6
HTML Question

Error while parsing html table into dataset

I am trying to parse tables from html files into dataset. I created a windows application using vb.net and i am using the following function:

Public Function GetDataSet(ByVal strWebFilePath As String) As DataSet

Dim html As String = System.IO.File.ReadAllText(strWebFilePath)

html = WebUtility.HtmlDecode(html)

Dim dsHtml As New DataSet
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml(html)

Dim tables = htmldoc.DocumentNode.SelectNodes("//table//tr") _
.GroupBy(Function(x) x.Ancestors("table").First())

For i As Integer = 0 To tables.Count - 1
Dim rows = tables(i).ToList()
dsHtml.Tables.Add(String.Format("Table {0}", i))

Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()

If headers.Count > 0 Then

For Each Hr In headers
dsHtml.Tables(i).Columns.Add(Hr)
Next

For j As Integer = 1 To rows.Count - 1
Dim row = rows(j)
Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
dsHtml.Tables(i).Rows.Add(dr)
Next

Else

headers = rows(0).Elements("td").Select(Function(x) x.InnerText.Trim).ToList()

For ColumnIndex As Integer = 0 To headers.Count - 1
dsHtml.Tables(i).Columns.Add("F" & ColumnIndex.ToString)
Next

For j As Integer = 0 To rows.Count - 1
Dim row = rows(j)
Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
dsHtml.Tables(i).Rows.Add(dr)
Next


End If

Next

Return dsHtml
End Function


Everything is working fine until i am parsing html table that it's first row contains a column having
colspan=2
(first row is considered a header even if it not containing
<th>
). So it throws this exception:


An unhandled exception of type 'System.ArgumentException' occurred in System.Data.dll
Additional information: Input array is longer than the number of columns in this table.


Consider this table example:

<table>
<tr><td colspan=2>Links</td></tr>
<tr><td>1</td><td>www.stackoverflow.com</td></tr>
<tr><td>2</td><td>www.sqlservercentral.com</td></tr>
<tr><td>3</td><td>www.dba.stackexchange.com/</td></tr>
</table>


Is there a way to split the first row into 2 columns :


  • the first containing
    Links

  • The second containing auto-generated value (i.e.:
    Col1
    )


Answer

You need some conventions about how to resolve column spans, row spans, missing cells, extra cells, inconsistent values in cells and so on. In general It's hard to parse an html table to a DataTable without knowing the structure of data table.

In this answer I'll focus on column span on header.

What's the problem?

You are relying on <th> counts in first <tr> and add columns to data table based on found <th>. So if you have a table like this:

<table>
    <tr>
        <th colspan="2">A</th>
        <th>B</th>
    </tr>
    <tr>
        <td>1</td>
        <td>11</td>
        <td>111</td>
    </tr>
</table>

Then you suppose you have 2 columns and when adding rows, since you have 3 elements in each row, you receive the exception.

How can I solve the problem?

Here I decided to convert <th colspan="n">C</th> to n columns with names C1, C2, ... , Cn.

Dim headers = rows(0).Elements("th").Select(Function(x) _
    New With
    {
        .Name = x.InnerText.Trim,
        .Count = If(x.Attributes("colspan") Is Nothing, _
            1, Integer.Parse(x.Attributes("colspan").Value))
    }).ToList()

Then when adding columns to data table:

For Each Hr In headers
    For index = 1 To Hr.Count
        Dim postFix = If(Hr.Count > 1, index.ToString(), "")
        dsHtml.Tables(i).Columns.Add(Hr.Name & postFix)
    Next
Next