Niklas - 1 month ago
Java Question

Extract Data out of table with JSoup

I want to extract this table with jsoup and store its content in a "table" array. The first tr tag is the table header. All following tr tags (not all shown here) hold the content.

<table style=h2 width=100% cellspacing="0" cellpadding="4" border="1" bgColor="#FFFFFF">
<tr>
<td align="left" bgcolor="#9999FF" >
<!-- 0 -->
Kl.
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 3 -->
Std.
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 4 -->
Lehrer
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 5 -->
Fach
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 6 -->
Raum
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 7 -->
VLehrer
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 8 -->
VFach
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 9 -->
VRaum
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 13 -->
Info
</td>
</tr>
<tr>
<!-- 1 0 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 3 -->
<td align="left" bgcolor="#FFFFFF" >
4
</td>
<!-- 1 4 -->
<td align="left" bgcolor="#FFFFFF" >
Méta
</td>
<!-- 1 5 -->
<td align="left" bgcolor="#FFFFFF" >
HU
</td>
<!-- 1 6 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 7 -->
<td align="left" bgcolor="#FFFFFF" >
Shne
</td>
<!-- 1 8 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 9 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 13 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
</tr>


I already tried this one and a few others, but I couldn't get them to work for me:
Using JSoup To Extract HTML Table Contents

Answer

Here's some example code that selects only the header:

Element tableHeader = doc.select("tr").first();


for( Element element : tableHeader.children() )
{
    // Here you can do something with each element
    System.out.println(element.text());
}

You get the Document by ...

  1. parsing a file: Document doc = Jsoup.parse(f, null); (where f is the File and null the charset; please see the jsoup documentation for more information)

  2. parsing a website: Document doc = Jsoup.connect("http://your.url.here").get(); (don't forget the http://)

The output:

Kl.
Std.
Lehrer
Fach
Raum
VLehrer
VFach
VRaum
Info
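For a quick experiment without a file or a URL, the HTML can also be parsed straight from a String with Jsoup.parse(String). The following is only a minimal sketch: the class name HeaderFromString, the helper headerCells and the shortened sample table are made up for illustration, and the jsoup library is assumed to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup or of the original answer.
final class HeaderFromString {

    /** Returns the text of every cell in the first 'tr' of the given HTML. */
    static List<String> headerCells(String html) {
        Document doc = Jsoup.parse(html);             // parse an in-memory HTML string
        Element header = doc.select("tr").first();    // null if there is no 'tr' at all
        List<String> names = new ArrayList<>();
        if (header != null) {
            for (Element cell : header.children()) {
                names.add(cell.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table from the question
        String html = "<table>"
                + "<tr><td>Kl.</td><td>Std.</td><td>Lehrer</td></tr>"
                + "<tr><td>&nbsp;</td><td>4</td><td>HU</td></tr>"
                + "</table>";
        for (String name : headerCells(html)) {
            System.out.println(name);
        }
    }
}
```

Returning an empty list when no row is found avoids the NullPointerException that calling children() on a missing header would otherwise cause.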

Now, if you need an array (or better, a List) of all entries, you can create a new class that stores all the information for each entry. Then you parse the HTML with jsoup, fill all fields of the class and add each instance to a list.

// Note: all values are strings; you'll want better types (int, enum, etc.) in real code, but for an example it's enough.
public class Entry
{
    private String klasse;
    private String stunde;
    private String lehrer;
    private String fach;
    private String raum;
    private String vLehrer;
    private String vFach;
    private String vRaum;
    private String info;


    // constructor(s) and getter / setter

    /*
     * Btw. it's a good idea to use two constructors here: one with all arguments and one empty. That way you can create a new instance without knowing any data and fill it in with the setter methods afterwards.
     */
}
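The parts left out above (constructors, getters / setters and toString()) could look roughly like the following. This is only a sketch under assumptions: the setter names follow the calls used elsewhere in this answer (setvLehrer, setvFach, ...), the toString() layout is chosen to resemble the Entry{...} format of the printed result, and the class is declared without public only so that the snippet compiles from any file name.

```java
// Sketch of Entry with the omitted members filled in; details are assumptions.
final class Entry {
    private String klasse;
    private String stunde;
    private String lehrer;
    private String fach;
    private String raum;
    private String vLehrer;
    private String vFach;
    private String vRaum;
    private String info;

    /** Empty constructor: create the instance first, fill it with setters later. */
    public Entry() { }

    /** All-arguments constructor for when every value is already known. */
    public Entry(String klasse, String stunde, String lehrer, String fach, String raum,
                 String vLehrer, String vFach, String vRaum, String info) {
        this.klasse = klasse;
        this.stunde = stunde;
        this.lehrer = lehrer;
        this.fach = fach;
        this.raum = raum;
        this.vLehrer = vLehrer;
        this.vFach = vFach;
        this.vRaum = vRaum;
        this.info = info;
    }

    public String getKlasse() { return klasse; }
    public void setKlasse(String klasse) { this.klasse = klasse; }
    public String getStunde() { return stunde; }
    public void setStunde(String stunde) { this.stunde = stunde; }
    public String getLehrer() { return lehrer; }
    public void setLehrer(String lehrer) { this.lehrer = lehrer; }
    public String getFach() { return fach; }
    public void setFach(String fach) { this.fach = fach; }
    public String getRaum() { return raum; }
    public void setRaum(String raum) { this.raum = raum; }
    public String getvLehrer() { return vLehrer; }
    public void setvLehrer(String vLehrer) { this.vLehrer = vLehrer; }
    public String getvFach() { return vFach; }
    public void setvFach(String vFach) { this.vFach = vFach; }
    public String getvRaum() { return vRaum; }
    public void setvRaum(String vRaum) { this.vRaum = vRaum; }
    public String getInfo() { return info; }
    public void setInfo(String info) { this.info = info; }

    /** Renders all fields in declaration order; unset fields appear as "null". */
    @Override
    public String toString() {
        return "Entry{klasse=" + klasse + ", stunde=" + stunde + ", lehrer=" + lehrer
                + ", fach=" + fach + ", raum=" + raum + ", vLehrer=" + vLehrer
                + ", vFach=" + vFach + ", vRaum=" + vRaum + ", info=" + info + "}";
    }

    public static void main(String[] args) {
        Entry e = new Entry();      // start empty ...
        e.setStunde("4");           // ... and fill in what is known
        e.setFach("HU");
        System.out.println(e);
    }
}
```

Any field that is never set stays null and is printed as null by toString(), which makes a forgotten setter call easy to spot in the output.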

Next, the code which fills your entries (incl. the list where they are stored):

List<Entry> entries = new ArrayList<>();        // All entries are saved here
boolean firstSkipped = false;                   // Used to skip first 'tr' tag


for( Element element : doc.select("tr") )       // Select all 'tr' tags from document
{
    // Skip the first 'tr' tag since it's the header
    if( !firstSkipped )
    {
        firstSkipped = true;
        continue;
    }

    int index = 0;                              // Instead of index you can use 0, 1, 2, ...
    Entry tableEntry = new Entry();
    Elements td = element.select("td");         // Select all 'td' tags of the 'tr'

    // Fill your entry
    tableEntry.setKlasse(td.get(index++).text());
    tableEntry.setStunde(td.get(index++).text());
    tableEntry.setLehrer(td.get(index++).text());
    tableEntry.setFach(td.get(index++).text());
    tableEntry.setRaum(td.get(index++).text());
    tableEntry.setvLehrer(td.get(index++).text());
    tableEntry.setvFach(td.get(index++).text());
    tableEntry.setvRaum(td.get(index++).text());
    tableEntry.setInfo(td.get(index++).text());

    entries.add(tableEntry);                    // Finally add it to the list
}

If you use the HTML from your question, you'll get this output:

[Entry{klasse= , stunde=4, lehrer=Méta, fach=HU, raum= , vLehrer=Shne, vFach= , vRaum= , info= }]

Note: I simply used System.out.println(entries); for that, so the output format comes from the toString() method of Entry.
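If a plain "table" array as mentioned in the question is enough, the rows can also be collected as String arrays without a dedicated class. This is a rough sketch, not a definitive implementation: the class TableRows, the method dataRows and the expectedColumns parameter are hypothetical names, jsoup is assumed to be on the classpath, and treating a cell that only contains &nbsp; as empty is a design choice made here.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup or of the original answer.
final class TableRows {

    /** Returns one String[] per data row; the first 'tr' (the header) is skipped. */
    static List<String[]> dataRows(Document doc, int expectedColumns) {
        List<String[]> rows = new ArrayList<>();
        Elements trs = doc.select("tr");
        for (int i = 1; i < trs.size(); i++) {          // start at 1 to skip the header row
            Elements tds = trs.get(i).select("td");
            if (tds.size() != expectedColumns) {
                continue;                                // ignore rows that do not match the layout
            }
            String[] cells = new String[expectedColumns];
            for (int c = 0; c < expectedColumns; c++) {
                // Replace non-breaking spaces (from &nbsp;) so such cells end up empty
                cells[c] = tds.get(c).text().replace('\u00a0', ' ').trim();
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table from the question
        String html = "<table>"
                + "<tr><td>Kl.</td><td>Std.</td><td>Fach</td></tr>"
                + "<tr><td>&nbsp;</td><td>4</td><td>HU</td></tr>"
                + "</table>";
        for (String[] row : dataRows(Jsoup.parse(html), 3)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Checking the number of td cells before reading them avoids an IndexOutOfBoundsException from td.get(...) when a row is shorter than expected, for example a separator row.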


Please see the jsoup documentation, especially the part on the jsoup selector API.