eddy eddy - 11 months ago 49
Java Question

Jsoup to extract data from html table

I started using Jsoup today for an Android app. I have this table that I need to extract data from, but from what I can see, it's going to be tough. I need some help; the HTML for the table is below:

<TR BGCOLOR='#999999'>
<TD ALIGN='left'><span class='S09W80'><font color=#DDDDDD>CODE</span></TD>
<TD ALIGN='left'><span class='S09W80'><font color=#DDDDDD>SUBJECT NAME</span></TD>
<TD ALIGN='right'><span class='S09W80'><font color=#DDDDDD>PERIOD FROM</span></TD>
<TD ALIGN='right'><span class='S09W80'><font color=#DDDDDD>PERIOD TO</span></TD>
<TD ALIGN='right'><span class='S09W80'><font color=#DDDDDD>ENROL DATE</span></TD>
<TD ALIGN='right'><span class='S09W80'><font color=#DDDDDD>GRADE</span></TD>

followed by repetitions of

<TD ALIGN='left'><span class='S09W50'>IT142</span></TD>
<TD ALIGN='right'><span class='S09W50'>21-FEB-11</span></TD>
<TD ALIGN='right'><span class='S09W50'>17-JUN-11</span></TD>
<TD ALIGN='right'><span class='S09W50'>22-FEB-11</span></TD>
<TD ALIGN='center'><span class='S09W80'>B-</span></TD>

But how do I use doc.select() here? Which selector should I use?

Answer Source

Not really an Android question, but a CSS selector question. You can read more about it at http://www.w3.org/TR/CSS2/selector.html

Doing screen scraping like this is always tricky and there is no "right" solution.

You will need to perform multiple select steps.

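  1. A selector like "body > table > tr". With Jsoup, a looser selector such as "table tr" may be safer, because its HTML parser typically inserts a TBODY element between TABLE and TR, so the strict child chain may not match. Take the first element. This will give you the initial TR element.
  2. Validate that TR element: get its child elements and check that one of them has the text "SUBJECT NAME".
  3. Then the remaining TR elements can be processed in document order.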
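
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive solution: the class name GradeTableScraper and the method extractRows are made up for this example, it assumes the page has no nested tables, and instead of taking only the first TR it scans forward until it finds a row containing "SUBJECT NAME", which is a small variation on step 1.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class; the name is an assumption for this sketch.
final class GradeTableScraper {

    /** Returns the cell texts of every row that follows the header row. */
    static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        boolean headerSeen = false;

        // "tr" matches rows whether or not the parser added a <tbody>.
        for (Element tr : doc.select("tr")) {
            Elements cells = tr.select("td");
            if (!headerSeen) {
                // Step 2: validate the header row by looking for a known caption.
                for (Element cell : cells) {
                    if ("SUBJECT NAME".equals(cell.text().trim())) {
                        headerSeen = true;
                        break;
                    }
                }
                continue; // skip the header itself and anything before it
            }
            // Step 3: collect the remaining rows in document order.
            List<String> values = new ArrayList<>();
            for (Element cell : cells) {
                values.add(cell.text().trim());
            }
            if (!values.isEmpty()) {
                rows.add(values);
            }
        }
        return rows;
    }
}
```

Each inner list then holds one row's cells in column order (code first, grade last), so the caller can map positions to the header captions. If the real page contains several tables, a narrower starting selector than "tr" would probably be needed.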