Simon - 10 months ago
Java Question

Jsoup: parse url links separately

I use jsoup to parse all url links from a content string, which is working well.

Here is the part of the content string that contains the URLs. The links appear after the labels "Download Instructions:", "Mirror:" and "Additional:":

<u>Download Instructions:</u><br/>
<a class="postlink" href=""></a><br/>
Mirror:<br/>
<a class="postlink" href=""></a><br/>
Additional:<br/>
<a class="postlink" href=""></a>

Now my goal is to parse the URLs (there can be several) that follow "Download Instructions:" and the URLs that follow "Mirror:" separately. URLs after "Additional:" should be ignored.

The code below parses all of them without distinction and adds them to a single ArrayList (url).

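int j = 0;
Document doc = Jsoup.parse(content);
Elements links = doc.select("a.postlink");
for (Element el : links) {
    // attr() returns an empty string, never null, when the attribute is missing
    String urlman = el.attr("abs:href");
    if (!urlman.isEmpty()) {
        url.add(j, urlman);
        j++;
    }
}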

Would be great if somebody could assist.

Thank you in advance.

Answer Source

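Based on the structure you posted, the label for each anchor sits two siblings before it: the node directly before the anchor is the <br/>, and the node before that is the text node (or the <u> element) holding the label. You can reach it with previousSibling().previousSibling() and then do a simple String comparison on it.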

Example Code

String source = "<u>Download Instructions:</u><br/><a class=\"postlink\" href=\"\"></a><br/>Mirror:<br/><a class=\"postlink\" href=\"\"></a><br/>Additional:<br/><a class=\"postlink\" href=\"\"></a>";

// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted here
Document doc = Jsoup.parse(source);

String downloadInstructionsUrl = "";
String mirrorUrl = "";

for (Element el : doc.select("a.postlink")) {

    // Skip the <br/> directly before the anchor; the node before that holds the label
    String identifier = el.previousSibling().previousSibling().toString();

    if (identifier.contains("Download Instructions")) {
        downloadInstructionsUrl = el.attr("abs:href");
    } else if (identifier.contains("Mirror")) {
        mirrorUrl = el.attr("abs:href");
    }
}

System.out.println("Url for download instructions: " + downloadInstructionsUrl);
System.out.println("Url for mirror: " + mirrorUrl);


Url for download instructions:
Url for mirror:
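
The example above keeps one URL per label, while the question mentions that there can be several links after each one. A possible variation, sketched below, walks the top-level nodes in document order and remembers the most recent label, so every a.postlink is filed under the label it follows. The class name LinkSections, the method groupLinks and the example.com URLs are made up for illustration, and the sketch assumes the labels and links are direct children of the body and that each label ends with a colon, as in the posted snippet. It reads the plain href attribute; switch to abs:href if the links are relative and you parse with a base URI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LinkSections {

    // Hypothetical helper: groups a.postlink hrefs under the label that precedes them.
    // Only the labels registered in the result map are collected; others are ignored.
    public static Map<String, List<String>> groupLinks(String html) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("Download Instructions", new ArrayList<>());
        result.put("Mirror", new ArrayList<>());

        Document doc = Jsoup.parseBodyFragment(html);
        String current = null; // label of the section we are currently inside

        for (Node node : doc.body().childNodes()) {
            if (node instanceof Element && ((Element) node).is("a.postlink")) {
                // A link: store it only if the current label is one we want
                String href = node.attr("href");
                if (current != null && result.containsKey(current) && !href.isEmpty()) {
                    result.get(current).add(href);
                }
                continue;
            }

            // Anything else may be a label, either bare text or wrapped in an element
            String text = "";
            if (node instanceof TextNode) {
                text = ((TextNode) node).text();
            } else if (node instanceof Element) {
                text = ((Element) node).text();
            }
            text = text.trim();
            if (text.endsWith(":")) {
                current = text.substring(0, text.length() - 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder URLs; the real ones were not part of the posted snippet
        String html = "<u>Download Instructions:</u><br/>"
                + "<a class=\"postlink\" href=\"http://example.com/a\"></a><br/>"
                + "<a class=\"postlink\" href=\"http://example.com/b\"></a><br/>"
                + "Mirror:<br/>"
                + "<a class=\"postlink\" href=\"http://example.com/c\"></a><br/>"
                + "Additional:<br/>"
                + "<a class=\"postlink\" href=\"http://example.com/d\"></a>";
        System.out.println(groupLinks(html));
    }
}
```

With this input, the two links after "Download Instructions:" and the one after "Mirror:" end up in separate lists, and the link after "Additional:" is dropped because that label is not registered in the map.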