Simon Simon - 1 month ago 17
Java Question

Jsoup: parse url links separately

I use jsoup to parse all URL links from a content string, which works well.

Here is the part of the content string that contains the URLs. As you can see, the links appear after the texts "Download Instructions:", "Mirror:" and "Additional:":

<u>Download Instructions:</u><br/>
<a class="postlink" href="https://test.com/info">https://test.com/info</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://global.eu/navi.html">http://global.eu/navi.html</a>
<br/>Additional:<br/>
<a class="postlink" href="http://main.org/navi.html">http://main.org/navi.html</a>


Now my goal is to parse all URLs (there can be several) after the text "Download Instructions:" and after the text "Mirror:" separately. URLs after "Additional:" should be ignored.

The code below only parses them all and adds them to an ArrayList (url).

int j = 0;
Document doc = Jsoup.parse(content);
Elements links = doc.select("a.postlink");
for (Element el : links) {
    // attr() returns an empty string, never null, when the attribute is missing
    String urlman = el.attr("abs:href");
    if (!urlman.isEmpty()) {
        url.add(j, urlman);
        j++;
    }
}


Would be great if somebody could assist.

Thank you in advance.

Answer

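Based on the structure you posted, you can look two siblings back from each anchor (previousSibling() called twice) to reach the node that labels it: the <u> element for the first link and a text node for the others. Then do a string comparison on that label. Note that whitespace between tags becomes its own text node, so this fixed offset fits the single-line source string used in the example.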

Example Code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String source = "<u>Download Instructions:</u><br/><a class=\"postlink\" href=\"https://test.com/info\">https://test.com/info</a><br/>Mirror:<br/><a class=\"postlink\" href=\"http://global.eu/navi.html\">http://global.eu/navi.html</a><br/>Additional:<br/><a class=\"postlink\" href=\"http://main.org/navi.html\">http://main.org/navi.html</a>";

// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted here.
Document doc = Jsoup.parse(source);

String downloadInstructionsUrl = "";
String mirrorUrl = "";

for (Element el : doc.select("a.postlink")) {

    String identifier = el.previousSibling().previousSibling().toString();

    if(identifier.contains("Download Instructions")){
        downloadInstructionsUrl = el.attr("abs:href");
    }else if(identifier.contains("Mirror")){
        mirrorUrl = el.attr("abs:href");
    }
}

System.out.println("Url for download instructions: " + downloadInstructionsUrl);
System.out.println("Url for mirror: " + mirrorUrl);

Output

Url for download instructions: https://test.com/info
Url for mirror: http://global.eu/navi.html
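
The example above keeps a single URL per label and relies on the label sitting exactly two siblings before the link. Since the question mentions that there can be several URLs per label, here is a minimal sketch of one possible variant that walks backwards over the preceding siblings instead, skipping <br> elements, whitespace and other links, and collects the URLs into lists. The class name LinkGrouper, the method names and the second mirror URL (http://global.eu/alt.html) are made up for illustration, and the sketch assumes that links under one label are separated only by <br> tags or whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper, not part of jsoup.
class LinkGrouper {

    // Labels of interest; links under any other label (e.g. "Additional:") are ignored.
    private static final String[] LABELS = {"Download Instructions", "Mirror"};

    // Groups the href of every a.postlink by the nearest label found before it.
    static Map<String, List<String>> groupLinks(String html) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String label : LABELS) {
            result.put(label, new ArrayList<String>());
        }
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a.postlink")) {
            String label = findLabel(link);
            if (label != null) {
                // Plain "href" is enough here because the sample URLs are already absolute.
                result.get(label).add(link.attr("href"));
            }
        }
        return result;
    }

    // Walks backwards over the preceding siblings until a node with visible text is found.
    private static String findLabel(Element link) {
        for (Node n = link.previousSibling(); n != null; n = n.previousSibling()) {
            if (n instanceof Element && ((Element) n).is("a.postlink")) {
                continue; // another link under the same label
            }
            String text = "";
            if (n instanceof TextNode) {
                text = ((TextNode) n).text();
            } else if (n instanceof Element) {
                text = ((Element) n).text();
            }
            if (text.trim().isEmpty()) {
                continue; // <br> or whitespace between tags
            }
            for (String label : LABELS) {
                if (text.contains(label)) {
                    return label;
                }
            }
            return null; // nearest text is some other label
        }
        return null; // no label precedes this link
    }

    public static void main(String[] args) {
        String source = "<u>Download Instructions:</u><br/>\n"
                + "<a class=\"postlink\" href=\"https://test.com/info\">https://test.com/info</a>\n"
                + "<br/>Mirror:<br/>\n"
                + "<a class=\"postlink\" href=\"http://global.eu/navi.html\">http://global.eu/navi.html</a><br/>\n"
                + "<a class=\"postlink\" href=\"http://global.eu/alt.html\">http://global.eu/alt.html</a>\n"
                + "<br/>Additional:<br/>\n"
                + "<a class=\"postlink\" href=\"http://main.org/navi.html\">http://main.org/navi.html</a>";
        Map<String, List<String>> grouped = groupLinks(source);
        System.out.println("Urls for download instructions: " + grouped.get("Download Instructions"));
        System.out.println("Urls for mirror: " + grouped.get("Mirror"));
    }
}
```

Because the walk stops at the first node with visible text, a link under "Additional:" never reaches the "Mirror:" label further up and is dropped. If the real content puts other text between links of the same group, the skipping rules would need to be adjusted.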