Ahmed Ahmed - 4 months ago
Java Question

java jsoup: retrieving links from article

<article itemprop="articleBody">
<p channel="wp.com" class="interstitial-link">
<i>
[<a href="www.URL.com" shape="rect">Link Text</a>]
</i>
</p>
</article>


How would I retrieve the URL and link text from this HTML doc with Jsoup?
I want the output to look like this:

"Link Text[URL]"

Edit: I want to retrieve only the links within

<article itemprop="articleBody"> ... </article>


and not those from the rest of the page. Also, I want all the links within it, not just one.

Answer
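    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // connect to the URL and parse the response into a Document
    // (url is a String holding the page address; get() throws IOException)
    Document doc = Jsoup.connect(url).get();

    // find the first matching link element in the article
    Element link = doc
            .select("article[itemprop=articleBody] p.interstitial-link i a")
            .first();

    // first() returns null when nothing matches the selector
    if (link != null) {
        // extract the link text
        // (ownText() skips text inside nested child elements; text() includes it)
        String linkText = link.ownText();

        // extract the full URL of the href
        // use this over link.attr("href") so relative URLs are resolved
        // against the document's base URI
        String linkURL = link.absUrl("href");

        // display
        System.out.printf("%s[%s]%n", linkText, linkURL);
    }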
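
To see why the absolute form is preferred over the raw attribute value, here is a minimal sketch of relative URL resolution using only `java.net.URI` from the standard library. It is not jsoup code, and jsoup's own resolution may differ in edge cases. The class name `RelativeLinkDemo`, the helper `resolve`, and the `example.com` / `other.example` addresses are assumptions made up for illustration.

```java
import java.net.URI;

// Hypothetical demo class, not part of jsoup.
class RelativeLinkDemo {

    // Resolve an href value against the URL of the page it was found on.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Placeholder base URL, assumed for illustration only.
        String base = "http://example.com/news/index.html";

        // relative to the current directory
        System.out.println(resolve(base, "story.html"));             // http://example.com/news/story.html
        // relative to the site root
        System.out.println(resolve(base, "/about"));                 // http://example.com/about
        // already absolute, so returned unchanged
        System.out.println(resolve(base, "http://other.example/x")); // http://other.example/x
        // no scheme, so it is treated as a relative path
        System.out.println(resolve(base, "www.URL.com"));            // http://example.com/news/www.URL.com
    }
}
```

Note the last case: an href written without a scheme, such as `www.URL.com`, counts as a relative path under these rules, so the page needs `http://www.URL.com` in the attribute for the link to point at that host.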

Read more about CSS Selectors


You could also iterate over every link in the article like this:

    // the loop variable is named articleLink so it does not clash
    // with the link variable declared above
    for (Element articleLink : doc.select("article[itemprop=articleBody] a")) {
        String linkText = articleLink.ownText();
        String linkURL = articleLink.absUrl("href");
        System.out.printf("%s[%s]%n", linkText, linkURL);
    }

Output

Link Text[http://www.URL.com]