Nico Hoppel - 1 month ago
Java Question

Web Crawler Amazon get span-Element

I'm crawling Amazon categories and already get the sales rank and the product URLs. Now I also want to crawl the category, but I get all of the content of the category span.

<span class="zg_hrsr_ladder">in&nbsp;<a href="https://www.amazon.de/gp/bestsellers/books/ref=pd_zg_hrsr_b_1_1">B&uuml;cher</a> &gt; <a href="https://www.amazon.de/gp/bestsellers/books/287480/ref=pd_zg_hrsr_b_1_2">Krimis & Thriller</a> &gt; <b><a href="https://www.amazon.de/gp/bestsellers/books/419954031/ref=pd_zg_hrsr_b_1_3_last">Deutschland</a></b></span>


This is an example HTML snippet. With the following code

Elements category = htmlDocument.select("span.zg_hrsr_ladder");


I get everything inside the span, but I only want the text of the <a> elements: "Bücher", "Krimis & Thriller" and "Deutschland". How can I get this information?

Answer

You want the text inside the <a> elements, so select the anchors within your span (append " a" to the selector) and call text() on each of the resulting elements.

Example Code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String source = "<span class=\"zg_hrsr_ladder\">in&nbsp;<a href=\"https://www.amazon.de/gp/bestsellers/books/ref=pd_zg_hrsr_b_1_1\">B&uuml;cher</a> &gt; <a href=\"https://www.amazon.de/gp/bestsellers/books/287480/ref=pd_zg_hrsr_b_1_2\">Krimis & Thriller</a> &gt; <b><a href=\"https://www.amazon.de/gp/bestsellers/books/419954031/ref=pd_zg_hrsr_b_1_3_last\">Deutschland</a></b></span>";

// The second argument of Jsoup.parse(String, String) is a base URI, not a charset.
Document htmlDocument = Jsoup.parse(source);

// Select only the anchors nested inside the category span.
Elements category = htmlDocument.select("span.zg_hrsr_ladder a");

// Print the visible text of each anchor.
category.forEach(aElement -> {
    System.out.println(aElement.text());
});

Output

Bücher
Krimis & Thriller
Deutschland
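
If you need the category names as data rather than printed lines, the same selector can feed a list. The following is only a sketch, assuming jsoup 1.10.2 or later is on the classpath (for Elements.eachText()); the class name CategoryLadder and the method categoryNames are made up for illustration and are not part of jsoup.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CategoryLadder {

    // Hypothetical helper: returns the text of every <a> inside the category span,
    // in document order. Returns an empty list if the span is not present.
    static List<String> categoryNames(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("span.zg_hrsr_ladder a");
        return links.eachText();
    }

    public static void main(String[] args) {
        String html = "<span class=\"zg_hrsr_ladder\">in&nbsp;"
                + "<a href=\"https://www.amazon.de/gp/bestsellers/books/ref=pd_zg_hrsr_b_1_1\">B&uuml;cher</a> &gt; "
                + "<a href=\"https://www.amazon.de/gp/bestsellers/books/287480/ref=pd_zg_hrsr_b_1_2\">Krimis & Thriller</a> &gt; "
                + "<b><a href=\"https://www.amazon.de/gp/bestsellers/books/419954031/ref=pd_zg_hrsr_b_1_3_last\">Deutschland</a></b>"
                + "</span>";

        // Join the names back into a single breadcrumb-style string.
        System.out.println(String.join(" > ", categoryNames(html)));
    }
}
```

Returning a List<String> keeps the parsing separate from whatever the crawler does next with the names, for example storing them next to the sales rank.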