M.Mac - 1 month ago
Java Question

How to extract elements from a String with jsoup?

I want to write a small piece of code that will extract the "Kategorie" (category) out of an href with jsoup.

    <a href="/wiki/Kategorie:Herrscher_des_Mittelalters" title="Kategorie:Herrscher des Mittelalters">Herrscher des Mittelalters</a>


In this case I am searching for "Herrscher des Mittelalters".

My code reads the first line of a .txt file with a BufferedReader.

    BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));

    Document doc = Jsoup.parse(r.readLine());
    Element elem = doc;


I know there are methods to get the href link, but I don't know of any that search for parts of the link.

Any suggestions?

Additional information: My .txt file contains full Wikipedia HTML pages.

Answer

This should get you the title attribute of every link. You can split the titles further as needed:

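    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    Document d = Jsoup.parse("<a href=\"/wiki/Kategorie:Herrscher_des_Mittelalters\" title=\"Kategorie:Herrscher des Mittelalters\">Herrscher des Mittelalters</a>");

    // Select every anchor element in the parsed document.
    Elements links = d.select("a");

    // Collect the non-empty title attributes; the set drops duplicates.
    Set<String> categories = new HashSet<>();
    for (Element link : links) {
        String title = link.attr("title");
        if (!title.isEmpty()) {
            categories.add(title);
        }
    }

    System.out.println(categories);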
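
Splitting the titles further could look roughly like the sketch below. It uses only the standard library and assumes every category title starts with the literal prefix "Kategorie:", as in the example link. The class name `CategoryTitles`, its two helper methods, and the sample title "Hauptseite" are made up for illustration and are not part of jsoup.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jsoup: post-processes the collected title strings.
class CategoryTitles {
    // Assumption: category titles carry this prefix, as in the question's example.
    static final String PREFIX = "Kategorie:";

    // Returns the text after "Kategorie:", or null if the title is not a category title.
    static String categoryName(String title) {
        if (title == null || !title.startsWith(PREFIX)) {
            return null;
        }
        return title.substring(PREFIX.length()).trim();
    }

    // Keeps only category titles and strips their prefix, preserving input order.
    static Set<String> categoryNames(Iterable<String> titles) {
        Set<String> names = new LinkedHashSet<>();
        for (String title : titles) {
            String name = categoryName(title);
            if (name != null && !name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample input: one category title and one non-category title.
        List<String> titles = Arrays.asList(
                "Kategorie:Herrscher des Mittelalters",
                "Hauptseite");
        System.out.println(categoryNames(titles)); // prints [Herrscher des Mittelalters]
    }
}
```

Passing the `categories` set from the snippet above to `categoryNames` would leave only the bare category names, since titles of ordinary article links are skipped.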