drogaleggera - 3 months ago
HTML Question

How to parse HTML text and links with Java and jsoup

I need to parse text from a webpage. The text is presented in this way:

nonClickableText= link1 link2 nonClickableText2= link1 link2


I want to convert all of it to a single string in Java. The non-clickable text should remain as it is, while the clickable text should be replaced with its actual link.

So in Java I would end up with this:

String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";


Here are some pictures: first second

Answer

What exactly are link1 and link2? According to your example

"... nonClickableText2= example3.com example4.com"

they can be different, so what would be the source besides the href?

Based on your images, the following code should give you everything you need to adapt it to your final string format. First we grab the <strong> block, then iterate over its child nodes, picking each <a> child that is directly preceded by a text node:

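import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";

Document doc = Jsoup.parse(htmlString); // can be replaced with Jsoup.connect("yourUrl").get();
StringBuilder sb = new StringBuilder();

Element container = doc.select("div>p>strong").first();

for (Node node : container.childNodes()) {
    Node previous = node.previousSibling();
    // Only handle <a> elements that directly follow a text node (also avoids a null previous sibling)
    if (node.nodeName().equals("a") && previous instanceof TextNode) {
        // Strip the literal quotes from the label, then append the link target
        sb.append(((TextNode) previous).text().replace("\"", ""));
        sb.append("= ").append(node.attr("href")).append(" ");
    }
}
// Strings are immutable, so the trimmed result has to be assigned
String parsedHTML = sb.toString().trim();

System.out.println(parsedHTML);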

Output:

notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com
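
The format in the question has several links after each label (nonClickableText= link1 link2). The loop above only emits a link that is directly preceded by text, so it does not cover that case. Here is a minimal sketch of one way to handle it, assuming the labels and links are all direct children of one container element. The class name LinkFlattener, the method flatten, and the sample HTML are made up for illustration, and the selector for the real page may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LinkFlattener {

    // Hypothetical helper: walks the direct children of the first element
    // matching cssQuery, keeps text nodes as they are and swaps every <a>
    // for the value of its href attribute.
    public static String flatten(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element container = doc.select(cssQuery).first();
        if (container == null) {
            return ""; // nothing matched the selector
        }
        StringBuilder out = new StringBuilder();
        for (Node node : container.childNodes()) {
            if (node instanceof TextNode) {
                // Non-clickable text: keep it, skipping whitespace-only nodes
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    out.append(text).append(' ');
                }
            } else if (node.nodeName().equals("a")) {
                // Clickable text: replace it with the link target
                out.append(node.attr("href")).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch (two links per label)
        String html = "<div><p><strong>label1= <a href=\"example1.com\">a</a> "
                + "<a href=\"example2.com\">b</a> label2= <a href=\"example3.com\">c</a> "
                + "<a href=\"example4.com\">d</a></strong></p></div>";
        System.out.println(flatten(html, "div > p > strong"));
        // prints: label1= example1.com example2.com label2= example3.com example4.com
    }
}
```

If absolute URLs are needed and the document was loaded with a base URI (for example via Jsoup.connect), node.attr("abs:href") can be used in place of node.attr("href").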