Trin Trin - 3 years ago 154
HTML Question

Jsoup - Parsing Selected Elements

I need to parse the HTML content below using the Jsoup parser.
The requirement is to eliminate a few tags and get the output shown below.
I am not able to get the desired output with the code below.

Input :



<html>

<head>
<style type="text/css">
body {
font: 12px Arial, Helvetica, sans-serif
}

tr {
font: 12px Arial, Helvetica, sans-serif;
padding: 0px 0px 0px 10px
}
</style>
</head>

<body>

<p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>
<img id="logo_GMALE.png" alt="logo GMALE" src="https://www.GMALE.ch/logo.png">

<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>

<a href="mailto:output.test@GMALE.in">output.test@GMALE.in</a><br><br><b>Département Team</b><br><br><b>GMALE Assurances</b><br>StreetName 2<br>Postbox 2100<br>Country<br><br>GMALE.ch<br><br>This is a private email contents.<br><br>This e-mail transmission
is intended for the named addressee(s) only. Its contents are private, confidential and protected from disclosure and should not be read, copied or disclosed by any other person. If you are not the intended recipient, we kindly ask you to notify the
sender immediately and to delete this e-mail.<br><br>


</body>
</html>





Output:



<p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>

<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>





Code done so far is below:

Document doc = Jsoup.parse(content);
List<Node> childNodes = doc.select("body").get(0).childNodes();
System.out.println("Elements : " + childNodes);
StringBuilder finalContent = new StringBuilder();
for (Node node : childNodes) {
    if (node instanceof Element) {
        Element subElement = (Element) node;
        if (!subElement.tagName().equals("img")) {
            finalContent.append(subElement);
        }
    } else {
        TextNode textNode = (TextNode) node;
        if (!textNode.getWholeText().startsWith("<a")) {
            finalContent.append(textNode);
        }
    }
}

Answer

Your problem can be restated as follows: parse the body of the HTML and extract everything up to the <a href="mailto:output.test@GMALE.in"> element. Seen from this angle, you can try the following approach:

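import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

final Document doc = Jsoup.parse(content);
// Direct children of <body>, skipping <img>
final Elements elements = doc.select("body > *:not(img)");
final Iterator<Element> iterator = elements.iterator();
final StringBuilder finalContent = new StringBuilder();

Element current;
// Stop at the end of the list or at the first <a> element
while (iterator.hasNext() && !(current = iterator.next()).tagName().equals("a")) {
    finalContent.append(current.toString());
    // Keep bare text that directly follows the element (e.g. the phone number)
    final Node sibling = current.nextSibling();
    if (sibling instanceof TextNode) {
        final String siblingText = ((TextNode) sibling).text().trim();
        if (!siblingText.isEmpty()) {
            finalContent.append(siblingText);
        }
    }
}

System.out.println(finalContent);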

First, we select all direct children of <body> except <img> with the selector body > *:not(img). Then we iterate over the selected elements until we reach the end of the list or the first <a> element. For each element we also check whether it is followed by a sibling text node with non-blank content. This covers the phone number, which is not wrapped in any HTML tag and is a sibling of one of the <br> tags.

Running this example generates the following output:

<p>hello,<br>&nbsp;<br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br><br><b>Test abc xyz</b><br><br>T +91 98 471<br>
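
The sibling lookup is needed because a CSS selector only ever matches elements, never bare text nodes. The following minimal sketch illustrates this on a shortened input. The class name SelectorDemo, the helper selectedTags and the sample HTML are made up for illustration and assume jsoup is on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Returns the tag names of all direct children of <body> except <img>.
    static List<String> selectedTags(String html) {
        List<String> tags = new ArrayList<>();
        for (Element el : Jsoup.parse(html).select("body > *:not(img)")) {
            tags.add(el.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the input from the question
        String html = "<body><p>hi</p><img src=\"x.png\"><br>T +91 98 471 <br>"
                + "<a href=\"#\">link</a></body>";
        // The phone number is a text node, so it does not show up here
        System.out.println(selectedTags(html)); // prints [p, br, br, a]
    }
}
```

The text "T +91 98 471" is absent from the result, which is why it has to be picked up separately through the neighbouring element.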

Of course, you can define a different stop rule for the iteration; this example is only meant as a hint. I hope it helps.
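
As one possible variation, the stop rule can be passed in as a predicate while walking the child nodes of <body> directly, so that elements and text nodes are handled in a single loop. This is only a sketch under assumptions: the class BodyExtractor, the method extractUntil and the sample HTML are hypothetical names, and text nodes are appended as decoded text without re-escaping.

```java
import java.util.function.Predicate;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BodyExtractor {

    // Collects body content until stopRule matches a direct child element.
    // <img> elements are skipped and whitespace-only text nodes are dropped.
    static String extractUntil(String html, Predicate<Element> stopRule) {
        Element body = Jsoup.parse(html).body();
        StringBuilder out = new StringBuilder();
        for (Node node : body.childNodes()) {
            if (node instanceof Element) {
                Element el = (Element) node;
                if (stopRule.test(el)) {
                    break; // stop rule reached, ignore everything after it
                }
                if (!el.tagName().equals("img")) {
                    out.append(el.outerHtml());
                }
            } else if (node instanceof TextNode) {
                // Decoded text, trimmed; not re-escaped as HTML
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    out.append(text);
                }
            }
            // Other node types (comments etc.) are ignored
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the input from the question
        String html = "<body><p>hello</p><img src=\"logo.png\"><br><b>Test</b><br>"
                + "T +91 98 471 <br><a href=\"mailto:x@example.com\">x@example.com</a>"
                + "<br>Footer</body>";
        // Stop at the first <a>; any other rule can be swapped in,
        // e.g. el -> el.hasAttr("href")
        System.out.println(extractUntil(html, el -> el.tagName().equals("a")));
    }
}
```

Compared with the selector-based loop, this version never needs nextSibling(), at the cost of checking the node type on every iteration.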
