Vishal Zanzrukia - 5 months ago
Java Question

Not able to achieve something with Jsoup HTML parser Java

I am not able to parse some text in the following scenarios using the Jsoup Java library.

1 :

This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text
.

Expected output :
some other <b> </b> text as well <b></b>


2 :
This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text
.

Expected output :
some other <b> </b> text as well <b></b>


3 :
This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>
.

Expected output :
some other <b> </b> text as well <b></b>


As you may have noticed, the text My Text is fixed (static), but the text of the next non-empty <b> tag may vary (a tag containing only whitespace counts as empty). The solution should extract the text between <b>My Text</b> and the first non-empty <b> tag that follows it.

I am using the Jsoup library but have not been able to produce the expected output above. The solution must work for every scenario, because the input is dynamic in my case.

Answer

A simple solution could look like this:

  • find the <b> element you are interested in (the one with the text you are looking for)
  • iterate over the siblings placed after it and print them until you find a non-empty <b>

You just need to remember that Jsoup stores everything in the document as a Node (including text which doesn't belong to any tag), while the Element class (which extends Node) represents only tags.

So for instance text like "before <b>bold</b> after<i>italic</i>" will be represented as

<node>before</node>
<element tag="b">
   <node>bold</node>
</element>
<node>after</node>
<element tag="i">
   <node>italic</node>
</element>

So if, for instance, you select("b") (which will find <element tag="b">) and call nextElementSibling(), it will move you to <element tag="i">. To get <node>after</node> you will need to use nextSibling(), which doesn't skip plain text nodes.
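
To make the difference concrete, here is a minimal sketch using the sample markup from above. The class and method names (SiblingDemo, nextElementName, nextNodeName) are made up for illustration; the Jsoup calls themselves are the ones named in the text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class SiblingDemo {
    // Tag name of the next sibling *element* after the first <b>; text nodes are skipped.
    static String nextElementName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Element next = b.nextElementSibling();
        return next == null ? null : next.tagName();
    }

    // nodeName() of the next sibling *node* after the first <b>; text nodes are included.
    static String nextNodeName(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Node next = b.nextSibling();
        return next == null ? null : next.nodeName();
    }

    public static void main(String[] args) {
        String html = "before <b>bold</b> after<i>italic</i>";
        System.out.println(nextElementName(html)); // the <i> element
        System.out.println(nextNodeName(html));    // the text node holding " after"
    }
}
```

Jsoup reports text nodes under the node name "#text", which is why a check such as nodeName().equals("b") only ever matches real <b> elements.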

A possible problem with the Node class is that it doesn't provide a text() method, which would return the textual content of the current node (and so let us test whether the current node/element has any text). But nothing stops us from casting a Node which holds a tag to Element, which does provide such a method.
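
One way to make that cast safe is to check the runtime type first. The following is only a sketch; NodeText and isNonEmptyTag are hypothetical names, and whitespace-only content is treated as empty, as the question asks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class NodeText {
    // Hypothetical helper: true only for an Element named tagName whose text is not blank.
    static boolean isNonEmptyTag(Node node, String tagName) {
        if (!(node instanceof Element)) {
            return false; // text nodes, comments etc. are not tags
        }
        Element el = (Element) node; // safe: type checked above
        return el.tagName().equals(tagName) && !el.text().trim().isEmpty();
    }

    public static void main(String[] args) {
        Element body = Jsoup.parse("<b> </b><b>x</b>plain text").body();
        for (Node child : body.childNodes()) {
            System.out.println(child.nodeName() + " -> " + isNonEmptyTag(child, "b"));
        }
    }
}
```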

So our solution could look like:

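import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public static String findFragment(String html, String fixedStart) {

    Document doc = Jsoup.parse(html);
    // First <b> whose whole text equals fixedStart (quoted so it is matched literally)
    Element myBTag = doc
            .select("b:matches(^" + Pattern.quote(fixedStart) + "$)")
            .first();
    if (myBTag == null)
        return ""; // no matching <b> in the input

    StringBuilder sb = new StringBuilder();

    // Walk all following siblings, plain text nodes included
    Node currentSibling = myBTag.nextSibling();
    while (currentSibling != null) {
        if (currentSibling.nodeName().equals("b")) {
            Element b = (Element) currentSibling;
            // Stop before the first <b> with non-blank text, so it is not part of the result
            if (!b.text().trim().isEmpty())
                break;
        }
        sb.append(currentSibling.toString());
        currentSibling = currentSibling.nextSibling();
    }

    return sb.toString();
}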
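
To try the approach against the three scenarios from the question, a small self-contained harness might look like the one below. It is a sketch, not the only way to do it: the names FragmentDemo and between are hypothetical, the start tag is matched by comparing text() instead of using a selector regex, and pretty-printing is switched off on the assumption that you want whitespace emitted as it was parsed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class FragmentDemo {
    // Hypothetical helper: markup after the first <b> whose text equals fixedStart,
    // up to (but not including) the next <b> with non-blank text.
    static String between(String html, String fixedStart) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false); // assumption: keep whitespace as parsed
        StringBuilder sb = new StringBuilder();
        for (Element b : doc.select("b")) {
            if (!b.text().equals(fixedStart)) continue;
            for (Node n = b.nextSibling(); n != null; n = n.nextSibling()) {
                boolean nonEmptyB = n instanceof Element
                        && ((Element) n).tagName().equals("b")
                        && !((Element) n).text().trim().isEmpty();
                if (nonEmptyB) break; // stop before the first non-empty <b>
                sb.append(n.outerHtml());
            }
            break; // only the first matching <b> is used
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text",
            "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text",
            "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>"
        };
        for (String input : inputs) {
            System.out.println("[" + between(input, "My Text") + "]");
        }
    }
}
```

If no <b> with the given text exists, between returns an empty string instead of throwing.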