Gastón Schabas - 5 days ago
Java Question

jsoup: join some nodes and wrap them in an element

I'm new to jsoup. I'm trying to modify the following example:

<div>
text that <strong>need</strong> to be <strong>wrapped</strong>
<p>a text that has to be ignored</p>
another text that <strong>need</strong> to be <strong>wrapped</strong>
</div>


to obtain this

<div>
<p>text that <strong>need</strong> to be <strong>wrapped</strong></p>
<p>a text that has to be ignored</p>
<p>another text that <strong>need</strong> to be <strong>wrapped</strong></p>
</div>


So I need to wrap every run of text that is not already inside a <p> in a new <p>.

I've tried something like this:

Document doc = Jsoup.parse(html);
doc.body().traverse(new NodeVisitor() {
    @Override
    public void head(Node node, int depth) {
        if (node instanceof TextNode && Arrays.asList("div", "body").contains(node.parentNode().nodeName())) {
            Node auxNode = node;
            node.replaceWith(pNode);
            node.childNodes();

            while (auxNode.nextSibling() != null && Arrays.asList("em", "strong").contains(auxNode.nextSibling().nodeName())) {
                node.after(auxNode);
                auxNode.remove();
                auxNode = node.nextSibling();
            }
            node.wrap("<p></p>");
        }
    }

    @Override
    public void tail(Node node, int depth) { }
});


But I just keep getting a NullPointerException in the while condition.

Thanks in advance

java.lang.NullPointerException
at HTMLToArticleParser$1.head(HTMLToArticleParser.java:52)
at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:31)
at org.jsoup.nodes.Node.traverse(Node.java:536)
at HTMLToArticleParser.parse(HTMLToArticleParser.java:47)
at HTMLToArticleParser_Tests.jTest(HTMLToArticleParser_Tests.java:188)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:117)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:262)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:84)

Answer

Thanks to everyone. I was able to solve it like this:

class NewNode

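import java.util.List;

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Tag;

public class NewNode {

    // The <p> element that will wrap the collected nodes.
    private Element newElement = new Element(Tag.valueOf("p"), "");
    private List<Node> childs;

    public NewNode(List<Node> childs) {
        this.childs = childs;
    }

    // Appends a copy of every collected node to the <p>; the originals stay in the document.
    // Intended to be called once: each call appends the copies again.
    public Node getNewNode() {
        childs.forEach(child -> newElement.appendChild(child.clone()));
        return newElement;
    }

}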

class NodesToProcess

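import java.util.List;

import org.jsoup.nodes.Node;

public class NodesToProcess {

    // First node of the run; it gets replaced by the new <p>.
    private Node oldNode;
    // Builder for the <p> that holds copies of the whole run.
    private NewNode newNode;
    // Remaining originals of the run, removed after the replacement.
    private List<Node> toRemove;

    public NodesToProcess(Node oldNode, NewNode newNode, List<Node> toRemove) {
        this.oldNode = oldNode;
        this.newNode = newNode;
        this.toRemove = toRemove;
    }

    public Node getOldNode() {
        return oldNode;
    }

    public Node getNewNode() {
        return newNode.getNewNode();
    }

    public List<Node> getToRemove() {
        return toRemove;
    }

}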
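A note on why the two classes above need a toRemove list at all: NewNode appends child.clone(), so the originals stay where they were and have to be removed by hand afterwards. As far as I can tell, appending the node itself makes jsoup detach it from its old parent first. Here is a small sketch of the difference; the class and method names (CloneVersusMove, wrapCopy, wrapOriginal) are made up for illustration and are not part of the answer's code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Tag;

// Hypothetical demo class, not part of the original answer.
class CloneVersusMove {

    /** Appends a copy of node to a new <p>; the original stays under its parent. */
    static Element wrapCopy(Node node) {
        Element p = new Element(Tag.valueOf("p"), "");
        p.appendChild(node.clone());
        return p;
    }

    /** Appends node itself to a new <p>; jsoup detaches it from its old parent. */
    static Element wrapOriginal(Node node) {
        Element p = new Element(Tag.valueOf("p"), "");
        p.appendChild(node);
        return p;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><strong>wrapped</strong></div>");
        Element div = doc.select("div").first();
        Node strong = div.childNode(0);

        wrapCopy(strong);
        System.out.println(div.childNodeSize()); // original is still a child of the div

        wrapOriginal(strong);
        System.out.println(div.childNodeSize()); // original has moved into the new <p>
    }
}
```

Either style works; cloning keeps the source tree untouched until the explicit replace and remove step, while moving saves the bookkeeping.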

And this method is the one that wraps the text that is not already wrapped:

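// Needs java.util.ArrayList, java.util.List and org.jsoup.nodes.{Element, Node, TextNode}.
private void wrapUnwrappedTextInTagP(Element element) {
    List<NodesToProcess> nodesToProcesses = new ArrayList<>();
    List<Node> nodeAlreadyUsed = new ArrayList<>();

    // First pass: only collect the runs of loose nodes, without touching the tree.
    element.childNodes().forEach(node -> {
        if (node instanceof TextNode && !nodeAlreadyUsed.contains(node)) {
            List<Node> newChilds = new ArrayList<>();
            List<Node> toRemove = new ArrayList<>();

            // A run starts at a text node that is not yet part of an earlier run.
            newChilds.add(node);
            nodeAlreadyUsed.add(node);
            Node auxNode = node.nextSibling();

            // Extend the run over the following siblings; the null check ends it at the last child.
            // parentIsBodyAndIsAnTextElement is a helper that is not shown here.
            while (auxNode != null && parentIsBodyAndIsAnTextElement(auxNode)) {
                newChilds.add(auxNode);
                nodeAlreadyUsed.add(auxNode);
                toRemove.add(auxNode);
                auxNode = auxNode.nextSibling();
            }
            nodesToProcesses.add(new NodesToProcess(node, new NewNode(newChilds), toRemove));
        }
    });

    // Second pass: replace the first node of each run with the new <p>, then drop the rest of the run.
    nodesToProcesses.forEach(nodesToProcess -> {
        nodesToProcess.getOldNode().replaceWith(nodesToProcess.getNewNode());
        nodesToProcess.getToRemove().forEach(node -> node.remove());
    });
}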
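The helper parentIsBodyAndIsAnTextElement is not shown in the answer. Going only by its name, a guess at what it might look like is below. Everything in it is an assumption: the holder class InlineNodeCheck is hypothetical, and the list of inline tags ("em", "strong") is borrowed from the question's first attempt.

```java
import java.util.Arrays;
import java.util.List;

import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical holder class; the original answer does not show this helper.
final class InlineNodeCheck {

    // Assumption: the same inline tags the question's attempt looked for.
    private static final List<String> INLINE_TAGS = Arrays.asList("em", "strong");

    /** True when node is a direct child of <body> and is a text node or an inline tag. */
    static boolean parentIsBodyAndIsAnTextElement(Node node) {
        Node parent = node.parentNode();
        boolean parentIsBody = parent != null && "body".equals(parent.nodeName());
        boolean isTextLike = node instanceof TextNode || INLINE_TAGS.contains(node.nodeName());
        return parentIsBody && isTextLike;
    }
}
```

Note that a check like this only accepts direct children of <body>. The question's first attempt accepted "div" parents as well, so the parent test may need widening if the loose text sits inside a <div>, as in the sample HTML.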

Then, in the main method:

Document doc = Jsoup.parse(html);
wrapUnwrappedTextInTagP(doc.body());
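For comparison, here is a more compact sketch of the same idea as one self-contained class. It is not the code above but a variation under a few assumptions: it moves the nodes instead of cloning them, it is called on the container that holds the loose text (the <div> in the sample), it lets a run start at an inline tag as well as at a text node, and it leaves whitespace-only runs alone. WrapLooseText, wrapLooseText and INLINE_TAGS are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;

// Hypothetical class, an alternative sketch rather than the answer's implementation.
class WrapLooseText {

    // Assumption: inline tags that may be part of a run of loose text.
    private static final List<String> INLINE_TAGS = Arrays.asList("em", "strong");

    private static boolean isInline(Node node) {
        return node instanceof TextNode || INLINE_TAGS.contains(node.nodeName());
    }

    private static boolean isBlankText(Node node) {
        return node instanceof TextNode && ((TextNode) node).isBlank();
    }

    /** Wraps every run of loose text and inline children of container in a new <p>. */
    static void wrapLooseText(Element container) {
        // Phase 1: group the children into runs, iterating over a snapshot of the child list.
        List<List<Node>> runs = new ArrayList<>();
        List<Node> current = new ArrayList<>();
        for (Node child : new ArrayList<>(container.childNodes())) {
            if (isInline(child)) {
                current.add(child);
            } else if (!current.isEmpty()) {
                runs.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            runs.add(current);
        }

        // Phase 2: change the tree only after all runs are known.
        for (List<Node> run : runs) {
            if (run.stream().allMatch(WrapLooseText::isBlankText)) {
                continue; // nothing but whitespace between block elements
            }
            Element p = new Element(Tag.valueOf("p"), "");
            run.get(0).before(p);        // put the empty <p> where the run starts
            for (Node node : run) {
                p.appendChild(node);     // appendChild moves the node out of the container
            }
        }
    }

    public static void main(String[] args) {
        String html = "<div>\n"
                + "text that <strong>need</strong> to be <strong>wrapped</strong>\n"
                + "<p>a text that has to be ignored</p>\n"
                + "another text that <strong>need</strong> to be <strong>wrapped</strong>\n"
                + "</div>";
        Element div = Jsoup.parse(html).select("div").first();
        wrapLooseText(div);
        System.out.println(div.outerHtml());
    }
}
```

Collecting the runs first and mutating afterwards follows the same two-step shape as wrapUnwrappedTextInTagP, so the child list is never changed while it is being walked.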