Bill Bill - 2 months ago 9
Java Question

Using Jsoup to extract HTML text returns an unexpected result

I'm trying to write a web spider that extracts the text from an HTML page, and I use Jsoup to parse the HTML. The simple code is below:

File file = new File("test2.html");
Document doc = Jsoup.parse(file, "utf-8");
System.out.println(doc.select("body").text());


The contents of test2.html were posted as an image, which is not reproduced here.

The output is:


hellothis is a simple testtest link
<!-- test here -->
<ul>
<li>test1</li>
<li>test2</li>
<li>test3</li>
<li>test4</li>
<li>test5</li>
<li>test6</li>
</ul>

It seems that Jsoup treats the markup inside the textarea as plain text.
How can I remove all the HTML tags and keep only the real text?

Answer

As fairjm has pointed out, this is the expected behavior.

If you inspect the textarea element with jsoup, you will find that its content is plain text, with no child elements.
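A minimal sketch of such an inspection, assuming a small inline HTML string as a stand-in for test2.html (the class name, method names, and sample markup are hypothetical and not from the original post):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class TextareaInspect {
    // Hypothetical stand-in for test2.html, which is not reproduced in the post.
    static final String SAMPLE_HTML =
        "<html><body><p>hello</p>"
        + "<textarea><ul><li>test1</li><li>test2</li></ul></textarea>"
        + "</body></html>";

    // Number of child *elements* jsoup finds inside the first textarea.
    static int childElementCount(String html) {
        Document doc = Jsoup.parse(html);
        Element textarea = doc.select("textarea").first();
        return textarea.children().size();
    }

    // The text jsoup reports for the first textarea.
    static String textareaText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("textarea").first().text();
    }

    public static void main(String[] args) {
        System.out.println("child elements: " + childElementCount(SAMPLE_HTML));
        System.out.println("text: " + textareaText(SAMPLE_HTML));
    }
}
```

With jsoup on the classpath, this should report no child elements and print the list markup as literal text, because HTML parsers treat the content of a textarea as text rather than as nested elements.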

Solution

If you really just want to strip all tags, even those placed intentionally inside a textarea, parse the content twice. (Side note: double parsing costs some performance, but it should have no other drawbacks when only the text is needed.)

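File file = new File("test2.html");
Document doc = Jsoup.parse(file, "utf-8");
// First pass: body text, still containing the literal markup from the textarea.
String bodyText = doc.select("body").text();
// Second pass: parse that text as HTML again (the String overload takes a base URI, not a charset).
System.out.println(Jsoup.parse(bodyText).text());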

Output

hellothis is a simple testtest link test1 test2 test3 test4 test5 test6
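
The same double-parse idea can be wrapped in a small helper that works on an HTML string instead of a file. This is only a sketch under that assumption; the class name `DoubleParse`, the method name `stripAllTags`, and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;

final class DoubleParse {
    // Parses the input twice so that markup held as text (for example
    // inside a textarea) is also reduced to plain text.
    static String stripAllTags(String html) {
        // First pass: real tags are removed, textarea markup survives as text.
        String firstPass = Jsoup.parse(html).body().text();
        // Second pass: the surviving markup is parsed and stripped as well.
        return Jsoup.parse(firstPass).body().text();
    }

    public static void main(String[] args) {
        // Hypothetical input, loosely modelled on the question's example.
        String html = "<p>hello</p>"
            + "<textarea><!-- test here --><ul><li>test1</li><li>test2</li></ul></textarea>";
        System.out.println(stripAllTags(html));
    }
}
```

Note that the second pass strips anything that looks like a tag, so text that was deliberately escaped in the source page is removed too.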