user3341332 - 3 months ago 59
Java Question

Can I include white space between all HTML text() elements in Jsoup?

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but it joins the text of adjacent elements, which breaks my keyword searches against the string.

This is the Java code I am using:

String resultText = scrapePage(htmldoc);

private String scrapePage(Document doc) {
    // Take the root <html> element and return its combined text.
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}


Run against the following HTML:

<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>


Outputting resultText gives "Titlehere is para1here is para2", meaning I can't search for the word "para1" because it only appears as part of "para1here".

I don't want to split the document into more elements than necessary (for example, getting the text of every h1 and p element separately), as there is such a wide range of tags I could be matching. For example, data1data2 would come from:

<td>data1</td><td>data2</td>


Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
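
To illustrate the whitespace collapsing and whole-word search described above, here is a minimal sketch using only the Java standard library. The class name KeywordText and its two helper methods are made up for illustration, and it assumes keywords consist of letters and digits, since the \b word boundary only behaves this way for such keywords.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class KeywordText {

    // Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True only if the keyword occurs as a whole word,
    // so "para1" does not match inside "para1here".
    static boolean containsWord(String text, String keyword) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String joined = "Titlehere is para1here is para2";
        String spaced = normalize("Title \n here is para1\n\n here   is para2 ");

        System.out.println(spaced);                         // Title here is para1 here is para2
        System.out.println(containsWord(joined, "para1"));  // false
        System.out.println(containsWord(spaced, "para1"));  // true
    }
}
```

A plain String.contains("para1") would also match the joined string, which is why the sketch checks word boundaries instead.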

Answer

I don't have this issue using jsoup 1.7.3.

Here's the full code I used for testing:

final String html = "<html>\n"
        + "  <body>\n"
        + "    <h1>Title</h1>\n"
        + "    <p>here is para1</p>\n"
        + "    <p>here is para2</p>\n"
        + "  </body>\n"
        + "</html>";

Document doc = Jsoup.parse(html);

Element element = doc.select("html").first();

System.out.println(element.text());

And the output:

Title here is para1 here is para2

Can you run my code? Also, update to a newer version of jsoup if you don't have 1.7.3 yet.
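
If upgrading doesn't help, one possible fallback is to walk the text nodes yourself and put a space after each one. This is only a sketch, assuming jsoup is on the classpath; the class name SpacedText and the method spacedText are hypothetical names, not part of jsoup. Both visitor methods are implemented so that it compiles against older and newer jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper: joins all text nodes with a space between them.
final class SpacedText {

    static String spacedText(Document doc) {
        final StringBuilder sb = new StringBuilder();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Append each text node followed by a space, whatever tag encloses it.
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text()).append(' ');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        // Collapse the extra whitespace into single spaces.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<table><tr><td>data1</td><td>data2</td></tr></table>");
        System.out.println(spacedText(doc)); // data1 data2
    }
}
```

Note the trade-off: this also puts a space between inline elements, so <b>key</b>word would come out as "key word". For a keyword string that may be acceptable, but it is a difference from text().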
