user3341332 - 1 year ago
Java Question

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return it as a single string without any HTML. The code I am using half works, but it joins the text of adjacent elements with no separator, which breaks my keyword searches against the string.

This is the Java code I am using:

String resultText = scrapePage(htmldoc);

private String scrapePage(Document doc) {
    // Take the root <html> element and return all of its text
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}

Run against the following HTML:

<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>

Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".

I don't want to split the document into more elements than necessary (for example, getting the text of all h1 and p elements separately), as there is such a wide range of tags I could be matching (e.g. data1data2 would come from):


Is there a way I can get all the text from the page but also include a space between the tags? I don't want to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
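The whitespace collapsing and keyword lookup described above could be sketched roughly as follows. This uses only the Java standard library and works on whatever string Jsoup returns; the class name KeywordText and the helpers normalize and words are hypothetical names chosen for illustration, not part of Jsoup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of Jsoup.
public class KeywordText {

    // Collapse every run of whitespace (spaces, tabs, line breaks) to a single space.
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Split the normalized text into a set of lower-cased words for exact keyword lookups.
    public static Set<String> words(String text) {
        String normalized = normalize(text).toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(normalized.split(" ")));
    }

    public static void main(String[] args) {
        String joined = "Titlehere is para1here is para2";       // elements run together
        String spaced = "Title  here is para1\n here is para2";  // elements separated

        System.out.println(normalize(spaced));
        System.out.println(words(joined).contains("para1"));
        System.out.println(words(spaced).contains("para1"));
    }
}
```

With the joined string, "para1" only exists as part of "para1here", so an exact word lookup fails; once the elements are separated by any whitespace, the same lookup succeeds.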

Answer Source

I don't have this issue using JSoup 1.7.3.

Here's the full code I used for testing:

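import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final String html = "<html>\n"
        + "  <body>\n"
        + "    <h1>Title</h1>\n"
        + "    <p>here is para1</p>\n"
        + "    <p>here is para2</p>\n"
        + "  </body>\n"
        + "</html>";

Document doc = Jsoup.parse(html);

// Select the root <html> element and print its combined text
Element element = doc.select("html").first();
System.out.println(element.text());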

And the output:

Title here is para1 here is para2

Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.