mcNogard - 5 months ago
HTML Question

Extract all visible text from html

I am trying to create a search function in Google Chrome. Given a string, it will highlight all areas containing this string. I am using Java.

To do this, I first need to extract all visible text. I have tried to analyze HTML pages in order to figure out how to extract only the text.

I planned on using jsoup for this, but I am not sure how to extract text from sections that look like the one below. (This is a YouTube comment with a "read more" link and a "show less" link.)

From this section, I want to extract "Not gonna lie, dat dog is ADORABLE" and either "Les mer" or "Vis mindre", depending on which of them is visible.

<div class="comment-renderer-text" tabindex="0" role="article">
<div class="comment-renderer-text-content">Not gonna lie, dat dog is ADORABLE</div>
<div class="comment-text-toggle hid">
<div class="comment-text-toggle-link read-more">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Les mer
</span>
</button>
</div>
<div class="comment-text-toggle-link show-less hid">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Vis mindre
</span>
</button>
</div>
</div>
</div>

Jop
Answer

I am going to assume that the HTML given is already parsed into a jsoup Document named doc.

String text = doc.select("div.comment-renderer-text-content").first().text();

doc.select returns the Elements that match the given CSS selector. first() then takes the first match, and text() returns its text content.

More can be read in the jsoup Selector documentation.
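For reference, here is a minimal self-contained sketch of the same call, including the parsing step. The class name CommentTextExample and the helper firstText are made up for illustration, and the HTML is a shortened copy of the snippet from the question. Since first() returns null when nothing matches, the sketch checks for that instead of calling text() directly.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CommentTextExample {
    // Hypothetical helper: text of the first element matching cssQuery,
    // or an empty string when no element matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(cssQuery).first(); // null when nothing matches
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<div class=\"comment-renderer-text\">"
                + "<div class=\"comment-renderer-text-content\">Not gonna lie, dat dog is ADORABLE</div>"
                + "</div>";
        // prints "Not gonna lie, dat dog is ADORABLE"
        System.out.println(firstText(html, "div.comment-renderer-text-content"));
    }
}
```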

Edit:

You can use this code to get all of the text in the body rather than selecting per class:

String text = doc.body().text();
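One caveat: jsoup only parses the markup and does not apply CSS or run JavaScript, so body().text() also returns text from elements that the page hides. Below is a rough sketch of one way to work around that, assuming the page hides elements through a known class (the markup in the question suggests "hid", but that is a guess about YouTube's stylesheet). The class name VisibleTextExample and the helper textWithoutHidden are made up for illustration. Hiding done through inline styles, other classes, or scripts is not covered.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class VisibleTextExample {
    // Hypothetical helper. Assumption: every element carrying hiddenClass
    // is hidden by the page's stylesheet, so its text should be skipped.
    static String textWithoutHidden(String html, String hiddenClass) {
        Document doc = Jsoup.parse(html);
        // Remove the matching elements, together with everything inside them.
        doc.select("." + hiddenClass).remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<div class=\"comment-renderer-text-content\">Not gonna lie, dat dog is ADORABLE</div>"
                + "<div class=\"comment-text-toggle hid\">"
                + "<span class=\"yt-uix-button-content\">Les mer</span>"
                + "</div>";
        // Plain body().text() still includes the hidden "Les mer".
        System.out.println(Jsoup.parse(html).body().text());
        // prints "Not gonna lie, dat dog is ADORABLE"
        System.out.println(textWithoutHidden(html, "hid"));
    }
}
```

For the search feature in the question, a real browser decides visibility from computed styles, so this class-based filter is only an approximation.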