David Goldfarb - 6 months ago
jQuery Question

jQuery - Extract just text from complex html page

My jQuery AJAX client is interfacing with an API which usually returns JSON but, in some situations, returns a complex, pretty, human-readable web page.

For debugging purposes, I'd like to log these cases, saving just the text from the page. I thought this would be a trivial $(result).text(), but that seems to retain a lot of the non-text content too, particularly the contents of stylesheets and scripts.

For example:

$('<html>Some text<style>body { height: 100%;}</style><script type="text/javascript">function f() { return 42;}</script>some more text</body></html>').text();


gives

"Some textbody { height: 100%;}function f() { return 42;}some more text"


where I'd like to see

"Some textsome more text"


Second example (edited in later), because this needs to search recursively:

<html>abc<script>f=3;</script><div>def<script>g=7</script>ghi</div></html>


Should return:

"abcdefghi"


without either f=3 or g=7.

What's the easiest way to get just the text? I don't need this to be perfect, nor to handle hairy edge cases; just not to flood my log with hundreds of lines of JavaScript and CSS.

=-=-=-=

Note: the accepted answer works in many contexts, but not all; see my comments to it. It's not clear if the problem has to do with jQuery version, something weird in Chrome extensions or, most likely, something messed up in my environment. The symptom of the failing context is that filter does not remove matching elements if they are nested inside other elements.

Answer

You can filter out the elements that you don't want to take part in text extraction, such as script and style.

Try this:

var str1 = '<html>Some text<style>body { height: 100%;}</style><script type="text/javascript">function f() { return 42;}</script>some more text</body></html>';
var str2 = '<html>abc<script>f=3;</script><div>def<script>g=7</script>ghi</div></html>';

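// Drop <script> and <style> elements from the parsed set, then read the remaining text.
// Note: .filter() only tests the elements in the set itself, not their descendants.
function extractText(htmlString){
    return $(htmlString).filter(function(i, elm){
        return !$(elm).is("script, style");
    }).text();
}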

console.log(extractText(str1)); // "Some textsome more text"
console.log(extractText(str2)); // "abcdefghi"
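
If nested script or style elements still show up in the output (the symptom described in the note on the question), two alternative sketches may help. Neither is part of the original answer, and the names extractTextDeep and stripToText are hypothetical. The first assumes jQuery 1.8 or later (for $.parseHTML) and a browser DOM; it removes matching elements at any depth before reading the text. The second is a crude regex fallback that needs no DOM at all, which may be good enough for log output.

```javascript
// Sketch 1 (assumes jQuery >= 1.8 and a browser DOM; not called in this snippet).
function extractTextDeep(htmlString) {
    // $.parseHTML returns an array of DOM nodes; by default it leaves out <script> elements.
    var $wrapper = $('<div>').append($.parseHTML(htmlString));
    // .find() searches all descendants, so nested matches are removed as well.
    $wrapper.find('script, style').remove();
    return $wrapper.text();
}

// Sketch 2 (plain JavaScript, no DOM): crude regex-based stripping for log output.
function stripToText(htmlString) {
    return htmlString
        // Remove script/style blocks together with their contents.
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
        // Remove the remaining tags but keep the text between them.
        .replace(/<[^>]+>/g, '');
}

var flat = '<html>Some text<style>body { height: 100%;}</style><script type="text/javascript">function f() { return 42;}</script>some more text</body></html>';
var nested = '<html>abc<script>f=3;</script><div>def<script>g=7</script>ghi</div></html>';

console.log(stripToText(flat));   // "Some textsome more text"
console.log(stripToText(nested)); // "abcdefghi"
```

The regex version does not decode HTML entities and can be confused by a stray "<" in ordinary text, so it is only suitable where rough output is acceptable.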