user3198882 - 2 months ago
HTML Question

How does a browser decide which characters in source HTML collapse to a single whitespace in rendered HTML? What about the \s regex class?

I want a function that checks whether a text node will be collapsed into a single whitespace by the browser in the rendered HTML:

function isSingleWhitespace(node) {
  // Strip every run of space, LF, CR and tab; nothing is left if the node is whitespace-only.
  var spacesCollapsed = node.textContent.replace(/[ \n\r\t]+/g, ''); // What about \s ?
  return spacesCollapsed.length === 0;
}

Which characters are collapsed to a single whitespace when the browser renders HTML? Is the \s class suitable for finding them, perhaps as part of a larger regexp?

What about stuff like
? Does \s include it? I need to account for everything that the browser does not render. A regexp is not the only acceptable solution: if the collapsing algorithm has a complex specification that a regexp alone cannot capture (like true, "hard-core" email validation), where can I find that specification? A link to an implementation, a flowchart, a list of char codes, or anything else that specifies which characters the browser collapses to a single whitespace would help, or at least what to search for. There is no need to support IE < 9.

My use case: I want to translate the caret position between rendered HTML units and HTML source units for a WYSIWYG editor built on contenteditable. When the user presses Backspace or Delete, the editor should silently skip the collapsed characters and remove the visible one.

myf

As for which characters are collapsed

The space characters in HTML5 are defined as follows:

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

So any run of consecutive characters from this group is collapsed, and leading and trailing runs are trimmed, in most cases (1). Your regexp seems fine, although it omits form feed (\f) from the list above.
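To make the difference with \s concrete, here is a minimal sketch. The helper names isHtmlWhitespaceOnly and isJsWhitespaceOnly are made up for illustration. The first helper uses only the five characters from the quoted definition. The second uses \s, which in JavaScript also matches characters outside that list, such as the no-break space (U+00A0).

```javascript
// The five HTML5 "space characters" quoted above: space, tab, LF, FF, CR.
const HTML_SPACE_ONLY = /^[ \t\n\f\r]*$/;

// Hypothetical helper: true when the text consists solely of HTML space characters.
// Note: like the function in the question, it also returns true for an empty string.
function isHtmlWhitespaceOnly(text) {
  return HTML_SPACE_ONLY.test(text);
}

// Hypothetical helper for comparison: JavaScript's \s is a broader class.
// It also matches U+00A0 (no-break space), U+FEFF and other Unicode spaces.
function isJsWhitespaceOnly(text) {
  return /^\s*$/.test(text);
}
```

With these, a string holding only a no-break space passes the \s check but not the HTML one, so \s would probably over-match for this use case.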

As for whether there is a common API to get the "rendered" content

It seems you are reading textContent, which preserves the "source" formatting.

If you used innerText instead, you would probably get what you want, provided you are in a DOM context and in a capable environment. See "The poor, misunderstood innerText" by Kangax.
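If you need offsets rather than just the rendered string, as in the caret use case, one possible approach is to collapse the source text yourself and record where each rendered character came from. The sketch below is an assumption-laden illustration, not what any browser actually runs: it assumes a single text run rendered with white-space: normal, treats only the five HTML space characters as collapsible, and the name collapseWithMap is made up.

```javascript
// Hypothetical helper: collapse whitespace roughly the way white-space: normal does
// for one isolated text run, and keep a map back to source offsets.
// Assumption: leading and trailing runs are dropped, interior runs become one space.
function collapseWithMap(source) {
  const isSpace = (ch) => ' \t\n\f\r'.includes(ch);
  let text = '';
  const map = [];        // map[i] = index in source of rendered character i
  let runStart = -1;     // source index where the current whitespace run began

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (isSpace(ch)) {
      if (runStart === -1) runStart = i;
      continue;
    }
    // Emit a single space for an interior run; a leading run is dropped.
    if (runStart !== -1 && text.length > 0) {
      text += ' ';
      map.push(runStart);
    }
    runStart = -1;
    text += ch;
    map.push(i);
  }
  // A trailing run is never emitted.
  return { text, map };
}
```

For example, collapseWithMap(' 1  2   3  ') yields the text '1 2 3', and map[2] is 4, meaning the rendered '2' sits at source offset 4. Real layouts involve neighbouring nodes and CSS, so treat this only as a starting point.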

(1) The behaviour depends on CSS and/or node type: for instance, <pre> or anything with white-space: pre keeps white space, while <p> or anything with white-space: normal has consecutive space characters collapsed and trimmed.

Try the example below:

<p id="p1"> 1  2   3  </p>
<script>document.write(p1.innerText.split(''))</script>

<p id="p2" style="white-space: pre"> 1  2   3  </p>
<script>document.write(p2.innerText.split(''))</script>
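Since footnote (1) says collapsing depends on the white-space property, a check could first look at the computed value before applying any regexp. The sketch below is only an illustration: the function names are made up, and it covers just the standard white-space keywords. In a browser you would probably feed it getComputedStyle(element).whiteSpace.

```javascript
// Keyword values of CSS white-space under which runs of spaces and tabs collapse.
const COLLAPSING_VALUES = new Set(['normal', 'nowrap', 'pre-line']);

// Keyword values under which line breaks in the source are preserved.
const NEWLINE_PRESERVING_VALUES = new Set(['pre', 'pre-wrap', 'pre-line', 'break-spaces']);

// Hypothetical helper: does this white-space value collapse runs of spaces?
function collapsesSpaces(whiteSpaceValue) {
  return COLLAPSING_VALUES.has(whiteSpaceValue);
}

// Hypothetical helper: does this white-space value keep source line breaks?
function preservesNewlines(whiteSpaceValue) {
  return NEWLINE_PRESERVING_VALUES.has(whiteSpaceValue);
}

// Browser-only usage (assumes a DOM; not executed here):
//   const value = getComputedStyle(node.parentElement).whiteSpace;
//   if (collapsesSpaces(value)) { /* apply the whitespace regexp */ }
```

Note that pre-line is the mixed case: spaces collapse but line breaks survive, so a text node holding only "\n" would still be visible as a break there.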