Morix Dev - 5 months ago
jQuery Question

jQuery encoding in html()

Take the following (simple) HTML page:

<html>
<head>
<script src="jquery-1.12.3.min.js"></script>
</head>
<body>
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
</body>
</html>


If I type something like this in the browser console:

$("#test").html()


I obtain:


<img src="/path/to/image?width=1024&amp;height=768">


Why has the & in the img src attribute been turned into &amp;?

I could understand it if the ampersand appeared in paragraph text (or something like that), but why are image sources touched this way? This is going to break the page for further processing...

Isn't there a way to obtain the "raw" HTML of a <div/>?

Answer

Why has the & in the img src attribute been turned into &amp;?

Because it should have been &amp; in the first place; the browser fixed it for you when it parsed the HTML, because browsers are tolerant. :-)

The text inside an HTML attribute is HTML text. In HTML text, both < and & must be encoded, because they both have special meanings: < is the beginning of a tag, and & is the beginning of a character entity. The typical way to encode them is with named character entities: &lt; and &amp; (> is also frequently written &gt;, but that isn't necessary outside a tag). If you have a & that the browser's parser determines doesn't start a character entity, the parser backs up and acts as though it saw &amp; instead. The HTML5 specification addresses this in §8.2.4.2: the & puts the parser in the "data state" and the parser attempts to consume a character reference; if that fails, it falls back to processing the & as a literal &.

So the browser fixed it, and then jQuery retrieved the corrected version and that's what gets logged to the console.
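To make that round trip concrete, here is a minimal sketch of the two steps. It is not the real HTML tokenizer or serializer: the function names are made up for illustration, and only a handful of named references are handled. It just shows that a bare & survives parsing as a literal &, and that writing the value back out as HTML text produces &amp;.

```javascript
// Simplified illustration only; real parsers know far more references and edge cases.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"' };

// "Parse" step: a "&" that starts a known reference is replaced by its character;
// any other "&" is kept as a literal ampersand.
function decodeAttributeText(text) {
  return text.replace(/&([a-z]+);/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(NAMED, name) ? NAMED[name] : match
  );
}

// "Serialize" step: literal "&" and '"' in the value are written as references.
function serializeAttributeValue(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

const asWritten = "/path/to/image?width=1024&height=768";
const value = decodeAttributeText(asWritten);      // what the DOM stores
const serialized = serializeAttributeValue(value); // what html() hands back

console.log(value);      // /path/to/image?width=1024&height=768
console.log(serialized); // /path/to/image?width=1024&amp;height=768
```

Decoding the serialized text again gives back the same value, which is why nothing that parses the HTML properly is affected.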

This is going to break the page for further processing...

Nothing that correctly processes HTML text will be impacted by this, nor will anything that deals with just the value of that attribute rather than the HTML text that defines the value of it.

For instance, if you ask that img element what its src is, you'll get back a string with just an & in it:

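var img = document.querySelector("#test img");
// getAttribute returns the attribute's value, with a plain &
console.log(img.getAttribute("src"));
// the src property returns the same value resolved to an absolute URL
console.log(img.src);

<div id='test'>
  <img src='/path/to/image?width=1024&height=768' />
</div>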

That's because both src and getAttribute return the attribute's value as a string (src resolves it to an absolute URL), not the HTML text we use to write that value.
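As a rough illustration outside the browser, the standard URL class can stand in for what the src property does with the attribute value. The page address below is a made-up assumption; only the relative path comes from the question.

```javascript
// Assumption: the page lives at this hypothetical address.
const pageUrl = "http://example.com/page.html";

// The attribute value as the DOM stores it (plain "&", no "&amp;").
const attributeValue = "/path/to/image?width=1024&height=768";

// Resolve the value against the page address, much as img.src would.
const resolved = new URL(attributeValue, pageUrl);

console.log(resolved.href);                       // http://example.com/path/to/image?width=1024&height=768
console.log(resolved.searchParams.get("width"));  // 1024
console.log(resolved.searchParams.get("height")); // 768
```

Both query parameters come out intact, because the value itself never contained &amp;.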

Similarly, anything using attribute matching selectors will work as well.

// src*="&height" means "an element with a src attribute
// containing &height anywhere in the value"
var img = document.querySelector('img[src*="&height"]');
console.log("Found it? " + (img ? "true" : "false"));

<div id='test'>
  <img src='/path/to/image?width=1024&height=768' />
</div>

&amp; appears only in the HTML text that defines the attribute. If a tool is processing the HTML text, it needs to correctly understand HTML text.
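Here is a small sketch of what "correctly understanding HTML text" means for a tool that works on the string returned by html(). The regex extraction and the single-entity decode are shortcuts for this one known string, not a general approach; a real tool should use a proper HTML parser.

```javascript
// The string the question got back from $("#test").html().
const htmlText = '<img src="/path/to/image?width=1024&amp;height=768">';

// Illustration only: pull the attribute's HTML text out with a regex.
const rawAttr = htmlText.match(/src="([^"]*)"/)[1];

// Wrong: treating the HTML text as if it were the value gives a bogus parameter name.
const naive = new URLSearchParams(rawAttr.split("?")[1]);
console.log(naive.has("height"));     // false
console.log(naive.has("amp;height")); // true

// Right: decode the character reference first, then work with the value.
// (Only &amp; is handled here; that is all this particular string needs.)
const decoded = rawAttr.replace(/&amp;/g, "&");
const params = new URLSearchParams(decoded.split("?")[1]);
console.log(params.get("width"));  // 1024
console.log(params.get("height")); // 768
```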