brandonscript - 1 month ago
Javascript Question

Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings

I'm using the Javascript window.atob() function to decode a base64-encoded string (specifically the base64-encoded content from the GitHub API). The problem is that I'm getting garbled ASCII-encoded characters back (like â¢) instead of the original character. How can I properly handle the incoming base64-encoded stream so that it's decoded as UTF-8?

Answer

There's a great article on Mozilla MDN that describes exactly this issue:

The "Unicode Problem"

Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit ASCII-encoded character. There are two possible methods to solve this problem:

  • the first one is to escape the whole string and then encode it;
  • the second one is to convert the UTF-16 DOMString to an UTF-8 array of characters and then encode it.
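The second method in the list above can be sketched with the standard TextEncoder and TextDecoder APIs, which convert between a string and its UTF-8 bytes. This is only a minimal sketch, not the code from the MDN article: the helper names encodeUtf8ToBase64 and decodeBase64ToUtf8 are made up for illustration, and it assumes an environment where TextEncoder, TextDecoder, btoa and atob are available as globals.

```javascript
// Hypothetical helper names, used here for illustration only.

function encodeUtf8ToBase64(str) {
    // Turn the UTF-16 string into an array of UTF-8 bytes.
    const bytes = new TextEncoder().encode(str);
    // btoa expects a "binary string": one character per byte (0-255).
    let binary = '';
    for (const byte of bytes) {
        binary += String.fromCharCode(byte);
    }
    return btoa(binary);
}

function decodeBase64ToUtf8(b64) {
    // atob returns a binary string; map each character back to its byte value.
    const binary = atob(b64);
    const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    // Interpret the bytes as UTF-8.
    return new TextDecoder().decode(bytes);
}

console.log(encodeUtf8ToBase64('✓ à la mode')); // 4pyTIMOgIGxhIG1vZGU=
console.log(decodeBase64ToUtf8('4pyTIMOgIGxhIG1vZGU=')); // ✓ à la mode
```

Building the binary string in a loop, rather than spreading the byte array into String.fromCharCode, avoids hitting argument-count limits on large inputs.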

A note on the original answer: previously, the MDN article suggested using unescape and escape to solve the Character Out Of Range exception problem. Some of the other answers have suggested working around this (or even not escaping the original string at all), but doing so isn't 100% reliable; in fact, using encodeURIComponent in place of escape doesn't work at all.

Here's the current MDN recommendation, which uses a regular expression in place of the deprecated unescape function to encode a UTF-8 string to base64:

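function utf8_to_b64(str) {
    // encodeURIComponent emits the UTF-8 bytes as %XX escapes; convert each
    // escape into the single character with that byte value so btoa accepts it.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16));
    }));
}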

console.log(utf8_to_b64('✓ à la mode'));
// 4pyTIMOgIGxhIG1vZGU=

Sadly, none of the many regex-escaping combinations I tried works when decoding:

console.log(b64_to_utf8('4pyTIMOgIGxhIG1vZGU='))
// â à la mode
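To see where garbled output like this comes from, it may help to inspect what atob actually returns. The sketch below is an illustration, not part of the original answer: it shows that atob yields one character per byte, so the three UTF-8 bytes of ✓ come back as three separate characters unless something reassembles them. It assumes atob is available as a global.

```javascript
// atob returns a "binary string": each character holds one raw byte.
const raw = atob('4pyTIMOgIGxhIG1vZGU=');

// List the byte values in hex to make the UTF-8 sequences visible.
const byteValues = Array.from(raw, (ch) => ch.charCodeAt(0).toString(16));

console.log('✓ à la mode'.length);   // 11 characters in the original string
console.log(raw.length);             // 14 characters, one per UTF-8 byte
console.log(byteValues.slice(0, 3)); // [ 'e2', '9c', '93' ], the UTF-8 bytes of ✓
```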

In the end, save yourself some agony and just use a library:


The original solution, using escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
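Why the escape/unescape pairing works is easier to follow with the intermediate values printed out. The following is a rough walk-through rather than anything from the MDN article; the variable names are arbitrary, and it assumes btoa, atob, escape and unescape are available as globals.

```javascript
const original = '✓ à la mode';

// Step 1: encodeURIComponent writes the UTF-8 bytes as %XX escapes.
const percentEncoded = encodeURIComponent(original);
console.log(percentEncoded); // %E2%9C%93%20%C3%A0%20la%20mode

// Step 2: unescape turns every %XX into one character with that byte value,
// which gives the one-character-per-byte string that btoa requires.
const binary = unescape(percentEncoded);
console.log(binary.length); // 14

// Decoding reverses this: escape re-creates the %XX form from the bytes,
// and decodeURIComponent reads those escapes as UTF-8.
const reEscaped = escape(atob(btoa(binary)));
console.log(reEscaped === percentEncoded);              // true
console.log(decodeURIComponent(reEscaped) === original); // true
```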

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work properly on (Mobile) Safari, I had to strip all whitespace from the base64 source before I could decode it at all:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, ''); // strip whitespace, e.g. line breaks in the base64 source
    return decodeURIComponent(escape(window.atob( str )));
}
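As a quick check of the whitespace handling, the snippet below feeds the function a base64 string broken across several lines. The wrapped input is a made-up sample that assumes the payload arrives with embedded newlines; it is not real API output. The function is repeated so the snippet runs on its own, and it calls the global atob rather than window.atob so it also runs outside a browser.

```javascript
function b64_to_utf8(str) {
    // Remove spaces, tabs and line breaks before decoding.
    str = str.replace(/\s/g, '');
    return decodeURIComponent(escape(atob(str)));
}

// Hypothetical line-wrapped input; the same base64 as above, split with newlines.
const wrapped = '4pyTIMOg\nIGxhIG1v\nZGU=\n';

console.log(b64_to_utf8(wrapped)); // ✓ à la mode
```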