Jared Smith - 11 days ago
Javascript Question

Need to normalize UTF-8 string encoding to composed characters

So I have some characters like í, ñ, etc. that are percent-encoded in a URL string in an XML document. I need to convert them programmatically from the combining form (e.g. i%CC%81) to their composed UTF-8 character equivalent (%C3%AD in that case).

SO was kind enough to point me to the same question about how to do this in iOS (you can't, you have to create your own lookup table) and C# (apparently you can do this in the general case with built-in functionality in C#).

I need to be able to do it in Python 3.x and, preferably, JavaScript as well. So far I have tried to unquote / decodeURI the string and then re-encode it, but apparently the two forms are not exactly equivalent, because the transforms are lossless (starting with either form, I get the original back).
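
A minimal Python sketch of that round trip, for illustration only (the variable names are made up for this example). It shows that the decomposed and composed forms decode to different code point sequences, so unquoting and re-quoting returns each one unchanged:

```python
import urllib.parse

decomposed = "i%CC%81"  # "i" followed by U+0301 COMBINING ACUTE ACCENT
composed = "%C3%AD"     # U+00ED LATIN SMALL LETTER I WITH ACUTE

for encoded in (decomposed, composed):
    text = urllib.parse.unquote(encoded)
    # List the code points so the difference between the forms is visible.
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    round_tripped = urllib.parse.quote(text)
    print(encoded, code_points, round_tripped)
```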

Is there any way to do this in the general case, or do I need to build my own lookup table and replacement functions? Also, here's an example URL:

file:///some/file/path/3-05%20Melodi%CC%81a%20de%20la%20montan%CC%83a%20.m4a


(Obviously I'm unescaping the XML part).

UPDATE



Christoph's answer below gave me the Python solution and led me to String.prototype.normalize() for JavaScript (note that it is an ES2015 function with mediocre browser support: no IE, and Safari only from version 10).

Answer

In Python 3, urllib.quote moved to urllib.parse.quote, but what you're actually looking for is unicodedata.normalize().

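Starting from a default Python 3 string in the combining (decomposed) form:

import urllib.parse
import unicodedata

# "i" followed by U+0301 COMBINING ACUTE ACCENT (the decomposed form of "í")
s = "i\u0301"
print(urllib.parse.quote(s))
> i%CC%81

# NFC composes the pair into the single code point U+00ED
s = unicodedata.normalize("NFC", s)
print(urllib.parse.quote(s))
> %C3%AD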

which looks to me pretty much like the result you're looking for.