PyreneesJim - 3 months ago
C# Question

Why isn't `Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))==x`

In .NET, why isn't it true that `Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))` returns the original byte array for an arbitrary byte array `x`?

It is mentioned in an answer to another question, but the responder doesn't explain why.

Answer

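Character encodings (UTF-8, specifically) may have different byte forms for the same code point, and not every byte sequence is valid UTF-8 at all.

So when you convert to a string and back, the actual bytes may differ from the input: by default the decoder replaces invalid sequences (including non-canonical, overlong forms) with U+FFFD, which encodes back as `EF BF BD`.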
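To make the round trip concrete, here is a minimal sketch. The `RoundTripDemo` class and its `RoundTrip` helper are names made up for illustration; only `Encoding.UTF8`, `BitConverter.ToString` and `SequenceEqual` are real framework calls. It assumes the default `Encoding.UTF8` instance, which substitutes U+FFFD for bytes it cannot decode.

```csharp
using System;
using System.Linq;
using System.Text;

static class RoundTripDemo
{
    // Hypothetical helper: decode bytes as UTF-8, then encode the string again.
    public static byte[] RoundTrip(byte[] x) =>
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x));

    static void Main()
    {
        byte[] valid = { 0x68, 0x69 };   // "hi": well-formed UTF-8
        byte[] invalid = { 0xFF };       // 0xFF never occurs in well-formed UTF-8

        // Well-formed input survives the round trip unchanged.
        Console.WriteLine(RoundTrip(valid).SequenceEqual(valid));     // True

        // The invalid byte is decoded as U+FFFD, which encodes to three bytes.
        Console.WriteLine(RoundTrip(invalid).SequenceEqual(invalid)); // False
        Console.WriteLine(BitConverter.ToString(RoundTrip(invalid))); // EF-BF-BD
    }
}
```

If silently losing data is not acceptable, one option is to decode with `new UTF8Encoding(false, true)`, whose second argument (`throwOnInvalidBytes`) makes `GetString` throw on malformed input instead of substituting a replacement character.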

See also `String.Normalize(System.Text.NormalizationForm.FormD)`.

See also:

Some Unicode sequences are considered equivalent because they represent the same character. For example, the following are considered equivalent because any of these can be used to represent "ắ":

"\u1EAF" 
"\u0103\u0301" 
"\u0061\u0306\u0301" 

However, ordinal, that is, binary, comparisons consider these sequences different because they contain different Unicode code values. Before performing ordinal comparisons, applications must normalize these strings to decompose them into their basic components.
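As a rough illustration of the quoted passage, the sketch below encodes the three equivalent strings to UTF-8 and compares them before and after normalization. `NormalizationDemo` and `Hex` are hypothetical names; the code assumes a runtime with full Unicode normalization support (the default on .NET Framework and on .NET with its standard globalization settings).

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: UTF-8 bytes of a string, formatted as hex.
    public static string Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s));

    static void Main()
    {
        string s1 = "\u1EAF";             // precomposed character
        string s2 = "\u0103\u0301";       // a-breve + combining acute
        string s3 = "\u0061\u0306\u0301"; // a + combining breve + combining acute

        // Each form has its own UTF-8 byte sequence.
        Console.WriteLine(Hex(s1)); // E1-BA-AF
        Console.WriteLine(Hex(s2)); // C4-83-CC-81
        Console.WriteLine(Hex(s3)); // 61-CC-86-CC-81

        // Ordinal comparison sees different code units.
        Console.WriteLine(string.Equals(s1, s2, StringComparison.Ordinal)); // False

        // After decomposing to FormD, the strings compare equal ordinally.
        string d1 = s1.Normalize(NormalizationForm.FormD);
        string d2 = s2.Normalize(NormalizationForm.FormD);
        Console.WriteLine(string.Equals(d1, d2, StringComparison.Ordinal)); // True
        Console.WriteLine(string.Equals(d1, s3, StringComparison.Ordinal)); // True
    }
}
```

Note that `GetString` and `GetBytes` do not normalize on their own: each of the three strings above encodes to its own bytes and decodes back to the same code units, so normalization only happens when `Normalize` is called explicitly.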

That page comes with a nice sample that shows you which encodings are always normalized.
