C# Question

Can the encoding difference for printable characters between utf-8 and Latin-1 be resolved?

I read that there should be no difference between Latin-1 and UTF-8 for printable characters. I thought that a Latin-1 'Ä' would therefore map into UTF-8 twice: once to the multi-byte version and once directly.

Why does it seem like this is not the case?

It certainly seems like the standard could treat any byte that looks like a continuation byte but is not actually part of a continuation as having its Latin-1 meaning, without losing anything.

Am I just missing a flag or something that would allow me to convert the data as described, or am I missing the bigger picture?

Here is a C# example:

The output on my system is (screenshot): the ascii7 and UTF-8 tests pass, but the Latin-1 test fails, because the single byte 0xC4 does not decode to 'Ä' under UTF-8.

using System;
using System.Text;

internal static class Program
{
    static void Main(string[] args)
    {
        // Decode each byte sequence with UTF-8 and compare against the expected string.
        DecodeTest("ascii7", " ~", new byte[] { 0x20, 0x7E });
        DecodeTest("Latin-1", "Ä", new byte[] { 0xC4 });        // 0xC4 is 'Ä' in Latin-1, but not a complete sequence in UTF-8
        DecodeTest("UTF-8", "Ä", new byte[] { 0xC3, 0x84 });    // the UTF-8 encoding of 'Ä'
    }

    private static void DecodeTest(string testname, string expected, byte[] encoded)
    {
        var utf8 = Encoding.UTF8;
        string actual = utf8.GetString(encoded, 0, encoded.Length);
        //Console_Write(encoded);
        AssertEqual(testname, expected, actual);
    }

    private static void AssertEqual(string testname, string expected, string actual)
    {
        Console.WriteLine("Test: " + testname);
        if (actual != expected)
        {
            Console.WriteLine("\tFail");
            Console.WriteLine("\tExpected: '" + expected + "' but was '" + actual + "'");
        }
        else
        {
            Console.WriteLine("\tPass");
        }
    }

    // Debug helper: prints the bytes as a comma-separated hex list.
    private static void Console_Write(byte[] encoded)
    {
        bool more = false;
        foreach (byte b in encoded)
        {
            if (more)
            {
                Console.Write(", ");
            }
            Console.Write("0x{0:X}", b);
            more = true;
        }
    }
}

Answer

I read that there should be no difference between Latin-1 and UTF-8 for printable characters.

You read wrong. There is no difference between Latin-1 (and many other encodings including the rest of the ISO 8859 family) and UTF-8 for characters in the US-ASCII range (U+0000 to U+007F). They are different for all other characters.
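To make that concrete, here is a minimal C# sketch (the EncodingComparison class is mine, not from the question) comparing the bytes the two encodings produce; an ASCII-range character comes out as the same single byte in both, while 'Ä' does not:

using System;
using System.Text;

static class EncodingComparison
{
    static void Main()
    {
        // ISO-8859-1 is available by default; on .NET 5+ Encoding.Latin1 is equivalent.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = Encoding.UTF8;

        // US-ASCII range: both encodings produce the same single byte.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes("~")));  // 7E
        Console.WriteLine(BitConverter.ToString(utf8.GetBytes("~")));    // 7E

        // Outside the ASCII range the encodings differ.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes("Ä")));  // C4
        Console.WriteLine(BitConverter.ToString(utf8.GetBytes("Ä")));    // C3-84
    }
}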

I thought that a Latin-1 'Ä' would therefore map into UTF-8 twice: once to the multi-byte version and once directly.

For this to be possible, UTF-8 would have to be stateful, or otherwise use information from earlier in the stream to know whether to interpret an octet as a direct mapping or as part of a multi-byte sequence. One of the great advantages of UTF-8 is that it is not stateful.

Why does it seem like this is not the case?

Because it's just plain wrong.

It certainly seems like the standard could treat any byte that looks like a continuation byte but is not actually part of a continuation as having its Latin-1 meaning, without losing anything.

It couldn't do so without losing the property of being stateless, which would mean corruption would destroy the entire text following an error rather than just one character.

Am I just missing a flag or something that would allow me to convert the data as described, or am I missing the bigger picture?

No, you just have a completely incorrect idea about how UTF-8 and/or Latin-1 works.
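If the practical goal is just to turn bytes that really are Latin-1 into a .NET string, the answer is not a flag on the UTF-8 decoder but a different decoder altogether; a minimal sketch, assuming the input is genuinely Latin-1:

using System;
using System.Text;

static class Latin1Decoding
{
    static void Main()
    {
        byte[] latin1Bytes = { 0xC4 };                        // 'Ä' in Latin-1

        // Decode with a Latin-1 decoder instead of UTF-8.
        // On .NET 5+ you can use Encoding.Latin1 directly.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string text = latin1.GetString(latin1Bytes);

        Console.WriteLine(text);                              // Ä
    }
}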

As mentioned above, a flag would remove UTF-8's simplicity in being stateless and self-synchronising (you can always tell immediately whether you are at a single-octet character, at the start of a character, or part-way into a character). It would also remove UTF-8's simplicity in being purely algorithmic. All UTF-8 encodings map as follows.

To map from code-point to encoding:

  1. Consider the bits of the character (the xxxx… below): for U+0027 they are 100111, and for U+1F308 they are 11111001100001000.

  2. Find the smallest of the following templates the bits will fit into, and fill the x positions with those bits, padding with leading zeros:

    0xxxxxxx

    110xxxxx 10xxxxxx

    1110xxxx 10xxxxxx 10xxxxxx

    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So U+0027 is 00100111, i.e. 0x27, and U+1F308 is 11110000 10011111 10001100 10001000, i.e. 0xF0 0x9F 0x8C 0x88.
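A rough C# sketch of that encoding algorithm (the names are mine, and a real implementation would also reject surrogate code-points and values above U+10FFFF):

using System;

static class Utf8EncodeSketch
{
    // Encode a single code-point (0..0x10FFFF) following the templates above.
    public static byte[] EncodeCodePoint(int cp)
    {
        if (cp <= 0x7F)                       // 0xxxxxxx
            return new[] { (byte)cp };
        if (cp <= 0x7FF)                      // 110xxxxx 10xxxxxx
            return new[] { (byte)(0xC0 | (cp >> 6)),
                           (byte)(0x80 | (cp & 0x3F)) };
        if (cp <= 0xFFFF)                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new[] { (byte)(0xE0 | (cp >> 12)),
                           (byte)(0x80 | ((cp >> 6) & 0x3F)),
                           (byte)(0x80 | (cp & 0x3F)) };
                                              // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new[] { (byte)(0xF0 | (cp >> 18)),
                       (byte)(0x80 | ((cp >> 12) & 0x3F)),
                       (byte)(0x80 | ((cp >> 6) & 0x3F)),
                       (byte)(0x80 | (cp & 0x3F)) };
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(EncodeCodePoint(0x0027)));  // 27
        Console.WriteLine(BitConverter.ToString(EncodeCodePoint(0x1F308))); // F0-9F-8C-88
    }
}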

To go from octets to code-points you undo this.
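And a matching decode sketch for a single well-formed sequence (a real decoder must additionally reject stray continuation bytes, truncated sequences and the over-long forms discussed below):

using System;

static class Utf8DecodeSketch
{
    // Decode one code-point starting at 'offset'; assumes the input is well formed.
    public static int DecodeCodePoint(byte[] bytes, int offset)
    {
        byte b = bytes[offset];
        if (b < 0x80) return b;                                  // 0xxxxxxx
        int len = (b & 0xE0) == 0xC0 ? 2                         // 110xxxxx -> two octets
                : (b & 0xF0) == 0xE0 ? 3                         // 1110xxxx -> three octets
                : 4;                                             // 11110xxx -> four octets
        int cp = b & (0xFF >> (len + 1));                        // keep the x bits of the lead byte
        for (int i = 1; i < len; i++)
            cp = (cp << 6) | (bytes[offset + i] & 0x3F);         // append 6 bits per continuation byte
        return cp;
    }

    static void Main()
    {
        Console.WriteLine(DecodeCodePoint(new byte[] { 0x27 }, 0).ToString("X"));                   // 27
        Console.WriteLine(DecodeCodePoint(new byte[] { 0xF0, 0x9F, 0x8C, 0x88 }, 0).ToString("X")); // 1F308
    }
}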

To map to Latin-1 you just put the code-point into an octet (which obviously only works for characters in the range U+0000 to U+00FF).
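In C# that mapping is literally a cast, as in this tiny sketch (ignoring characters above U+00FF):

using System;

static class Latin1Mapping
{
    static void Main()
    {
        char c = 'Ä';                         // U+00C4
        byte b = (byte)c;                     // 0xC4: the code-point value is the Latin-1 byte
        Console.WriteLine("0x{0:X2}", b);     // 0xC4
    }
}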

As you can see, there's no way that a character outside of the range U+0000 to U+007F can have matching encodings in UTF-8 and Latin-1.

There is a way that a character could theoretically have more than one UTF-8 encoding, but it is explicitly banned. Consider that instead of putting the bits of U+0027 into the single unit 00100111 we could also zero-pad them and put them into 11000000 10100111, encoding it as 0xC0 0xA7. The same decoding algorithm would bring us back to U+0027 (try it and see). However, as well as introducing needless complexity by having such synonym encodings, this also introduces security issues, and indeed there have been real-world security holes caused by code that would accept over-long UTF-8.
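You can watch .NET's decoder enforce that ban; a small sketch (the class name is mine):

using System;
using System.Text;

static class OverlongDemo
{
    static void Main()
    {
        byte[] overlong = { 0xC0, 0xA7 };   // over-long encoding of U+0027 -- invalid UTF-8

        // The default decoder does not give back the apostrophe (U+0027): it substitutes U+FFFD.
        Console.WriteLine(Encoding.UTF8.GetString(overlong));

        // A strict decoder throws instead of substituting.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(overlong);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("Rejected as invalid UTF-8");
        }
    }
}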