Earlz Earlz - 9 days ago 6
C# Question

Using unicode characters bigger than 2 bytes with .Net

I'm using this code to generate

U+10FFFC


var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});


I know it's for private-use and such, but it does display a single character as I'd expect when displaying it. The problems come when manipulating this unicode character.

If I later do this:

foreach(var ch in s)
{
Console.WriteLine(ch);
}


Instead of it printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my loop to add these characters back to an empty string like so:

string tmp="";
foreach(var ch in s)
{
Console.WriteLine(ch);
tmp += ch;
}


At the end of this,
tmp
will print just a single character.

What exactly is going on here? I thought that
char
contains a single unicode character and I never had to worry about how many bytes a character is unless I'm doing conversion to bytes. My real use case is I need to be able to detect when very large unicode characters are used in a string. Currently I have something like this:

foreach(var ch in s)
{
if(ch>=0x100000 && ch<=0x10FFFF)
{
Console.WriteLine("special character!");
}
}


However, because of this splitting of very large characters, this doesn't work. How can I modify this to make it work?

Answer

U+10FFFC is one Unicode code point, but string's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. That is a very low-level view of text. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available... I'll try not to rant much about how I don't like this design, and just say that not matter how unfortunate, it is just a (sad) fact you have to live with.

First off, I will suggest using char.ConvertFromUtf32 to get your initial string. Much simpler, much more readable:

var s = char.ConvertFromUtf32(0x10FFFC);

So, this string's Length is not 1, because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length is 2. All code points above U+FFFF require two UTF-16 code units for their representation.

You should note that ConvertFromUtf32 doesn't return a char: char is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char. Sometimes it needs to return two, and that's why it makes it a string. Sometimes you will find some APIs dealing in ints instead of char because int can be used to handle all code points too (that's what ConvertFromUtf32 takes as argument, and what ConvertToUtf32 produces as result).

string implements IEnumerable<char>, which means that when you iterate over a string you get one UTF-16 code unit per iteration. That's why iterating your string and printing it out yields some broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. They are called "surrogates". The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.

When you append those two surrogates to the string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as one gets you the right output.

And in the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. It creates a string with a lone surrogate, and yet everything carries on as if nothing happened: the string type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.

The char structure provides static methods to deal with surrogates: IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, ConvertToUtf32, and ConvertFromUtf32. If you want you can write an iterator that iterates over Unicode characters instead of UTF-16 code units:

static IEnumerable<int> AsCodePoints(this string s)
{
    for(int i = 0; i < s.Length; ++i)
    {
        yield return char.ConvertToUtf32(s, i);
        if(char.IsHighSurrogate(s, i))
            i++;
    }
}

Then you can iterate like:

foreach(int codePoint in s.AsCodePoints())
{
     // do stuff. codePoint will be an int will value 0x10FFFC in your example
}

If you prefer to get each code point as a string instead change the return type to IEnumerable<string> and the yield line to:

yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));

With that version, the following works as-is:

foreach(string codePoint in s.AsCodePoints())
{
     Console.WriteLine(codePoint);
}