user884248 - 1 month ago
C# Question

C#: read the first char of a string, when that char's unicode value is > 65535

I have a C# method that needs to retrieve the first character of a string and check whether it exists in a HashSet that contains specific Unicode characters (all the right-to-left characters).

So I'm doing

var c = str[0];


and then checking the HashSet.

The problem is that this code doesn't work for strings where the first char's code point is larger than 65535.

I actually wrote a loop that goes through all numbers from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). For each number, I create a byte array from it and use

Encoding.UTF32.GetString(BitConverter.GetBytes(intValue));


to create a string with this character. I then pass it to the method that searches in the HashSet, and that method fails, because when it gets

str[0]


that value is never what it should be.
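For illustration, the behaviour described above can be reproduced with a short sketch. It assumes the string is built from the code point's four bytes, as described; the class name `SurrogateDemo`, the helper `FromCodePoint`, and the sample code point U+10800 (in the Cypriot Syllabary block) are hypothetical choices, not taken from the original code.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Hypothetical helper: decodes the code point's four bytes as UTF-32
    // (little-endian, which is what Encoding.UTF32 expects).
    public static string FromCodePoint(int codePoint)
    {
        byte[] bytes = BitConverter.GetBytes(codePoint);
        if (!BitConverter.IsLittleEndian) Array.Reverse(bytes);
        return Encoding.UTF32.GetString(bytes);
    }

    public static void Main()
    {
        string s = FromCodePoint(0x10800);

        // A .NET string is a sequence of UTF-16 code units, so a code point
        // above 0xFFFF occupies two chars (a surrogate pair).
        Console.WriteLine(s.Length);                    // 2
        Console.WriteLine(((int)s[0]).ToString("X4"));  // D802 (high surrogate)
        Console.WriteLine(((int)s[1]).ToString("X4"));  // DC00 (low surrogate)
    }
}
```

So `str[0]` yields 0xD802 rather than 0x10800, and a lookup keyed by code point cannot match it.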

What am I doing wrong?

Answer

To anyone who sees this question in the future and is interested in the solution I ended up with: this is my method, which decides whether a string should be displayed RTL or LTR based on its first character. It takes UTF-16 surrogate pairs into account.

Thanks to Tom Blodget who pointed me in the right direction.

if (string.IsNullOrEmpty(str)) return null;

var firstChar = str[0];
if (firstChar >= 0xd800 && firstChar <= 0xdfff)
{
    // if the first character is between 0xD800 - 0xDFFF, it is a UTF-16 surrogate.
    // a well-formed pair starts with a high surrogate (0xD800-0xDBFF), and there
    // MUST be one more char after it, in the range 0xDC00-0xDFFF (low surrogate).
    // for the very unreasonable chance that this is a corrupt UTF-16 string
    // (no second character, or the surrogates are out of order), validate it
    if (str.Length == 1 || firstChar > 0xdbff || !char.IsLowSurrogate(str[1]))
        return FlowDirection.LeftToRight;

    // convert surrogate pair to a 32 bit number, and check the codepoint table
    var highSurrogate = firstChar - 0xd800;
    var lowSurrogate = str[1] - 0xdc00;
    var codepoint = (highSurrogate << 10) + (lowSurrogate) + 0x10000;

    return _codePoints.Contains(codepoint)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
}
return _codePoints.Contains(firstChar)
    ? FlowDirection.RightToLeft
    : FlowDirection.LeftToRight;
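
As a possible alternative to decoding the pair by hand, the same first-code-point lookup can be sketched with the built-in `char.IsSurrogatePair` and `char.ConvertToUtf32` methods. This is only a sketch: the class name `FirstCodePoint`, the method `Get`, the `-1` sentinel for malformed input, and the two-entry table are assumptions made for illustration, not part of the method above.

```csharp
using System;
using System.Collections.Generic;

public static class FirstCodePoint
{
    // Returns the first Unicode code point of a non-empty string, or -1
    // (a hypothetical sentinel) when the string starts with a lone or
    // out-of-order surrogate, i.e. malformed UTF-16.
    public static int Get(string str)
    {
        if (string.IsNullOrEmpty(str))
            throw new ArgumentException("String must not be null or empty.", nameof(str));

        // Ordinary BMP character: the char value is the code point.
        if (!char.IsSurrogate(str[0])) return str[0];

        // IsSurrogatePair checks for a high surrogate at index 0 followed by
        // a low surrogate at index 1, and returns false if there is no index 1.
        return char.IsSurrogatePair(str, 0) ? char.ConvertToUtf32(str, 0) : -1;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the real table of RTL code points.
        var rtl = new HashSet<int> { 0x05D0, 0x10800 };

        Console.WriteLine(rtl.Contains(Get("\u05D0bc")));      // True
        Console.WriteLine(rtl.Contains(Get("\U00010800bc")));  // True
        Console.WriteLine(rtl.Contains(Get("abc")));           // False
        Console.WriteLine(Get("\uD802"));                      // -1
    }
}
```

Because `-1` is never in the table, a malformed string falls through to the left-to-right case, matching the fallback in the method above.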