remio remio - 11 months ago 104
C# Question

How to convert a UTF-8 string into Unicode?

I have string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
// read the string as UTF-8 bytes.
byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

// convert them into unicode bytes.
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

// builds the converted string.
return Encoding.Unicode.GetString(encodedBytes);

I am playing with the word
. I have converted it into UTF-8 through this online tool, and so I started to test my method with the string

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

Answer Source

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy, however it would be best to find the root cause; the location where someone is copying UTF-8 code units into 16 bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding. E.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).

Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding(1252), Encoding.UTF8);