Samuel Samuel - 2 months ago 6
C++ Question

UTF Encoding for "Ü" returns 3 bytes instead of the "real" unicode

I was playing around with the code mentioned in:
http://stackoverflow.com/a/21575607/2416394 as I have issues writing proper utf8 xml with TinyXML.

Well, I need to encode "LATIN CAPITAL LETTER U WITH DIAERESIS", which is Ü, so that it is properly written to XML etc.

Here is the code taken from the post above:

std::string codepage_str = "Ü";
int size = MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                codepage_str.length(), nullptr, 0 );
std::wstring utf16_str( size, '\0' );
MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                     codepage_str.length(), &utf16_str[ 0 ], size );

int utf8_size = WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                                     utf16_str.length(), nullptr, 0,
                                     nullptr, nullptr );
std::string utf8_str( utf8_size, '\0' );
WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                     utf16_str.length(), &utf8_str[ 0 ], utf8_size,
                     nullptr, nullptr );


The result is a std::string of size 3 with the following bytes:

- utf8_str "Ü" std::basic_string<char,std::char_traits<char>,std::allocator<char> >
[size] 0x0000000000000003 unsigned __int64
[capacity] 0x000000000000000f unsigned __int64
[0] 0x55 'U' char
[1] 0xcc 'Ì' char
[2] 0x88 'ˆ' char


When I write it to a UTF-8 file, the hex values remain 0x55 0xCC 0x88, and Notepad++ shows me the proper character Ü.

However, when I add another Ü to the file via Notepad++ and save it again, the newly written Ü shows up as 0xC3 0x9C (which is what I actually expected in the first place).

I do not understand why I get a 3-byte representation of this character and not the encoding of the expected Unicode code point U+00DC.

Although Notepad++ displays it correctly, our proprietary system renders
0xC3 0x9C
as
Ü
and breaks on
0x55 0xCC 0x88
by rendering
Ü
without recognizing it as a two-byte UTF-8 character.

Answer

Unicode is complicated. There are at least two different ways to get Ü:

1. "LATIN CAPITAL LETTER U WITH DIAERESIS" is Unicode codepoint U+00DC.

2. The letter "U" is Unicode codepoint U+0055, and "COMBINING DIAERESIS" is Unicode codepoint U+0308.

Both display as a capital U with diaeresis.

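Unicode codepoint U+00DC is encoded as 0xc3 0x9c in UTF8, "U" is 0x55 in UTF8, and U+0308 is 0xcc 0x88 in UTF8. So the three bytes 0x55 0xcc 0x88 you are seeing are the second form: a plain "U" followed by the combining diaeresis.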
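To check those byte values by hand, here is a minimal sketch of a UTF-8 encoder for a single code point. The helper name encode_utf8 is made up for this illustration and is not part of the Windows API or of the code in the question.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the original post): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string encode_utf8(char32_t cp) {
    // Valid code points are at most U+10FFFF and exclude UTF-16 surrogates.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, encode_utf8(0x00DC) gives the two bytes 0xC3 0x9C, while encode_utf8(0x0055) + encode_utf8(0x0308) gives the three bytes 0x55 0xCC 0x88 from the debugger output in the question.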

Your proprietary system seems to have a bug.

Edit: to get what you expect, according to https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx , use MB_PRECOMPOSED instead of MB_COMPOSITE.
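As a rough sketch of that change, the conversion from the question could be wrapped like this. The function name codepage_to_utf8 and its codepage parameter are assumptions made for the example; only MultiByteToWideChar, WideCharToMultiByte and the flags are real Windows API names. The code is Windows-only, so it is guarded with _WIN32.

```cpp
#include <cassert>
#include <climits>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical wrapper (name and signature are assumptions): convert text
// from a Windows code page to UTF-8, requesting precomposed characters in
// the intermediate UTF-16 string instead of base letter + combining mark.
std::string codepage_to_utf8(const std::string& in, UINT codepage = CP_ACP) {
    if (in.empty()) return std::string();
    assert(in.size() <= static_cast<size_t>(INT_MAX));
    const int in_len = static_cast<int>(in.size());

    // First call returns the required length in UTF-16 code units.
    int wide_len = MultiByteToWideChar(codepage, MB_PRECOMPOSED,
                                       in.data(), in_len, nullptr, 0);
    if (wide_len <= 0) return std::string();
    std::wstring wide(wide_len, L'\0');
    MultiByteToWideChar(codepage, MB_PRECOMPOSED,
                        in.data(), in_len, &wide[0], wide_len);

    // Same two-step pattern for UTF-16 -> UTF-8.
    int utf8_len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                                       nullptr, 0, nullptr, nullptr);
    if (utf8_len <= 0) return std::string();
    std::string utf8(utf8_len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                        &utf8[0], utf8_len, nullptr, nullptr);
    return utf8;
}
#endif
```

Passing the code page explicitly, for example codepage_to_utf8("\xDC", 1252) for Ü in Windows-1252, avoids depending on the encoding of the source file, and should yield the two bytes 0xC3 0x9C rather than the three-byte decomposed form.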
