CharithJ - 1 month ago
C# Question

How to select the right code page to decode content encoded by CArchive

In .NET I want to decode some raw data encoded by a C++ application. The C++ application is 32-bit and the C# application is 64-bit.

The C++ application supports Russian and Spanish characters, but it doesn't support Unicode. My C# binary reader fails to read Russian or Spanish characters and works only for English ASCII characters.

CArchive doesn't specify any encoding and I am not sure how to read it from C#.

I've tested this with a couple of simple strings; this is what the C++ CArchive produces:

For "ABC" : "03 41 42 43"

For "ÁåëÀÇ 7555Â" : "0B C1 E5 EB C0 C7 20 37 35 35 35 C2"

The following shows how the C++ application writes the binary.

void CColumnDefArray::SerializeData(CArchive& Archive)
{
    int iIndex;
    int iSize;
    int iTemp;
    CString sTemp;

    if (Archive.IsStoring())
    {
        Archive << m_iBaseDataCol;
        Archive << m_iNPValueCol;

        iSize = GetSize();
        Archive << iSize;
        for (iIndex = 0; iIndex < iSize; iIndex++)
        {
            CColumnDef& ColumnDef = ElementAt(iIndex);
            Archive << (int)ColumnDef.GetColumnType();
            Archive << ColumnDef.GetColumnId();
            sTemp = ColumnDef.GetName();
            Archive << sTemp;
        }
    }
}


And this is how I am trying to read it in C#.

The following can decode "ABC" but not the Russian characters. I've tested this.Encoding with all the available options (ASCII, UTF7, etc.). Russian characters work only with Encoding.Default, but apparently that's not a reliable option, because encoding and decoding usually happen on different PCs.

public override string ReadString()
{
    byte blen = ReadByte();
    if (blen < 0xff)
    {
        // *** For Russian characters it comes here. ***
        return this.Encoding.GetString(ReadBytes(blen));
    }

    var slen = (ushort) ReadInt16();
    if (slen == 0xfffe)
    {
        throw new NotSupportedException(ServerMessages.UnicodeStringsAreNotSupported());
    }

    if (slen < 0xffff)
    {
        return this.Encoding.GetString(ReadBytes(slen));
    }

    var ulen = (uint) ReadInt32();
    if (ulen < 0xffffffff)
    {
        var bytes = new byte[ulen];
        for (uint i = 0; i < ulen; i++)
        {
            bytes[i] = ReadByte();
        }

        return this.Encoding.GetString(bytes);
    }

    // No support for 8-byte lengths
    throw new NotSupportedException(ServerMessages.EightByteLengthStringsAreNotSupported());
}


What is the correct approach to decode this? Do you think selecting the right code page is the way to solve it? If so, how can I find out which code page was used to encode the data?

I'd appreciate it if someone could point me in the right direction.

Edit

I guess this question and the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" clear up some doubts. Apparently there is no way to determine the code page from the existing data alone.

I guess now the question is: is there any code page that supports Spanish, Russian, and English characters at the same time? And can I specify the code page in the C++ CArchive class?

Answer

The non-Unicode C++ program writes the data as 0B C1 E5 EB C0 C7 20 37 35 35 35 C2 (the string's length, followed by its bytes).

"ÁåëÀÇ 7555Â" is the representation of those bytes in code page 1252.

On an English-language computer, the following code returns "ÁåëÀÇ 7555Â". It works only if both programs happen to use the same code page:

string result = Encoding.Default.GetString(bytes);

You can also use code page 1252 directly. This will guarantee that the result is always "ÁåëÀÇ 7555Â" for that specific set of bytes:

//result will be `"ÁåëÀÇ 7555Â"`, always
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);



However, this may not solve the real problem. Consider an example with Greek text:

string greek = "ελληνικά";
Encoding cp1253 = Encoding.GetEncoding(1253);
var bytes = cp1253.GetBytes(greek);

bytes will be similar to the output from the C++ program. You can use the same technique to extract the text:

//result will be "åëëçíéêÜ"
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);

The result is "åëëçíéêÜ", but the desired result is "ελληνικά":

//result will be "ελληνικά"
Encoding cp1253 = Encoding.GetEncoding(1253);
string greek_decoded = cp1253.GetString(bytes);

So in order to do the correct conversion you must know the original code page that the C++ program was using (I am just repeating Hans Passant).

You can make the following modification:

public override string ReadString()
{
    //Default code page if both programs use the same code page
    Encoding encoder = System.Text.Encoding.Default;

    //or find out what code page the C++ program is using
    //Encoding encoder = System.Text.Encoding.GetEncoding(codepage);

    //or use English code page to always get "ÁåëÀÇ 7555Â"...
    //Encoding encoder = System.Text.Encoding.GetEncoding(1252);
    //(not recommended)

    byte blen = ReadByte();
    if (blen < 0xff)
        return encoder.GetString(ReadBytes(blen));

    var slen = (ushort)ReadInt16();
    if (slen == 0xfffe)
        throw new NotSupportedException(
            ServerMessages.UnicodeStringsAreNotSupported());

    if (slen < 0xffff)
        return encoder.GetString(ReadBytes(slen));

    var ulen = (uint)ReadInt32();
    if (ulen < 0xffffffff)
    {
        var bytes = new byte[ulen];
        for (uint i = 0; i < ulen; i++)
            bytes[i] = ReadByte();
        return encoder.GetString(bytes);
    }

    throw new NotSupportedException(
        ServerMessages.EightByteLengthStringsAreNotSupported());
}

Additional comments:

The non-Unicode MFC program can take input in English or Russian, but not both languages at the same time. These old programs use char, one byte per character, which leaves at most 255 distinct character codes. That is not enough room for all the alphabets of English, Russian, Greek, Arabic...

Code page 1252 maps those byte values to the Latin alphabet, code page 1253 maps them to the Greek alphabet, and so on.

Therefore your MFC file contains text in only one language, from one code page.

Western European languages (English, Spanish, Portuguese, German, French, Italian, Swedish, etc.) all use code page 1252. If your users stay within this language group, there should not be much trouble: System.Text.Encoding.Default should solve the problem, or better yet System.Text.Encoding.GetEncoding(variable_codepage).

These are some relevant ANSI code pages in Windows:

874 – Windows Thai
1250 – Windows Central and East European Latin 2
1251 – Windows Cyrillic
1252 – Windows West European Latin 1
1253 – Windows Greek
1254 – Windows Turkish
1255 – Windows Hebrew
1256 – Windows Arabic
1257 – Windows Baltic
1258 – Windows Vietnamese

Some Asian languages are not supported without Unicode, and some Unicode symbols are not supported in ANSI; nothing can be done about that.

It is possible to force a non-Unicode program to use more than one code page, but it is not practical. It is much easier to upgrade to Unicode and do it right.

See also "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets"
