mimosa mimosa - 29 days ago 4x
C++ Question

XMLCh to wchar_t and vice versa

My config:

  • Compiler: gnu gcc 4.8.2

  • I compile with C++11

  • platform/OS: Linux 64bit Ubuntu 14.04.1 LTS

I want to pass a wchar_t* to a method and use it with the many Xerces library methods that need an XMLCh*, but I don't know how to convert from one to the other. It's easy if you use char* instead of wchar_t*, but I need to use wide characters. Under Windows I could simply cast from one to the other, but that doesn't work on my Linux machine. Somehow I have to manually convert a wchar_t* to an XMLCh*.

I link against the library libxerces-c-3.1.so, which uses XMLCh* exclusively. XMLCh can hold wide characters, but I don't know how to feed a wchar_t* into it, or how to get a wchar_t* back from an XMLCh*.

I developed this, but it doesn't work (here I return a wstring, which is easier to manage than a pointer when cleaning up memory):

static inline std::wstring XMLCh2W(const XMLCh* tagname)
{
    XMLSize_t len1 = XMLString::stringLen(tagname);
    XMLSize_t outLen = len1 * 4;
    XMLByte ut8[outLen+1];
    XMLSize_t charsEaten = 0;
    XMLTransService::Codes failReason; //Ok | UnsupportedEncoding | InternalFailure | SupportFilesNotFound
    XMLTranscoder* transcoder = XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason, 16*1024);

    unsigned int utf8Len = transcoder->transcodeTo(tagname, len1, ut8, outLen, charsEaten, XMLTranscoder::UnRep_Throw); // XMLTranscoder::UnRep_Throw UnRep_RepChar

    ut8[utf8Len] = 0;
    std::wstring wstr = std::wstring((wchar_t*)ut8); //I'm not sure this is actually ok to do
    return wstr;
}

Tom Tom

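No, you can't do that under GCC on Linux, because there wchar_t is a 32-bit, UTF-32/UCS-4-encoded (the difference is not important for practical purposes) character type, while Xerces-C defines XMLCh as a 16-bit, UTF-16-encoded character type. Casting a wchar_t* to an XMLCh* therefore reinterprets each 32-bit character as two 16-bit code units.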
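If the wchar_t* interface really has to stay, the conversion between the two encodings can be done by hand. The following is only a sketch, assuming a 32-bit wchar_t as on Linux/GCC; Utf16Unit is a hypothetical stand-in for XMLCh so the example compiles without Xerces, and wideToUtf16 and utf16ToWide are illustrative names, not Xerces functions. Invalid input is replaced with U+FFFD rather than reported.

```cpp
#include <cassert>   // only needed for the check in the usage note
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for Xerces' XMLCh: one 16-bit UTF-16 code unit.
typedef char16_t Utf16Unit;
typedef std::basic_string<Utf16Unit> Utf16String;

// Convert a UTF-32 wchar_t string to UTF-16 code units.
inline Utf16String wideToUtf16(const std::wstring& in) {
    static_assert(sizeof(wchar_t) == 4, "sketch assumes a 32-bit wchar_t");
    Utf16String out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;  // not a valid code point: substitute U+FFFD
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<Utf16Unit>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<Utf16Unit>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<Utf16Unit>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Convert UTF-16 code units back to a UTF-32 wchar_t string.
inline std::wstring utf16ToWide(const Utf16String& in) {
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

For plain ASCII text the two strings have the same length, so assert(wideToUtf16(L"abc").size() == 3) holds; a character outside the Basic Multilingual Plane occupies one wchar_t but two UTF-16 units.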

The best I've found is to use the C++11 support for UTF-16 strings:

  • char16_t and XMLCh are equivalent in size and encoding, though not implicitly convertible; you still need to cast between them. But at least this is cheap, compared to transcoding.
  • std::basic_string<char16_t> is the equivalent string type.
  • Use literals of the form u"str" and u's'.
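As a rough illustration of the three points above, the sketch below builds a UTF-16 string from a u"..." literal and casts its buffer to a 16-bit character pointer. FakeXMLCh and asXmlCh are hypothetical names standing in for XMLCh and for whatever helper you write; the typedef to unsigned short is an assumption about how a Xerces build might define XMLCh, not a statement about your build. Strictly speaking, the standard does not promise that reading char16_t data through an unrelated 16-bit type is well-defined, so the cast relies on the compiler tolerating it.

```cpp
#include <cassert>   // only needed for the check in the usage note
#include <string>

// C++11 UTF-16 string type; identical to std::u16string.
typedef std::basic_string<char16_t> U16String;

// Hypothetical stand-in for XMLCh (assumption: an unrelated 16-bit type).
typedef unsigned short FakeXMLCh;
static_assert(sizeof(FakeXMLCh) == sizeof(char16_t),
              "the cast below only makes sense for same-sized types");

// No transcoding happens here: the same buffer is viewed as FakeXMLCh.
inline const FakeXMLCh* asXmlCh(const U16String& s) {
    return reinterpret_cast<const FakeXMLCh*>(s.c_str());
}

// A tag name written with a UTF-16 string literal.
inline U16String exampleTag() {
    return U16String(u"tag");
}
```

The returned pointer is only valid while the string it came from is alive, so keep the U16String in a named variable for the duration of the call, e.g. U16String t = exampleTag(); assert(t.size() == 3); before passing asXmlCh(t) on.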

Unfortunately, VC++ doesn't support the C++11 UTF-16 literals, though wchar_t literals are UTF-16 encoded. So I end up with something like this in a header:

#if defined _MSC_VER
#define U16S(x) L##x
typedef wchar_t my_u16_char_t;
typedef std::wstring my_u16_string_t;
typedef std::wstringstream my_u16_sstream_t;
// No cast here: this assumes XMLCh is wchar_t in the Windows build.
inline const XMLCh* XmlString(const my_u16_char_t* s) { return s; }
inline const XMLCh* XmlString(const my_u16_string_t& s) { return s.c_str(); }
#elif defined __linux
#define U16S(x) u##x
typedef char16_t my_u16_char_t;
typedef std::basic_string<my_u16_char_t> my_u16_string_t;
typedef std::basic_stringstream<my_u16_char_t> my_u16_sstream_t;
// Same size and encoding, but not implicitly convertible, so cast explicitly.
inline const XMLCh* XmlString(const my_u16_char_t* s) { return reinterpret_cast<const XMLCh*>(s); }
inline const XMLCh* XmlString(const my_u16_string_t& s) { return XmlString(s.c_str()); }
#endif

It is, IMO, rather a mess, but not one I can see getting sorted out until VC++ supports C++11 Unicode literals, allowing Xerces to be rewritten in terms of char16_t directly.