Ralph Tandetzky Ralph Tandetzky - 4 months ago 23
C++ Question

How well is Unicode supported in C++11?

I've read and heard that C++11 supports Unicode. A few questions on that:


  • How well does the C++ standard library support Unicode?

  • Does
    std::string
    do what it should?

  • How do I use it?

  • Where are potential problems?


Answer

How well does the C++ standard library support unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"

Comments