Igor Liferenko - 1 month ago
C Question

Why do functions from wctype.h not work without setlocale()?

My setup: glibc 2.24, gcc 6.2.0, UTF-8 environment.

Consider the following example:

#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    wchar_t wc = L'я'; /* 00000100 01001111 */
    if (iswlower(wc)) return 0;
    return 1;
}


Compile and run it:

$ gcc test.c
$ ./a.out; echo $?
0


Now remove setlocale() and run again. The result is different:

$ gcc test.c
$ ./a.out; echo $?
1


Technically, setlocale() is not needed here, because functions from wctype.h work with wide characters, which have a fixed encoding. (It goes without saying that setlocale() is required if we want functions from ctype.h to work correctly with non-ASCII characters, and, if we use the character conversion functions from wchar.h, to set the external encoding.)

Why doesn't the example work without setlocale()?

Answer

The C standard says:

7.25 Wide character classification and mapping utilities <wctype.h> (C99 numbering; 7.30 in C11)

...

The behavior of these functions is affected by the LC_CTYPE category of the current locale.

Moreover (5.2.1 Character sets)

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters.

and then (7.19 Common definitions <stddef.h>; 7.17 in C99)

wchar_t which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales

So there may be many extended character sets, one for each locale. Thus the wchar_t encoding may be locale-dependent, because an encoding is a mapping between a set of integer codes and a set of characters, and the set of characters is potentially locale-dependent.

Given the above, the functions in <wctype.h> must be locale-dependent. Otherwise the standard would have to mandate a single, locale-independent extended character set.
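
To make the locale dependence concrete, here is a minimal sketch of a helper that classifies one wide character under a named locale and then restores the previous setting. The name iswlower_in_locale is hypothetical (it is not a library function), and the sketch assumes that setlocale() returns NULL for a locale it cannot select.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of any library: classify wc with
   iswlower() while LC_CTYPE is temporarily set to locale_name.
   Returns 1 or 0 for the classification, or -1 if the locale could
   not be selected or memory ran out. */
int iswlower_in_locale(const char *locale_name, wchar_t wc)
{
    /* Query the current LC_CTYPE locale. The returned string may be
       overwritten by a later setlocale() call, so keep a copy. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int result = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL)
        result = iswlower((wint_t)wc) ? 1 : 0;

    /* Restore whatever was in effect before the call. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

With the glibc setup from the question, one would expect iswlower_in_locale("C", L'я') to give 0 and iswlower_in_locale("en_US.UTF-8", L'я') to give 1, matching the two exit codes shown above, provided the en_US.UTF-8 locale is installed. Other C libraries may classify differently.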

In this particular example, the value of the wide character constant L'я' (some integer code) may or may not correspond to any member of the extended character set in the "C" locale, which is the locale in effect at program startup when setlocale() has not been called.

As for the specific behaviour of gcc and glibc: for simplicity they always use Unicode (ISO 10646, UCS-4) as the extended character set, under any locale. However, they do not classify extended characters in the "C" locale, because the standard does not require them to. (A wild guess follows.) Full Unicode classification tables are large, and this way programs that only need ASCII don't have to pay for them.
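
Whether an implementation uses ISO 10646 codes for wchar_t can be checked at compile time through the standard macro __STDC_ISO_10646__, which glibc-based toolchains define. The sketch below is only an illustration; the helper names wchar_is_iso10646 and ya_code are hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: reports whether the implementation promises
   that wchar_t values are ISO/IEC 10646 (Unicode) code points. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: the integer code of CYRILLIC SMALL LETTER YA.
   It is written as a universal character name so that the result
   does not depend on the encoding of the source file. */
unsigned long ya_code(void)
{
    return (unsigned long)L'\u044F';
}
```

When wchar_is_iso10646() returns 1, ya_code() is 0x044F whichever locale is active. Only the classification done by iswlower() and its relatives changes with LC_CTYPE, not the code itself.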