Igor Liferenko - 16 days ago
C Question

Why are there no "unsigned wchar_t" and "signed wchar_t" types?

The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore functions which work with single character must use the argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type was char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:

int c;
c = getchar();
if (c != EOF) islower((unsigned char) c);

warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result


And the thing which makes it work without explicit typecasting is automatic promotion from char to int.

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

Some quotations from glibc reference:


it would be legitimate to define wchar_t as char

if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.


So, wchar_t can perfectly well be defined as char, which means that similar rules for wide character types must apply, i.e., there may be implementations where wchar_t is unsigned, and there may be implementations where wchar_t is signed. From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Does anybody know what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character? In other words, is it true that a sign-extended wchar_t is a valid value? See also this question.

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

Consider this example:

#include <locale.h>
#include <ctype.h>
int main(void)
{
    setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");

    /* 11111111 */
    char c = 'ÿ';

    if (islower(c)) return 0;
    return 1;
}


To make it portable, we need the cast to '(unsigned char)'. This is necessary because char may be the equivalent of signed char, in which case a byte where the top bit is set would be sign extended when converting to int, yielding a value that is outside the range of unsigned char.
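The effect of that conversion can be sketched in a few lines. This is only an illustration, assuming 8-bit bytes; the helper names promote_plain and promote_cast are made up for this example and are not part of any library.

```c
/* Plain conversion: if char is signed, a byte with the top bit set
   becomes a negative int, which is outside the range that the
   <ctype.h> functions accept (unsigned char values or EOF). */
int promote_plain(char c)
{
    return c;
}

/* Converting through unsigned char first always yields a value in
   the range 0..UCHAR_MAX, whatever the signedness of plain char. */
int promote_cast(char c)
{
    return (unsigned char)c;
}
```

On an implementation where char is signed, promote_plain applied to the byte 0xFF would typically give -1, while promote_cast gives 255 either way. For characters in the basic execution character set, such as 'a', both helpers return the same value.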

Now, why is this scenario different from the following example for
wide characters?

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
    setlocale(LC_CTYPE, "");
    wchar_t wc = L'ÿ';

    if (iswlower(wc)) return 0;
    return 1;
}


We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

Why are there no unsigned wchar_t and signed wchar_t types?

Answer

TL;DR:

Why are there no unsigned wchar_t and signed wchar_t types?

Because C's wide-character handling facilities were defined such that they are not needed.


In more detail,

The signedness of char is not standardized.

To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)
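Although char must behave like one of the other two types, it remains a third, distinct type. A minimal C11 sketch of that distinction follows; CHAR_KIND and plain_char_is_signed are illustrative names chosen for this example, not standard ones.

```c
#include <limits.h>

/* Maps an expression to the name of its character type (C11 _Generic).
   char, signed char and unsigned char are three distinct types, so all
   three may appear as associations in the same generic selection. */
#define CHAR_KIND(x) _Generic((x),            \
    char:          "char",                    \
    signed char:   "signed char",             \
    unsigned char: "unsigned char",           \
    default:       "not a character type")

/* Returns 1 if plain char has the same range as signed char here. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Note that a character constant such as 'a' has type int in C, so CHAR_KIND('a') selects the default branch, whereas CHAR_KIND((char)'a') selects "char".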

Hence there are signed char and unsigned char types.

"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters. In particular, note that whereas the standard classifies char, signed char, and unsigned char all as character types, it classifies only the latter two as signed or unsigned integer types; plain char is an integer type, but it is a distinct type that belongs to neither group.

Therefore functions which work with single character must use the argument type which can hold both signed char and unsigned char

No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.

Your example of getchar() is inapposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from unsigned char to char.
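To illustrate why the return type is int, here is a small sketch of a function that accepts the kind of value getchar() returns. The name lower_status and its return convention are assumptions made for this example.

```c
#include <ctype.h>
#include <stdio.h>

/* Classifies a value of the kind getchar() returns: either EOF or a
   character converted to unsigned char. Returns -1 for EOF, 1 for a
   lowercase letter, and 0 for any other character. */
int lower_status(int c)
{
    if (c == EOF)
        return -1;              /* not a character at all */
    return islower(c) != 0;     /* c is already in 0..UCHAR_MAX */
}
```

Because EOF is negative and every character value arrives as a non-negative unsigned char value, the two cases cannot collide, and a call such as lower_status(getchar()) needs no cast.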

Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.
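The promotion described here is still visible in modern C wherever the default argument promotions apply, for example in the variable part of a variadic function's argument list. A minimal sketch, with sum_chars being a hypothetical helper written for this illustration:

```c
#include <stdarg.h>

/* Sums `count` character arguments. Arguments passed through "..."
   undergo the default argument promotions, so each char arrives as
   an int and has to be read back with va_arg(ap, int). */
int sum_chars(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

Calling sum_chars(2, a, b) with two variables of type char works because both are promoted to int at the call site; reading them with va_arg(ap, char) would be incorrect.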

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as

an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...].

Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it is unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.

You ask several questions:

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Does anybody know what this means?

I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.

Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?

The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.

In other words, is it true that a sign-extended wchar_t is a valid value?

The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.
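Since the signedness and range of wchar_t are left to the implementation, a program can query them rather than assume them. The following is a rough sketch; the function names are invented for this example, and fits_in_wchar assumes that wchar_t is narrower than long long.

```c
#include <limits.h>
#include <wchar.h>   /* wchar_t, WCHAR_MIN, WCHAR_MAX */

/* Returns 1 if wchar_t is a signed type on this implementation. */
int wchar_is_signed(void)
{
    return (wchar_t)-1 < 0;
}

/* Width of wchar_t in bits, counting the sign bit if there is one. */
int wchar_width_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if the numeric code `code` is representable in wchar_t.
   Assumption: wchar_t is narrower than long long. */
int fits_in_wchar(long long code)
{
    return code >= (long long)WCHAR_MIN && code <= (long long)WCHAR_MAX;
}
```

On an implementation where wchar_t is a signed 16-bit type, fits_in_wchar(40000) would return 0, and such an implementation could not support a character set that needs that code. Where wchar_t is a 32-bit type the same call returns 1.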

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

It depends what you mean by "valid". The standard says that wint_t

is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set.

(C2011, 7.29.1/2)

wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values. For example, if no extended character set of any supported locale uses character codes greater than 32767, then an implementation with 16-bit int would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer.
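The relationship between the two types can be sketched as follows. The helper names to_wint, survives_wint and is_weof are illustrative only.

```c
#include <wchar.h>

/* Widens a wide character to wint_t, the type used for wide-character
   arguments and return values in <wchar.h> and <wctype.h>. */
wint_t to_wint(wchar_t wc)
{
    return (wint_t)wc;
}

/* Returns 1 if converting wc to wint_t and back preserves its value.
   wint_t is only required to hold values that correspond to members
   of the extended character set, so other values may not survive. */
int survives_wint(wchar_t wc)
{
    return (wchar_t)to_wint(wc) == wc;
}

/* Returns 1 if w is the WEOF marker rather than a character. */
int is_weof(wint_t w)
{
    return w == WEOF;
}
```

For any supported character, such as L'a', survives_wint should return 1 and is_weof(to_wint(...)) should return 0, since WEOF is by definition a value that corresponds to no member of the extended character set.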

With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return -- either EOF or a character value converted, if necessary, to unsigned char. The wide character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, so there is no need for a conversion.
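Side by side, the two calling conventions look roughly like this. The wrapper names narrow_is_lower and wide_is_lower are made up for this sketch.

```c
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/* Narrow characters: convert to unsigned char before calling islower(),
   because a negative char value other than EOF is outside its domain. */
int narrow_is_lower(char c)
{
    return islower((unsigned char)c) != 0;
}

/* Wide characters: pass the wchar_t directly. It is converted to the
   wint_t parameter type, which can hold every wide character value. */
int wide_is_lower(wchar_t wc)
{
    return iswlower(wc) != 0;
}
```

In the default "C" locale, both wrappers report 1 for a lowercase ASCII letter and 0 for an uppercase one; results for other characters depend on the locale selected with setlocale().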

You claim in this regard that

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.