David Bowling - 1 year ago
C Question

Is there a simple, portable way to determine the ordering of two characters in C?

According to the standard:

The values of the members of the execution character set are implementation-defined.

(ISO/IEC 9899:1999 5.2.1/1)

Further in the standard:

...the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

(ISO/IEC 9899:1999 5.2.1/3)
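As a small illustration of what this digit guarantee permits, here is a minimal sketch. The helper name digit_value is made up for this example; only isdigit and the subtraction idiom come from the standard library and the quoted paragraph.

```c
#include <ctype.h>

/* Hypothetical helper: returns 0..9 for a decimal digit character,
   or -1 for anything else. The subtraction is portable because
   5.2.1/3 guarantees that '0'..'9' have consecutive values. */
int digit_value(char c)
{
    if (!isdigit((unsigned char)c))
        return -1;
    return c - '0';
}
```

No comparable one-liner is guaranteed to work for letters, since the standard makes no such promise about 'a'..'z'.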

It appears that the standard requires that the execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, but I see no requirement that these characters be ordered in any way. I only see an order stipulation for the decimal digits.

This would seem to imply that, strictly speaking, there is no guarantee that 'a' < 'b'. Now, the letters of the alphabet are in order in each of ASCII, UTF-8, and EBCDIC. But for ASCII and UTF-8 we have 'A' < 'a', while for EBCDIC we have 'a' < 'A'.

It might be nice to have a function in
that compares alphabetic characters portably. Short of this or something similar, it seems to me that one must look in the locale to find the value of
and proceed accordingly, but this doesn't seem simple.

My gut tells me that this is almost never an issue; for most cases alphabetical characters can be handled by converting to lowercase, because for the most commonly used character sets the letters are in order.

The question: given two chars

char c1;
char c2;

is there a simple, portable way to determine if c1 < c2 alphabetically? Or do we assume that the lowercase and uppercase characters always occur in sequence, even though this does not appear to be guaranteed by the standard?

To clarify any confusion, I am really just interested in the 52 letters of the Latin alphabet that are guaranteed by the standard to be in the execution character set. I realize that other sets of letters are important, but it seems that we can't even know about the ordering of this small subset of letters.


I think that I need to clarify a bit more. The issue, as I see it, is that we commonly think of the 26 lowercase letters of the Latin alphabet as being ordered. I would like to be able to assert that 'a' comes before 'b', and we have a convenient way of expressing this in code as 'a' < 'b', when we give 'a' and 'b' integral values. But the standard gives no assurance that this code will evaluate as expected. Why not? The standard does guarantee this behavior for the digits 0-9, and this seems sensible. If I want to determine whether one letter char precedes another, say for sorting purposes, and if I want this code to be truly portable, it seems like the standard offers no help. Now I have to rely on the convention, adopted by ASCII, UTF-8, EBCDIC, etc., that 'a' < 'b' should be true. But this isn't really portable unless every character set in use follows this convention, which may well be the case.

This question originated for me in another question thread: Check if a letter is before or after another letter in C. There, a few people suggested that you could determine the order of two letters stored in chars using inequalities. But one commenter pointed out that this behavior is not guaranteed by the standard.

Answer

For A-Z,a-z in a case-insensitive manner (and using compound literals):

char ch = foo();
/* In base 36, strtol maps 'a'/'A'..'z'/'Z' to 10..35 in any character set. */
long az_rank = strtol((char []){ch, 0}, NULL, 36);
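
To make the strtol idea easier to try out, here is a minimal sketch that wraps it in two helpers. The names az_rank and compare_letters are hypothetical, and the sketch assumes the "C" locale, since other locales may let strtol accept additional forms. It relies on strtol assigning the letters a (or A) through z (or Z) the values 10 through 35 in base 36.

```c
#include <stdlib.h>

/* Hypothetical helper: 0 for 'a'/'A' ... 25 for 'z'/'Z',
   -1 if ch is not one of the 52 basic Latin letters. */
int az_rank(char ch)
{
    char buf[2] = { ch, '\0' };
    char *end;
    long v = strtol(buf, &end, 36);

    /* end == buf means nothing was converted (space, sign, NUL, ...);
       values below 10 mean ch was a digit, not a letter. */
    if (end == buf || v < 10)
        return -1;
    return (int)(v - 10);
}

/* Case-insensitive ordering of two letters:
   negative, zero, or positive, like strcmp. */
int compare_letters(char c1, char c2)
{
    return az_rank(c1) - az_rank(c2);
}
```

With this, compare_letters('a', 'B') is negative and compare_letters('c', 'C') is zero, without assuming anything about the numeric values of the letters themselves.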

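For two chars that are known to be in A-Z or a-z, where the character set may be ASCII or EBCDIC:

int compare2alpha(char c1, char c2) {
  int mask = 'A' ^ 'a';  // Only 1 bit is different between upper/lower
  // OR-ing that bit in folds both letters to the same case before comparing
  return (c1 | mask) - (c2 | mask);
}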

Alternatively, if char is limited to 256 different values, one could use a look-up table that maps each char to its rank. Of course, a hard-coded table is platform dependent.
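
One possible variation on the look-up table, sketched here under my own assumptions rather than taken from the answer: fill the table at run time from string literals, so that the compiler supplies the platform's character values and nothing is hard-coded. The names rank_table, init_rank_table, and letter_rank are hypothetical, and the lazy initialization shown is not thread-safe.

```c
#include <limits.h>

static int rank_table[UCHAR_MAX + 1];
static int rank_table_ready = 0;

/* Fill the table from string literals, so the letter values come
   from the implementation's own execution character set. */
static void init_rank_table(void)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int i;

    for (i = 0; i <= UCHAR_MAX; i++)
        rank_table[i] = -1;              /* -1 marks "not a letter" */
    for (i = 0; i < 26; i++) {
        rank_table[(unsigned char)lower[i]] = i;
        rank_table[(unsigned char)upper[i]] = i;
    }
    rank_table_ready = 1;
}

/* Hypothetical helper: 0..25 for a letter (case-insensitive), else -1. */
int letter_rank(char c)
{
    if (!rank_table_ready)
        init_rank_table();
    return rank_table[(unsigned char)c];
}
```

Two letters can then be ordered with letter_rank(c1) < letter_rank(c2). Sizing the table by UCHAR_MAX + 1 also avoids assuming exactly 256 values.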