Dave Dave - 1 month ago 7
C Question

Unicode characters in C

Does the C standard require that compilers be able to deal with files not encoded as ASCII? Specifically, I am wondering whether UTF-8 files are standards compliant. Does the answer differ between C89, C99 and C11?

Assuming that it is legal to use characters from outside of ASCII in C source files, which usages are legal?

I can think of a few distinct use cases:


  1. Within comments

  2. Within strings

  3. Within identifiers

  4. Within macro names



Here is an example showing all four:

#ifdef PRINT_©
// Print out the © notice
const char my©Notice[] = "This program is © 2016 ACME INC";
puts(my©Notice);
#endif


If C allows non-ASCII characters to appear in the above listed usages, are there any restrictions on the code points which may be used?

Keep in mind that this is a question about the C standards. I already realize that putting Unicode characters into identifiers and macro names will make the code more difficult to use.

Answer

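The encoding of source files is implementation-defined: the standard only specifies a basic source character set and leaves it to each implementation how the characters in the physical file are mapped onto it. So the standard neither requires nor forbids UTF-8 source files.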

I know of at least one compiler, namely Clang, that requires the source to be UTF-8. Other compilers might have other requirements, or not accept non-ASCII source at all.
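As a hedged illustration for the string case: if you cannot rely on how a given compiler decodes the source file, one option is to keep the file pure ASCII and spell out the UTF-8 bytes of © (0xC2 0xA9) with hex escapes. The function name copyright_notice is made up for this sketch, and it assumes the program's output is meant to be UTF-8.

```c
/* Hypothetical helper: returns the notice with (c) encoded as explicit UTF-8 bytes. */
const char *copyright_notice(void)
{
    /* "\xC2\xA9" is U+00A9 encoded in UTF-8; the source file itself stays ASCII. */
    return "This program is \xC2\xA9 2016 ACME INC";
}
```

Passing the result to puts() should show the © sign on a terminal that interprets output as UTF-8.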

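Since C99, identifiers are allowed to contain characters outside the basic set, written either as universal character names (such as \u00E9) or, if the implementation supports it, directly as multibyte characters. Before C99 it would be an extension to allow non-basic characters there. C11 expanded the set of allowed characters.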
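A minimal sketch of what this permits: a universal character name spelled out inside an identifier, which keeps the source file itself ASCII. The names are invented for illustration. U+00E9 (é) is used because it falls inside the allowed range 00D8−00F6, and the sketch assumes a compiler that supports universal character names in identifiers (C99 or later).

```c
/* The variable below is spelled with the universal character name for
 * U+00E9, so its name is "cafe_count" with an accented e. */
static int caf\u00E9_count = 3;

/* Hypothetical accessor with a plain ASCII name. */
int get_cafe_count(void)
{
    return caf\u00E9_count;
}
```

The same spelling works in macro names, since macro names are ordinary identifiers.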

There are some additional restrictions on which characters are allowed in identifiers, and © (U+00A9) is not in the list. The allowed characters are listed in Annex D of the standard. They are given as Unicode code points, but that doesn't strictly mean the encoding of the file has to be Unicode-based.

Ranges of characters allowed

  • 00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
  • 0100−167F, 1681−180D, 180F−1FFF
  • 200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
  • 2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
  • 3004−3007, 3021−302F, 3031−303F
  • 3040−D7FF
  • F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
  • 10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD

Ranges of characters disallowed as the first character of an identifier

  • 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
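
The two lists above can be turned into a small lookup, sketched here as an illustration only. The function and type names are made up, and the code covers just the ranges quoted above; the basic letters, digits and underscore are handled by the ordinary identifier rules and are not part of these lists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One inclusive range of Unicode code points. */
struct cp_range { uint32_t lo, hi; };

/* BMP ranges copied from the "allowed" list above. */
static const struct cp_range allowed_ranges[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

/* Ranges copied from the "disallowed initially" list above. */
static const struct cp_range not_first_ranges[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F},
};

/* Linear search: returns true if cp lies in any of the n ranges. */
static bool in_ranges(const struct cp_range *r, size_t n, uint32_t cp)
{
    for (size_t i = 0; i < n; i++) {
        if (cp >= r[i].lo && cp <= r[i].hi) {
            return true;
        }
    }
    return false;
}

/* True if cp is in one of the listed "allowed" ranges. */
bool ucn_allowed_in_identifier(uint32_t cp)
{
    /* Planes 1 to 14: everything except the last two code points of each plane. */
    if (cp >= 0x10000 && cp <= 0xEFFFD) {
        return (cp & 0xFFFF) <= 0xFFFD;
    }
    return in_ranges(allowed_ranges,
                     sizeof allowed_ranges / sizeof allowed_ranges[0], cp);
}

/* True if cp is allowed and is not in a "disallowed initially" range. */
bool ucn_allowed_as_first(uint32_t cp)
{
    return ucn_allowed_in_identifier(cp) &&
           !in_ranges(not_first_ranges,
                      sizeof not_first_ranges / sizeof not_first_ranges[0], cp);
}
```

With this sketch, ucn_allowed_in_identifier(0x00A9) is false for ©, while a combining mark such as U+0301 is accepted inside an identifier but rejected by ucn_allowed_as_first.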