Igor Liferenko - 1 month ago
C Question

How to fix locale?

Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen), then compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). Neither "lowercase" nor "uppercase" is printed. Only the letter 'я' goes wrong; other letters are correctly classified as lowercase or uppercase.

#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = 'я';
    int i;
    char z;
    /* print the bits of c, most significant first */
    for (i = 7; i >= 0; i--) {
        z = 1 << i;
        if ((z & c) == z) printf("1"); else printf("0");
    }
    printf("\n");

    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}


Why does neither islower() nor isupper() work on the letter я?

UPDATE

The character with code 0xff does not work in any locale. Try fr_FR.ISO-8859-1 as well. It seems like a bug in glibc.

#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
    setlocale(LC_ALL, "fr_FR.ISO-8859-1");

    /* 11111111 */
    char c = 0;
    c |= 1 << 0;
    c |= 1 << 1;
    c |= 1 << 2;
    c |= 1 << 3;
    c |= 1 << 4;
    c |= 1 << 5;
    c |= 1 << 6;
    c |= 1 << 7;

    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}

Answer

Igor, if your file is UTF-8, it makes no sense to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the ru_RU.UTF-8 locale and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):

iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt

On the other hand, -fexec-charset=cp1251 only affects the characters stored in the executable; you have not specified the input charset (-finput-charset) in which the compiler reads the string literals in your source code. Probably the compiler determines that from the environment (which you have set in your LANG or LC_CTYPE environment variables).

Only once you control exactly which locale is used at each stage will you get coherent results.

The main reason for the effort to move all countries to a common charset (UTF-8) is precisely to avoid dealing with all these locale settings at each stage.

If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and whenever you receive a document encoded in UTF-8, you'll have to convert it to see it correctly.

My main recommendation is to switch to UTF-8, as it is an encoding that supports every country's character set, but for now that decision is yours alone.

NOTE

On Debian Linux:

$ sed 's/^/    /' pru-$$.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>

#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)

int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}

Compiled with

$ make pru-$$
cc    pru-1342.c   -o pru-1342

Execution with the ru_RU.CP1251 locale:

$ locale | sed 's/^/    /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=

$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512

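So glibc is not at fault; the fault is in your code. It passes a plain char to islower() and isupper(), and where char is signed (as with gcc on x86), the byte 0xff reaches them as -1 instead of 255.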