I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I want to know who actually uses these locale variables.
Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?
Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.
By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:
Character handling
Collating
Date/time formatting
Numeric editing
Monetary formatting
Messaging
POSIX setlocale documentation contains an incomplete list of locale-dependent functions affected by it:
catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb
E.g.:
printf("%'d\n", 1000000000);
printf("Setting LC_ALL to %s\n", getenv("LANG"));
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000);
Outputs:
1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
I have read that every process has a set of locale variables associated with it.
That's not really true, or at least it is highly over-simplified.
Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)
Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:
setlocale, which modifies the library's locale configuration, and
localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).
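As a rough illustration of the second interface, here is a minimal sketch that reads two of the fields localeconv exposes. The helper name describe_numeric_conventions is made up for this example; only localeconv, struct lconv and its decimal_point and thousands_sep members come from the standard.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: describe the numeric conventions of the locale that is
   currently in effect. Returns what snprintf returns (the length needed). */
int describe_numeric_conventions(char *buf, size_t size)
{
    /* The returned structure belongs to the library; do not modify or free it. */
    const struct lconv *lc = localeconv();

    return snprintf(buf, size, "decimal_point=\"%s\" thousands_sep=\"%s\"",
                    lc->decimal_point, lc->thousands_sep);
}
```

In the "C" locale the decimal point is "." and the thousands separator is the empty string; after setlocale(LC_NUMERIC, "") the values depend on the user's environment.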
The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library calls these "facets", which is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)
The standard categories are:
LC_CTYPE: Character classification and case conversion.
LC_COLLATE: Collation order.
LC_MONETARY: Monetary formatting.
LC_NUMERIC: Numeric, non-monetary formatting.
LC_TIME: Date and time formats.
and the Posix extension is:
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, standard C offers no way to query any specific configuration. (Posix adds nl_langinfo, which exposes some further items.)
Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:
The C standard defines the locale name C.
Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.
The details for locale-naming are (or should be) detailed in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)
The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).
The category names shown in the list of categories above are macros defined in <locale.h>. That header file also defines an additional macro, LC_ALL, which selects all categories at once. One of these macros must be used as the first argument to setlocale.
The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.
But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.
(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)
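The query-then-restore pattern can be sketched roughly as follows. The helper names switch_locale and restore_locale are invented for this example. The one assumption worth stressing is that the string returned by setlocale may be overwritten by a later call, so it is copied before the locale is changed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: switch `category` to the locale `name` and return a
   heap copy of the previous setting, or NULL if anything fails. */
char *switch_locale(int category, const char *name)
{
    const char *current = setlocale(category, NULL);  /* NULL: query only */
    if (current == NULL)
        return NULL;

    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, current);  /* copy before the next setlocale call */

    if (setlocale(category, name) == NULL) {  /* unknown locale name */
        free(saved);
        return NULL;
    }
    return saved;
}

/* Hypothetical helper: restore a setting returned by switch_locale. */
void restore_locale(int category, char *saved)
{
    if (saved != NULL) {
        setlocale(category, saved);
        free(saved);
    }
}
```

A caller would bracket the locale-sensitive work between the two calls, passing the same category to both.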
Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.
On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:
First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)
Finally, some system-wide default is used.
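The lookup order above can be sketched as a small function. This is only an illustration of the precedence, not what any particular library does internally: effective_locale_name and nonempty_env are made-up names, the fallback argument stands in for the system-wide default, and the sketch returns the value of LANG unchanged even though an implementation may process it further.

```c
#include <stdlib.h>
#include <string.h>

/* Value of the environment variable `name` if it is set and non-empty,
   otherwise NULL. */
static const char *nonempty_env(const char *name)
{
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/* Hypothetical helper mirroring the lookup order described above.
   `category_var` is the variable for one category, e.g. "LC_TIME"; pass NULL
   when resolving for LC_ALL. `fallback` stands in for the system default. */
const char *effective_locale_name(const char *category_var, const char *fallback)
{
    const char *value;

    if ((value = nonempty_env("LC_ALL")) != NULL)
        return value;
    if (category_var != NULL && (value = nonempty_env(category_var)) != NULL)
        return value;
    if ((value = nonempty_env("LANG")) != NULL)
        return value;
    return fallback;
}
```

With LANG=en_US.UTF-8 and LC_TIME=en_GB.UTF-8 in the environment and LC_ALL unset, this returns en_GB.UTF-8 for "LC_TIME" and en_US.UTF-8 for every other category.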
Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)
As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.
Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned char) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.
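To make the strcmp/strcoll distinction concrete, here is a small sketch that sorts an array of strings either way. The names sort_words, by_bytes and by_collation are invented for the example; qsort, strcmp and strcoll are the standard functions.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparators: each argument points at a `const char *` element. */
static int by_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static int by_collation(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort `words` in place. With use_locale == 0 the order
   is by byte value; otherwise it follows the current LC_COLLATE setting. */
void sort_words(const char **words, size_t count, int use_locale)
{
    qsort(words, count, sizeof words[0], use_locale ? by_collation : by_bytes);
}
```

In the "C" locale both comparators put "Cherry" before "apple", because uppercase letters have smaller byte values. After setlocale(LC_COLLATE, "") in a typical English locale, the strcoll variant would usually give apple, banana, Cherry instead, but that depends on the locale data installed.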
Many programs set the locale, and use it at least for internationalization. Some specific examples:
LANG="en_GB.UTF-8"
This is the locale for any category you didn’t specifically set to something else. It allows the system to add new locale variables in a backward-compatible way.
LC_COLLATE="en_GB.UTF-8"
This selects which language’s sorting order is used on strings. For example, in traditional Spanish collation, Ch is treated as a single letter and sorts after Cz. One C library function that uses it is strcoll(), and POSIX commands that do include ls (when you sort files by name) and sort.
LC_CTYPE="en_GB.UTF-8"
This determines the current character encoding. In C11, you can set this and then use wide-character input and output, such as wprintf(). The library will transparently convert between wide characters and the character set used by the outside world. This still doesn’t quite work on Windows, unless you do some extra magic, but elsewhere, UTF-8 has become the standard. An increasing number of programs, such as clang (as of version 7), no longer support anything but UTF-8.
LC_MESSAGES="en_GB.UTF-8"
This determines what language and character set you see localized messages in. In C on Unix/Linux, these would typically be loaded by the gettext library from a compiled .mo message catalog (generated from a .po file).
LC_MONETARY="en_GB.UTF-8"
This affects how strfmon() formats monetary quantities.
LC_NUMERIC="en_GB.UTF-8"
This determines the formatting of numbers that aren’t amounts of money.
LC_TIME="en_GB.UTF-8"
This affects the formatting of time. Try LC_TIME=fr_FR.UTF-8 date in the shell to see an example. (Or use locale -a | grep UTF to select some suitably-exotic locale.) Also a good test of whether your timezone and ntpd are working properly.
LC_ALL=
Use LANG instead of this. LC_ALL sets every locale category at once, but it also overrides the values in all the other locale variables. It exists for backward compatibility.
For example, I use LANG=en_US.utf8 on my Linux box, but I override LC_TIME=en_GB.utf8 to get 24-hour time in English. This would not be possible to do if LC_ALL were set.
LANG also allows your defaults to carry over into whatever other locale information your system supports, such as LC_ADDRESS, LC_IDENTIFICATION, LC_RESPONSE, LC_MEASUREMENT and LC_TELEPHONE.
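If you want to see the LC_TIME effect from C rather than from the shell, a sketch along these lines should do. The helper name format_long_date is made up; strftime and gmtime are standard, and the format string is just one possible choice.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format `when` (interpreted as UTC) as a long date,
   using whatever LC_TIME locale is currently in effect.
   Returns the number of characters written, or 0 on failure. */
size_t format_long_date(time_t when, char *buf, size_t size)
{
    const struct tm *utc = gmtime(&when);
    if (utc == NULL)
        return 0;

    /* %A and %B expand to the locale's weekday and month names. */
    return strftime(buf, size, "%A %d %B %Y", utc);
}
```

Without a setlocale call the names come out in the "C" locale, i.e. in English. Calling setlocale(LC_TIME, "") first and running the program with LC_TIME=fr_FR.UTF-8 should produce French names, provided that locale is installed.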
I am using setlocale(LC_ALL,"Portuguese") so my program can read Brazilian Portuguese accented words like "joão" from a text file and print them on screen, and it works fine for this purpose. But when I try to input a word like "joão" from the keyboard using gets() or scanf(), the string saved is different from the input. Any advice?
If you are expecting terminal input, it is rarely correct to use setlocale in any way other than
setlocale(LC_ALL, "");
That will set the program's locale to the environment's locale. Normally, the locale setting in the interactive environment corresponds to the configuration of the terminal, so it represents the expectation of the interactive user. Changing the program's locale has no effect on the terminal [Note 1], so if you do change it, it will simply mean that the program's locale no longer corresponds to the user's expectations.
It would be correct to setlocale for file input if you provide some mechanism to specify the environment for the file [Note 2]. In Unix, however, the simplest way for the user to specify that is on the command-line:
LC_ALL=pt_BR.utf8 ./my_command the_portuguese_file.utf8
For Windows, you may want to provide a different mechanism to communicate the file's locale to the program. But in the absence of such a declaration, using the locale configured in the environment will usually be the correct option.
The one exception to the above is programs which prefer to be locale-unaware, which may wish to set the locale to "C" (or "POSIX", but "C" does not require a Posix-compatible setlocale). That can be useful to do as a form of self-documentation, but it is not necessary because a program which does not call setlocale at all will be executed in the "C" locale (on most operating systems).
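Put together, a program that reads from the terminal might start roughly like this. It is only a sketch: adopt_user_locale and read_line are names invented for the example, and the fallback to "C" when the environment names an unavailable locale is my own choice, not something any standard requires.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: adopt the locale configured in the environment.
   Falls back to "C" if the environment names a locale that is unavailable.
   Returns the name of the locale now in effect. */
const char *adopt_user_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Hypothetical helper: read one line with a bounded buffer (unlike gets) and
   strip the trailing newline. Returns buf, or NULL at end of input. */
char *read_line(FILE *in, char *buf, size_t size)
{
    if (size == 0 || fgets(buf, (int)size, in) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}
```

fgets stores the bytes exactly as the terminal sent them, so whether "joão" round-trips correctly depends on the terminal's encoding matching the locale, which is the point made above.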
Notes
1. In most cases, changing the environment's locale by modifying the value of the environment variable LC_ALL also has no effect on the terminal configuration. Indeed, the terminal may not even be part of the environment; for example, if you have a remote ssh/telnet session, or the GUI equivalent. A user should first configure their terminal according to their expectations, and then configure their environment to correspond; they will expect utility programs they run to respect the environment setting.
2. Aside from the strings "C", "POSIX" and "", there are no standards which will let you even know what possible locale names are, which is yet another reason not to try to set the locale except when the user has asked you to.
First of all, this must be really solved in C, and with UNIX standard C functions (because of project constraints). So, C++ or alternative libraries are outside the scope of the question.
I know how to set the default user locale with setlocale, as well as setting the standard C/POSIX locales.
However, I'm in a situation where the decimal separator is file-specified, so I want my program to temporarily change the decimal separator.
LC_NUMERIC expects a locale name... but I don't want to give it a locale name, but the separator character directly.
How can this be done?
Well, I'm afraid you won't like the solution :)
First of all, since you're operating with setlocale, you have to supply a locale name. Therefore, a locale whose LC_NUMERIC category is defined the way you want must exist at the time the program runs, which means you need to define a new locale. You may define it with localedef. You may use this doc as a guide for making and using a new locale and this site to get source files which you can use as a template for your custom locale definition.
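For completeness, a different approach that avoids defining a locale at all is to rewrite the text rather than the locale: substitute the file's separator with whatever decimal point the current locale expects, then call strtod. This is not the localedef route described above, just a sketch of an alternative; parse_with_separator is a made-up name, and the sketch assumes the separator character does not appear in the number for any other purpose (for instance as a grouping character).

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse `text`, whose decimal separator is `file_sep`,
   by replacing that character with the current locale's decimal point before
   calling strtod. Returns 0 and stores the value in *out on success, or -1 if
   the text is not entirely a number or memory runs out. */
int parse_with_separator(const char *text, char file_sep, double *out)
{
    const char *point = localeconv()->decimal_point;
    size_t point_len = strlen(point);

    /* Worst case: every input character is a separator. */
    char *buf = malloc(strlen(text) * (point_len ? point_len : 1) + 1);
    if (buf == NULL)
        return -1;

    char *dst = buf;
    for (const char *src = text; *src != '\0'; ++src) {
        if (*src == file_sep) {
            memcpy(dst, point, point_len);
            dst += point_len;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';

    char *end;
    double value = strtod(buf, &end);
    int ok = (end != buf && *end == '\0');
    free(buf);

    if (ok)
        *out = value;
    return ok ? 0 : -1;
}
```

For example, parse_with_separator("3,5", ',', &v) yields 3.5 regardless of which LC_NUMERIC locale the program is running in.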
I need to build an OS, a very small and basic one with minimal functionality, coded in C.
Probably a CUI OS which does some memory management and has at least a text editor and a calculator. It's just going to be an experiment in writing code that has full and direct control over the hardware.
Still, I'll need an interface, and that will need input/output functions like printf(&args), scanf(&args). Now my basic question is: should I use existing headers or code everything from scratch, and why?
I'd be very thankful for any help.
First, you can't link against anything from libc ... you're going to have to code everything from scratch.
Now having worked on a micro-kernel myself, I would not use the actual stdio headers that come with libc since they are going to be cluttered with a lot of extra information that will be either irrelevant for your OS, or will create compiler errors due to missing definitions, etc. What I would do though is keep the function signatures for these standard functions the same ... so in the end you would have a file called stdio.h for your OS, but it would be a very stripped down header file with the basic minimum requirements for your needs, and only having the standard I/O functions you need, with the correct standard signatures.
Keep in mind on the back-end, i.e., in your stdio.c file, you're going to have to point these functions to a custom console-driver or some other type of character driver for your display. Either that, or you could just use them as wrappers for some other kernel-level display printing routine. You are also going to want to make sure that even though you may use a #include <stdio.h> directive in your other OS code modules to access these printing functions, you do not link against libc. With gcc, this can be done by compiling with -ffreestanding and linking with -nostdlib.
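To give an idea of what such a back-end could look like, here is a very reduced sketch. Everything in it is an assumption for illustration: the routine is called kprintf so it can be tried on a hosted system without clashing with libc (in your own stdio.h you would declare it as printf), console_set_putc is a made-up hook for your console driver, and fake_screen_putc is a memory-backed stand-in for real hardware. Only %c, %s, %d, %u, %x and %% are handled.

```c
#include <stdarg.h>

/* Hook for the console driver (hypothetical): the kernel points this at its
   own character output routine, e.g. a VGA text buffer or a UART. */
static void (*console_putc)(char) = 0;
static int emitted;  /* characters produced by the current kprintf call */

void console_set_putc(void (*fn)(char)) { console_putc = fn; }

static void emit(char c)
{
    if (console_putc)
        console_putc(c);
    ++emitted;
}

static void emit_unsigned(unsigned long v, unsigned base)
{
    char digits[sizeof v * 8];  /* large enough for any base >= 2 */
    int n = 0;
    do {
        digits[n++] = "0123456789abcdef"[v % base];
        v /= base;
    } while (v != 0);
    while (n > 0)
        emit(digits[--n]);
}

/* Minimal printf-like routine; returns the number of characters produced. */
int kprintf(const char *fmt, ...)
{
    va_list ap;
    emitted = 0;
    va_start(ap, fmt);
    for (; *fmt != '\0'; ++fmt) {
        if (*fmt != '%') { emit(*fmt); continue; }
        ++fmt;
        switch (*fmt) {
        case 'c': emit((char)va_arg(ap, int)); break;
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s != '\0') emit(*s++);
            break;
        }
        case 'd': {
            int v = va_arg(ap, int);
            unsigned long mag = (unsigned long)v;
            if (v < 0) { emit('-'); mag = 0UL - mag; }
            emit_unsigned(mag, 10);
            break;
        }
        case 'u': emit_unsigned(va_arg(ap, unsigned), 10); break;
        case 'x': emit_unsigned(va_arg(ap, unsigned), 16); break;
        case '%': emit('%'); break;
        case '\0': --fmt; break;  /* lone '%' at the end: stop cleanly */
        default: emit('%'); emit(*fmt); break;
        }
    }
    va_end(ap);
    return emitted;
}

/* Stand-in "driver" for trying the sketch on a hosted system. */
char fake_screen[128];
unsigned fake_cursor;

void fake_screen_putc(char c)
{
    if (fake_cursor + 1 < sizeof fake_screen) {
        fake_screen[fake_cursor++] = c;
        fake_screen[fake_cursor] = '\0';
    }
}
```

After console_set_putc(fake_screen_putc), a call such as kprintf("%s=%d", "x", -42) leaves "x=-42" in fake_screen. Note that <stdarg.h> is one of the headers a freestanding compiler still provides, so it does not pull in libc.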
Just retarget newlib.
printf, scanf, etc. rely on implementation-specific functions to get a single char or print a single char. You can then make your stdin and stdout UART 1, for example.
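As a rough idea of what retargeting involves: newlib's output functions end up calling a low-level stub, conventionally named _write, that you supply. The sketch below assumes that convention; uart1_putc and the memory-backed log are hypothetical stand-ins for real UART 1 register access, which would depend entirely on your hardware.

```c
/* Hypothetical board-support routine. On real hardware this would wait for
   the transmitter to be ready and write to the UART 1 data register; here it
   appends to a memory buffer so the sketch can be tried on a hosted system. */
char uart1_tx_log[64];
int uart1_tx_count;

void uart1_putc(char c)
{
    if (uart1_tx_count < (int)sizeof uart1_tx_log)
        uart1_tx_log[uart1_tx_count++] = c;
}

/* Output stub in the shape newlib expects: send `len` bytes from `ptr`
   and report how many were written. */
int _write(int file, char *ptr, int len)
{
    (void)file;  /* in this sketch stdout and stderr both go to UART 1 */
    for (int i = 0; i < len; ++i)
        uart1_putc(ptr[i]);
    return len;
}
```

A matching _read stub that pulls characters from the UART receiver would be needed for scanf and friends; check the newlib documentation for the full list of stubs your build requires.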
The kernel itself would not require the printf and scanf functions unless you want to stay in kernel mode and run the apps you have planned there. But for basic printf and scanf features, you can write your own printf and scanf functions, which would provide basic support for printing and taking input. I do not have much experience with this, but you can try making a console buffer, where the keyboard driver puts the ASCII characters it has read (after conversion from scan codes), and then make printf and scanf work on it. I have one basic implementation where I wrote a gets instead of scanf and kept things simple. To get integer input, you can write an atoi function to convert the string to a number.
To port other libraries, you need to implement the components those libraries depend on. You need to decide whether you can build that support into the kernel so that the libraries can be ported. If that is more difficult, then coding some basic input/output functions yourself won't be a bad choice at this stage, I think.
A user of my program has reported problems reading a settings file written by my program. I looked at the settings file in question and instead of decimal points using the period "." it uses commas ",".
I'm assuming this is to do with locales?
The file i/o is using fprintf and mpfr_out_str for file output and getline combined with atol, atof, mpfr_set_str, etc for file input.
What do I do here? Should I force my program to always use periods even if the machine's locale wants to use commas? If so, where do I start?
Edit: I've just noticed that this problem occurs when specifying the settings file to use on the command line instead of loading it via the GUI - would this indicate a problem on the OP's machine or in my code?
Do you call setlocale at all? If not, I would suggest either embedding the locale used to generate the file in the settings file or force all settings file I/O to use the C locale, via the previous suggestion of setlocale(LC_ALL, "C").
One other option is to use the locale specific formatting functions (suffixed with _l in MSVC) and create the C locale explicitly, via _create_locale(LC_ALL, "C").
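For the setlocale route, a sketch of writing one setting with a fixed decimal point might look like this. The helper name write_setting and the "key = value" layout are invented for the example. Only LC_NUMERIC is switched, and the previous setting is restored afterwards. Keep in mind that setlocale changes process-wide state, so this is not safe if other threads format numbers at the same time.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write "key = value\n" with a '.' decimal point no
   matter which LC_NUMERIC locale the rest of the program uses.
   Returns 0 on success, -1 on failure. */
int write_setting(FILE *out, const char *key, double value)
{
    /* Remember the current setting; copy it because setlocale may reuse
       the buffer on the next call. */
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved = NULL;
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved == NULL)
            return -1;
        strcpy(saved, current);
    }

    setlocale(LC_NUMERIC, "C");
    int rc = fprintf(out, "%s = %.17g\n", key, value);

    if (saved != NULL) {
        setlocale(LC_NUMERIC, saved);
        free(saved);
    }
    return rc < 0 ? -1 : 0;
}
```

The reading side would need the same treatment around atof or strtod, so that the file is parsed with the same conventions it was written with.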