I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I want to know who actually uses these locale variables.
Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?
Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.
By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:
Character handling
Collating
Date/time formatting
Numeric editing
Monetary formatting
Messaging
The POSIX documentation for setlocale contains an incomplete list of functions whose behaviour depends on the current locale:
catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb
E.g.:
printf("%'d\n", 1000000000);
printf("Setting LC_ALL to %s\n", getenv("LANG"));
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000);
Outputs:
1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
I have read that every process has a set of locale variables associated with it.
That's not really true, or at least it is highly over-simplified.
Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)
Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:
setlocale, which modifies the library's locale configuration, and
localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).
The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library calls these "facets", which is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)
The standard categories are:
LC_CTYPE: Character classification and case conversion.
LC_COLLATE: Collation order.
LC_MONETARY: Monetary formatting.
LC_NUMERIC: Numeric, non-monetary formatting.
LC_TIME: Date and time formats.
and the Posix extension is:
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, there is no way to query any specific configuration.
Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:
The C standard defines the locale name C.
Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.
The rules for locale naming are (or should be) described in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)
The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).
The category names shown in the list of categories above are macros defined in <locale.h>. An additional macro, LC_ALL, is also defined by that header file. One of these macros must be used as the first argument to setlocale.
The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.
But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.
(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)
Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.
On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:
First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)
Finally, some system-wide default is used.
Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)
As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.
Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned char) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.
Many programs set the locale, and use it at least for internationalization. Some specific examples:
LANG="en_GB.UTF-8"
This is the locale for any category you didn’t specifically set to something else. It allows the system to add new locale variables in a backward-compatible way.
LC_COLLATE="en_GB.UTF-8"
This selects which language’s sorting order is used on strings. For example, Ch is considered a letter in Spanish and would come after Cz. One C library function that uses it is strcoll(), and POSIX commands that do include ls (when you sort files by name) and sort.
LC_CTYPE="en_GB.UTF-8"
This determines the current character encoding. In C11, you can set this and then use wide-character input and output, such as wprintf(). The library will transparently convert between wide characters and the character set used by the outside world. This still doesn’t quite work on Windows, unless you do some extra magic, but elsewhere, UTF-8 has become the standard. An increasing number of programs, such as clang (as of version 7), no longer support anything but UTF-8.
LC_MESSAGES="en_GB.UTF-8"
This determines what language and character set you see localized messages in. In C on Unix/Linux, these would typically be loaded from a .po file by the gettext library.
LC_MONETARY="en_GB.UTF-8"
This affects how strfmon() formats monetary quantities.
LC_NUMERIC="en_GB.UTF-8"
This determines the formatting of numbers that aren’t amounts of money.
LC_TIME="en_GB.UTF-8"
This affects the formatting of time. Try LC_TIME=fr_FR.UTF-8 date in the shell to see an example. (Or use locale -a | grep UTF to select some suitably-exotic locale.) Also a good test of whether your timezone and ntpd are working properly.
LC_ALL=
Use LANG instead of this. LC_ALL sets every locale category at once, and it overrides the values in all the other locale variables. It exists for backward compatibility.
For example, I use LANG=en_US.utf8 on my Linux box, but I override LC_TIME=en_GB.utf8 to get 24-hour time in English. This would not be possible to do if LC_ALL were set.
LANG also allows your defaults to carry over into whatever other locale information your system supports, such as LC_ADDRESS, LC_IDENTIFICATION, LC_RESPONSE, LC_MEASUREMENT and LC_TELEPHONE.
I am using setlocale(LC_ALL,"Portuguese") so my program can read Brazilian Portuguese accented words like "joão" from a text file and print them on screen, and it works fine for this purpose. But when I try to input a word like "joão" from the keyboard using gets() or scanf(), the string saved is something different from the input. Any advice?
If you are expecting terminal input, it is rarely correct to use setlocale in any way other than
setlocale(LC_ALL, "");
That will set the program's locale to the environment's locale. Normally, the locale setting in the interactive environment corresponds to the configuration of the terminal, so it represents the expectation of the interactive user. Changing the program's locale has no effect on the terminal [Note 1], so if you do change it, it will simply mean that the program's locale no longer corresponds to the user's expectations.
It would be correct to setlocale for file input if you provide some mechanism to specify the environment for the file [Note 2]. In Unix, however, the simplest way for the user to specify that is on the command-line:
LC_ALL=pt_BR.utf8 ./my_command the_portuguese_file.utf8
For Windows, you may want to provide a different mechanism to communicate the file's locale to the program. But in the absence of such a declaration, using the locale configured in the environment will usually be the correct option.
The one exception to the above is programs which prefer to be locale-unaware, which may wish to set the locale to "C" (or "POSIX", but "C" does not require a Posix-compatible setlocale). That can be useful to do as a form of self-documentation, but it is not necessary because a program which does not call setlocale at all will be executed in the "C" locale (on most operating systems).
Notes
In most cases, changing the environment's locale by modifying the value of the environment variable LC_ALL also has no effect on the terminal configuration. Indeed, the terminal may not even be part of the environment; for example, if you have a remote ssh/telnet session, or the GUI equivalent. A user should first configure their terminal according to their expectations, and then configure their environment to correspond; they will expect utility programs they run to respect the environment setting.
Aside from the strings "C", "POSIX" and "", there are no standards which will let you even know what possible locale names are, which is yet another reason not to try to set the locale except when the user has asked you to.
What is the most portable way to access locale information?
I'm interested in time locale data, such as month names, day of week names, local time format etc.
Ideally I'd like a POSIX interface, but if it doesn't exist, glibc-specific one will do.
If possible, getting the information about the locale X shouldn't require setting it (using uselocale() or similar).
Calling strftime() many times with all sorts of parameters is considered a hack, not a solution.
If there's nothing better, I'm willing to consider directly parsing glibc's locale files if there's a reliable way to determine their location.
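nl_langinfo is a POSIX-standard interface for returning that information and appears to have available all of the things that you're looking for. Sadly, it reports on the current locale, so it does require that you call setlocale before calling it. POSIX.1-2008 added nl_langinfo_l, which takes a locale_t created with newlocale and so lets you query an arbitrary locale without first making it the current locale; on systems that lack it, I don't see an interface that does this.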
I have some questions, but I can't find a straight answer anywhere.
So, basically, I know what a locale is and I know how to use (set) it, but what I don't know is
how it works behind the scenes, and I would very much like to know.
So, when I use I/O functions, for example scanf to read a float, something has to decide whether my country uses a decimal point or a decimal comma (I am actually from a decimal-comma country :)).
Does the scanf function "look" at the current locale?
But if I don't set it in my code, does the library create some standard locale by default, or does it get one from the OS?
For example, in the part of the code where you get the handles to the console for stdout, stderr and stdin?
By default your program will have the C locale.
When you run setlocale(LC_ALL,""); you will set the locale from the outside environment (or you can set just some of the LC_* categories).
By calling setlocale(LC_ALL,"specific_locale"); you will set the specific locale.
All locale-sensitive standard C I/O functions (printf, scanf and so on) follow the current locale.
The behind-the-scenes behaviour depends on the operating system and C library you are using.
A user of my program has reported problems reading a settings file written by my program. I looked at the settings file in question and instead of decimal points using the period "." it uses commas ",".
I'm assuming this is to do with locales?
The file i/o is using fprintf and mpfr_out_str for file output and getline combined with atol, atof, mpfr_set_str, etc for file input.
What do I do here? Should I force my program to always use periods even if the machine's locale wants to use commas? If so, where do I start?
Edit: I've just noticed that this problem occurs when specifying the settings file to use on the command line instead of loading it via the GUI - would this indicate a problem on the OP's machine or in my code?
Do you call setlocale at all? If not, I would suggest either embedding the locale used to generate the file in the settings file or forcing all settings file I/O to use the C locale, via the previous suggestion of setlocale(LC_ALL, "C").
One other option is to use the locale-specific formatting functions (suffixed with _l in MSVC) and create the C locale explicitly, via _create_locale(LC_ALL, "C").