I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I want to know who actually uses these locale variables.
Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?
Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.
By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:
Character handling
Collating
Date/time formatting
Numeric editing
Monetary formatting
Messaging
POSIX setlocale documentation contains an incomplete list of locale-dependent functions affected by it:
catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb
E.g.:
printf("%'d\n", 1000000000);
printf("Setting LC_ALL to %s\n", getenv("LANG"));
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000);
Outputs:
1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
I have read that every process has a set of locale variables associated with it.
That's not really true, or at least it is highly over-simplified.
Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)
Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:
setlocale, which modifies the library's locale configuration, and
localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).
The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library calls these "facets", which is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)
The standard categories are:
LC_CTYPE: Character classification and case conversion.
LC_COLLATE: Collation order.
LC_MONETARY: Monetary formatting.
LC_NUMERIC: Numeric, non-monetary formatting.
LC_TIME: Date and time formats.
and the Posix extension is:
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, there is no way to query any specific configuration.
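As a minimal sketch of what localeconv exposes, the following reads a few fields of the struct lconv it returns. The helper names current_decimal_point and print_numeric_conventions are invented for this illustration; only localeconv and the struct lconv members come from the standard.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: the decimal point string of the active locale. */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* Hypothetical helper: print a few LC_NUMERIC / LC_MONETARY settings. */
void print_numeric_conventions(void)
{
    const struct lconv *conv = localeconv();

    printf("decimal point:       \"%s\"\n", conv->decimal_point);
    printf("thousands separator: \"%s\"\n", conv->thousands_sep);
    printf("currency symbol:     \"%s\"\n", conv->currency_symbol);
}
```

Calling print_numeric_conventions() before and after setlocale(LC_ALL, "") should show how the values change with the locale; in the initial "C" locale the decimal point is "." and the thousands separator is empty.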
Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:
The C standard defines the locale name C.
Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.
The rules for locale naming are (or should be) described in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)
The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).
The category names shown in the list of categories above are macros defined in <locale.h>. That header file also defines an additional macro, LC_ALL. One of these macros must be used as the first argument to setlocale.
The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.
But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.
(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)
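The difference between the two calls can be sketched as follows. The function names copy_current_locale and use_user_locale are made up for this example. The copy is taken because the string returned by setlocale may be overwritten by a later call to setlocale.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: query (without changing) the locale name for
 * `category` and return a heap-allocated copy, or NULL on failure.
 * The caller frees the result. */
char *copy_current_locale(int category)
{
    const char *name = setlocale(category, NULL);   /* NULL: query only */
    if (name == NULL)
        return NULL;

    char *copy = malloc(strlen(name) + 1);
    if (copy != NULL)
        strcpy(copy, name);
    return copy;
}

/* Hypothetical helper: switch every category to the user's configured
 * locale. Returns 1 on success, 0 if the library rejected the request. */
int use_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;           /* "": user default */
}
```

A program could call copy_current_locale(LC_ALL) at startup (where the result is expected to name the C locale), then use_user_locale(), and later pass the saved string back to setlocale to restore the original configuration.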
Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.
On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:
First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)
Finally, some system-wide default is used.
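The lookup order above can be sketched in a few lines. This is only an illustration of the precedence rules, not how any particular C library implements them; pick_locale_name and effective_locale_name are hypothetical names, and the "C" returned in the last step is a stand-in for the unspecified system-wide default.

```c
#include <stddef.h>
#include <stdlib.h>

/* An environment variable counts only if it is defined and non-empty. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper: apply the precedence LC_ALL > LC_<category> > LANG. */
const char *pick_locale_name(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return "C";   /* assumption: placeholder for the system-wide default */
}

/* Hypothetical helper: the same decision, reading the real environment.
 * `category_var` is a variable name such as "LC_TIME". */
const char *effective_locale_name(const char *category_var)
{
    return pick_locale_name(getenv("LC_ALL"),
                            getenv(category_var),
                            getenv("LANG"));
}
```

For example, effective_locale_name("LC_TIME") returns the value of LC_TIME only when LC_ALL is unset or empty, which is why setting LC_ALL hides per-category overrides.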
Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)
As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.
Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned char) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.
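To make the strcmp/strcoll distinction concrete, here is a small sketch that sorts an array of strings with either comparison. The name sort_words and its use_locale flag are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Byte-wise comparison: ignores the locale entirely. */
static int compare_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Locale-aware comparison: follows the current LC_COLLATE setting. */
static int compare_collated(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort `count` strings in place, either by raw
 * byte value or by the collation order of the current locale. */
void sort_words(const char **words, size_t count, int use_locale)
{
    qsort(words, count, sizeof *words,
          use_locale ? compare_collated : compare_bytes);
}
```

With byte-wise comparison on an ASCII system, "Zebra" sorts before "apple" because 'Z' has a smaller code than 'a'. After setlocale(LC_ALL, "") in a typical natural-language locale, the collated sort would usually place "apple" first, but the exact order depends on the locale definition.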
Many programs set the locale, and use it at least for internationalization. Some specific examples:
LANG="en_GB.UTF-8"
This is the locale for any category you didn’t specifically set to something else. It allows the system to add new locale variables in a backward-compatible way.
LC_COLLATE="en_GB.UTF-8"
This selects which language’s sorting order is used on strings. For example, Ch is considered a letter in Spanish and would come after Cz. One C library function that uses it is strcoll(), and POSIX commands that do include ls (when you sort files by name) and sort.
LC_CTYPE="en_GB.UTF-8"
This determines the current character encoding. In C11, you can set this and then use wide-character input and output, such as wprintf(). The library will transparently convert between wide characters and the character set used by the outside world. This still doesn’t quite work on Windows, unless you do some extra magic, but elsewhere, UTF-8 has become the standard. An increasing number of programs, such as clang (as of version 7), no longer support anything but UTF-8.
LC_MESSAGES="en_GB.UTF-8"
This determines what language and character set you see localized messages in. In C on Unix/Linux, these would typically be loaded by the gettext library from a compiled .mo catalogue, which is built from a .po file.
LC_MONETARY="en_GB.UTF-8"
This affects how strfmon() formats monetary quantities.
LC_NUMERIC="en_GB.UTF-8"
This determines the formatting of numbers that aren’t amounts of money.
LC_TIME="en_GB.UTF-8"
This affects the formatting of time. Try LC_TIME=fr_FR.UTF-8 date in the shell to see an example. (Or use locale -a | grep UTF to select some suitably-exotic locale.) Also a good test of whether your timezone and ntpd are working properly.
LC_ALL=
Use LANG instead of this. LC_ALL sets every locale category at once, but it also overrides the values of all the other locale variables. It exists for backward compatibility.
For example, I use LANG=en_US.utf8 on my Linux box, but I override LC_TIME=en_GB.utf8 to get 24-hour time in English. This would not be possible to do if LC_ALL were set.
LANG also allows your defaults to carry over into whatever other locale information your system supports, such as LC_ADDRESS, LC_IDENTIFICATION, LC_RESPONSE, LC_MEASUREMENT and LC_TELEPHONE.
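As a rough illustration of the LC_TIME entry above, this sketch formats a timestamp with strftime, whose day and month names follow the current LC_TIME setting. The function name format_long_date is an assumption made for the example, and UTC is used only to keep the result independent of the time zone.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: write a date such as "<weekday> <day> <month> <year>"
 * into `buf`, using the names defined by the current LC_TIME category.
 * Returns the number of characters written, or 0 on failure. */
size_t format_long_date(time_t when, char *buf, size_t size)
{
    const struct tm *utc = gmtime(&when);   /* UTC avoids time zone effects */
    if (utc == NULL)
        return 0;
    return strftime(buf, size, "%A %d %B %Y", utc);
}
```

Without any setlocale call the names are the English ones of the C locale. After setlocale(LC_TIME, "") with LC_TIME=fr_FR.UTF-8 in the environment, and provided that locale is installed, the same call should produce French names.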
I am using setlocale(LC_ALL,"Portuguese") so my program can read Brazilian Portuguese words with accents, like "joão", from a text file and print them on the screen, and it works fine for this purpose. But when I try to input a word like "joão" from the keyboard using gets() or scanf(), the string saved is different from the input. Any advice?
If you are expecting terminal input, it is rarely correct to use setlocale in any way other than
setlocale(LC_ALL, "");
That will set the program's locale to the environment's locale. Normally, the locale setting in the interactive environment corresponds to the configuration of the terminal, so it represents the expectation of the interactive user. Changing the program's locale has no effect on the terminal [Note 1], so if you do change it, it will simply mean that the program's locale no longer corresponds to the user's expectations.
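One way to see why the program's locale has to match the terminal is to decode the input bytes with the multibyte functions, which interpret them according to the current LC_CTYPE. The sketch below is only an illustration; count_characters is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in a multibyte string as
 * decoded under the current LC_CTYPE. Returns (size_t)-1 if the bytes
 * are not valid in that encoding. */
size_t count_characters(const char *text)
{
    mbstate_t state;
    const char *cursor = text;

    memset(&state, 0, sizeof state);              /* initial shift state */
    return mbsrtowcs(NULL, &cursor, 0, &state);   /* NULL: count only */
}
```

If the terminal sends UTF-8 and the program has called setlocale(LC_ALL, "") in a UTF-8 environment, "joão" should decode as four characters. If the program's locale names a different encoding, the same bytes may decode as a different string or fail to decode at all.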
It would be correct to setlocale for file input if you provide some mechanism to specify the environment for the file [Note 2]. In Unix, however, the simplest way for the user to specify that is on the command-line:
LC_ALL=pt_BR.utf8 ./my_command the_portuguese_file.utf8
For Windows, you may want to provide a different mechanism to communicate the file's locale to the program. But in the absence of such a declaration, using the locale configured in the environment will usually be the correct option.
The one exception to the above is programs which prefer to be locale-unaware, which may wish to set the locale to "C" (or "POSIX", but "C" does not require a Posix-compatible setlocale). That can be useful to do as a form of self-documentation, but it is not necessary because a program which does not call setlocale at all will be executed in the "C" locale (on most operating systems).
Notes
1. In most cases, changing the environment's locale by modifying the value of the environment variable LC_ALL also has no effect on the terminal configuration. Indeed, the terminal may not even be part of the environment; for example, if you have a remote ssh/telnet session, or the GUI equivalent. A user should first configure their terminal according to their expectations, and then configure their environment to correspond; they will expect utility programs they run to respect the environment setting.
2. Aside from the strings "C", "POSIX" and "", there are no standards which will let you even know what possible locale names are, which is yet another reason not to try to set the locale except when the user has asked you to.
First of all, this must be really solved in C, and with UNIX standard C functions (because of project constraints). So, C++ or alternative libraries are outside the scope of the question.
I know how to set the default user locale with setlocale, as well as setting the standard C/POSIX locales.
However, I'm in a situation where the decimal separator is file-specified, so I want my program to temporarily change the decimal separator.
LC_NUMERIC expects a locale name... but I don't want to give it a locale name, but the separator character directly.
How can this be done?
Well, I'm afraid you won't like the solution :)
First of all, since you're operating with setlocale you have to supply a locale name. Therefore there must be a locale whose LC_NUMERIC properties are the ones you need at the time of program execution, which means you need to define a new locale. You may define it with localedef. You may use this doc as a guide for making and using a new locale and this site to get source files which you can use as a template for your custom locale definition.
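Once such a locale exists, switching to it temporarily could look roughly like the sketch below. The function name parse_double_in_locale is hypothetical, and the locale name passed to it is whatever name the custom locale was installed under. Note that setlocale changes process-wide state, so this approach is not safe if other threads are running.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse `text` with strtod while LC_NUMERIC is
 * temporarily set to `locale_name`, then restore the previous setting.
 * Returns 0 on success, -1 if the locale could not be set or no number
 * could be parsed. */
int parse_double_in_locale(const char *text, const char *locale_name,
                           double *out)
{
    const char *current = setlocale(LC_NUMERIC, NULL);
    if (current == NULL)
        return -1;

    /* Copy the name: the returned string may be overwritten later. */
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int status = -1;
    if (setlocale(LC_NUMERIC, locale_name) != NULL) {
        char *end;
        double value = strtod(text, &end);
        if (end != text) {          /* at least one character was consumed */
            *out = value;
            status = 0;
        }
        setlocale(LC_NUMERIC, saved);   /* restore the previous category */
    }
    free(saved);
    return status;
}
```

For instance, parse_double_in_locale("3.5", "C", &value) parses with a "." separator regardless of the user's locale; a custom locale with a different decimal_point would be used the same way.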
I use OS X Yosemite.
When I run locale I get this:
locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Question
Is the emptiness of LANG and LC_ALL bad/normal/preferred?
Normally, I wouldn't care that much about it, but I've got a warning
(process:16182): Gtk-WARNING **: Locale not supported by C library.
Using the fallback 'C' locale.
when I was using GTK (here's a link to my previous question on this).
People have been struggling with this problem in many languages (Python for example) and different OS (Ubuntu for example).
The point is I couldn't find any solution for C language and OS X.
I would guess that the GTK warning is because GTK is actually trying to use the Mac language and locale settings from System Preferences to make a locale identifier string, using that string with setlocale(), and being told that the C library doesn't support that locale. As a result, it's defaulting to the "C" locale. If it weren't trying to find a better locale, there would be little reason to warn that it's using the "C" locale because that's what would be expected when LANG and LC_ALL are unset.
OS X has support for many languages and locales in the high-level frameworks (Cocoa, etc.), but not all of those are also supported at the level of the C library. What are the language and locale settings in System Preferences? What locale identifier would you expect for your language and locale? See if that's in the output from locale -a (or, similarly, if there's a directory for it in /usr/share/locale).
Another thing to check is Terminal's preferences. On the Settings pane, under the Advanced tab, is "Set locale environment variables on startup" set? If not, then those environment variables won't be set by default, which might explain what you're seeing. If the setting is enabled but you're still not getting those environment variables, that suggests that Terminal was not able to find a suitable C-library locale that matches your system settings.
Finally, you can simply try setting LANG to what you want to use. For example:
export LANG=pl_PL.UTF-8
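In a C program, whether the C library accepts a given locale name can be checked from the return value of setlocale, which is NULL when the request cannot be honoured. The following is a sketch of that check with a fallback similar to the one the warning describes; apply_locale_or_fallback is a name made up for this example and says nothing about how GTK itself is implemented.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: try to switch all categories to `name` (use ""
 * for the environment's locale). If the C library rejects it, report
 * that and fall back to the "C" locale. Returns the locale string now
 * in effect. */
const char *apply_locale_or_fallback(const char *name)
{
    const char *applied = setlocale(LC_ALL, name);
    if (applied == NULL) {
        fprintf(stderr, "locale \"%s\" not supported by C library, "
                        "falling back to \"C\"\n", name);
        applied = setlocale(LC_ALL, "C");
    }
    return applied;
}
```

Calling apply_locale_or_fallback("") after exporting LANG shows whether the exported name is one the C library actually knows.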
getenv is used for accessing environment variables on Mac OS X and Linux, it takes char* as input. Does that mean that I cannot store UNICODE strings as value in these environment variables on these Systems?
While on Windows GetEnvironmentVariable etc, return wide strings that can accommodate UTF16 strings.
Unix systems were not designed with wide strings in mind, so there is no way to create or read wide-string environment variables.
For Windows, the C runtime provides an extension, wchar_t *_wgetenv( const wchar_t *varname );, but this is of little use on Unix systems.
On current Linux (and probably also Mac OS X), UTF-8 encoded strings are very common. (But there are exceptions; see the locale command, etc.)
As Michael Burr commented, you could assume that getenv returns a UTF-8 string. But if you want maximal portability, use only ASCII in environment variables.
From the C or C++ programmer's point of view, getenv(3) returns a char *, and you may want to use UTF-8 related functions to handle it. Notice that getenv does not return a wchar_t * pointer.
See the locale(7) man page and notice that the current locale could be defined by environment variables like LANG, LC_ALL, etc... See environ(7).
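As a small sketch of handling such a value, the code below counts Unicode code points in a string on the assumption that it is valid UTF-8, as suggested above. Both function names are invented for the example, and no validation of the encoding is attempted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: count code points in a string assumed to be
 * valid UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx). */
size_t count_utf8_code_points(const char *text)
{
    size_t count = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Hypothetical helper: code points in an environment variable's value,
 * or (size_t)-1 if the variable is not set. */
size_t env_code_points(const char *name)
{
    const char *value = getenv(name);
    return value == NULL ? (size_t)-1 : count_utf8_code_points(value);
}
```

A value such as "joão" occupies five bytes in UTF-8 but counts as four code points, which is the kind of difference to keep in mind when treating the char * from getenv as text.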