I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I want to know who actually uses these locale variables.
Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?
Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.
By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:
Character handling
Collating
Date/time formatting
Numeric editing
Monetary formatting
Messaging
The POSIX setlocale documentation contains an incomplete list of the locale-dependent functions it affects:
catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb
E.g.:
printf("%'d\n", 1000000000);
printf("Setting LC_ALL to %s\n", getenv("LANG"));
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000);
Outputs:
1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
I have read that every process has a set of locale variables associated with it.
That's not really true, or at least it is highly over-simplified.
Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)
Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:
setlocale, which modifies the library's locale configuration, and
localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).
The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library calls these "facets", which is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)
The standard categories are:
LC_CTYPE: Character classification and case conversion.
LC_COLLATE: Collation order.
LC_MONETARY: Monetary formatting.
LC_NUMERIC: Numeric, non-monetary formatting.
LC_TIME: Date and time formats.
and the Posix extension is:
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, the C standard provides no way to query any specific configuration. (Posix adds nl_langinfo, which exposes some further items.)
Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:
The C standard defines the locale name C.
Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.
The details for locale-naming are (or should be) detailed in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)
The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).
The category names shown in the list of categories above are macros defined in <locale.h>. An additional macro, LC_ALL, is also defined by that header file. One of these macros must be used as the first argument to setlocale.
The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.
But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.
(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)
Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.
On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:
First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)
Finally, some system-wide default is used.
Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)
As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.
Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned char) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.
Many programs set the locale, and use it at least for internationalization. Some specific examples:
LANG="en_GB.UTF-8"
This is the locale for any category you didn’t specifically set to something else. It allows the system to add new locale variables in a backward-compatible way.
LC_COLLATE="en_GB.UTF-8"
This selects which language’s sorting order is used on strings. For example, Ch is considered a letter in Spanish and would come after Cz. One C library function that uses it is strcoll(), and POSIX commands that do include ls (when you sort files by name) and sort.
LC_CTYPE="en_GB.UTF-8"
This determines the current character encoding. In C11, you can set this and then use wide-character input and output, such as wprintf(). The library will transparently convert between wide characters and the character set used by the outside world. This still doesn’t quite work on Windows, unless you do some extra magic, but elsewhere, UTF-8 has become the standard. An increasing number of programs, such as clang (as of version 7), no longer support anything but UTF-8.
LC_MESSAGES="en_GB.UTF-8"
This determines what language and character set you see localized messages in. In C on Unix/Linux, these would typically be loaded by the gettext library from a compiled .mo catalog, which is built from a .po file.
LC_MONETARY="en_GB.UTF-8"
This affects how strfmon() formats monetary quantities.
LC_NUMERIC="en_GB.UTF-8"
This determines the formatting of numbers that aren’t amounts of money.
LC_TIME="en_GB.UTF-8"
This affects the formatting of time. Try LC_TIME=fr_FR.UTF-8 date in the shell to see an example. (Or use locale -a | grep UTF to select some suitably-exotic locale.) Also a good test of whether your timezone and ntpd are working properly.
LC_ALL=
Use LANG instead of this. LC_ALL sets every locale category at once, but it also overrides the values of all the other locale variables. It exists for backward compatibility.
For example, I use LANG=en_US.utf8 on my Linux box, but I override LC_TIME=en_GB.utf8 to get 24-hour time in English. This would not be possible to do if LC_ALL were set.
LANG also allows your defaults to carry over into whatever other locale information your system supports, such as LC_ADDRESS, LC_IDENTIFICATION, LC_RESPONSE, LC_MEASUREMENT and LC_TELEPHONE.
I am writing my own Posix C library from scratch and I have hit a stumbling block when it comes to internationalization and the ctype functions. I see in the POSIX standard several functions in the locale.h header that let end-user programs set and access locales, but nothing on how to initially store the locale information from the locale file for the library's use.
Is this just some nonstandard library internal, custom to each implementation?
POSIX specifies the optional localedef utility and a locale source format it can read and convert to whatever data format your implementation uses internally. If you opt to support localedef, then the source structure for locales is data in the localedef format, but you can design whatever intermediary format you like for easy/efficient/whatever access at runtime.
Otherwise, if you're not supporting localedef, how you implement locale is completely up to you. POSIX specifies how various interfaces behave, but not how you achieve those features, nor what degrees of freedom locales might vary by. It's possible for a conforming implementation to have nothing but the C/POSIX locale.
From The Linux Programming Interface:
There are two different methods of setting the locale using setlocale(). The locale
argument may be a string specifying one of the locales defined on the system (i.e.,
the name of one of the subdirectories under /usr/lib/locale), such as de_DE or en_US.
Alternatively, locale may be specified as an empty string, meaning that locale settings should be taken from environment variables:
setlocale(LC_ALL, "");
We must make this call in order for a program to be cognizant of the locale environment variables. If the call is omitted, these environment variables will have no effect on the program.
So, as I understand it, if my program doesn't call setlocale explicitly, it will use the default locale, which is POSIX on *nix systems, right? I can't find this stated in the referenced documentation.
Looking at the manual:
7.4 How Programs Set the Locale
A C program inherits its locale environment variables when it starts up. This happens automatically. However, these variables do not automatically control the locale used by the library functions, because ISO C says that all programs start by default in the standard ‘C’ locale. To use the locales specified by the environment, you must call setlocale. Call it as follows:
setlocale (LC_ALL, "");
Emphasis mine
I am using setlocale(LC_ALL,"Portuguese") so my program can read Brazilian Portuguese words with accents, like "joão", from a text file and print them on screen, and it works fine for this purpose. But when I try to input a word like "joão" from the keyboard using gets() or scanf(), the string saved is different from the input. Any advice?
If you are expecting terminal input, it is rarely correct to use setlocale in any way other than
setlocale(LC_ALL, "");
That will set the program's locale to the environment's locale. Normally, the locale setting in the interactive environment corresponds to the configuration of the terminal, so it represents the expectation of the interactive user. Changing the program's locale has no effect on the terminal [Note 1], so if you do change it, it will simply mean that the program's locale no longer corresponds to the user's expectations.
It would be correct to setlocale for file input if you provide some mechanism to specify the environment for the file [Note 2]. In Unix, however, the simplest way for the user to specify that is on the command-line:
LC_ALL=pt_BR.utf8 ./my_command the_portuguese_file.utf8
For Windows, you may want to provide a different mechanism to communicate the file's locale to the program. But in the absence of such a declaration, using the locale configured in the environment will usually be the correct option.
The one exception to the above is programs which prefer to be locale-unaware, which may wish to set the locale to "C" (or "POSIX", but "C" does not require a Posix-compatible setlocale). That can be useful to do as a form of self-documentation, but it is not necessary because a program which does not call setlocale at all will be executed in the "C" locale (on most operating systems).
Notes
1. In most cases, changing the environment's locale by modifying the value of the environment variable LC_ALL also has no effect on the terminal configuration. Indeed, the terminal may not even be part of the environment; for example, if you have a remote ssh/telnet session, or the GUI equivalent. A user should first configure their terminal according to their expectations, and then configure their environment to correspond; they will expect utility programs they run to respect the environment setting.
2. Aside from the strings "C", "POSIX" and "", there are no standards which will let you even know what possible locale names are, which is yet another reason not to try to set the locale except when the user has asked you to.
First of all, this must be really solved in C, and with UNIX standard C functions (because of project constraints). So, C++ or alternative libraries are outside the scope of the question.
I know how to set the default user locale with setlocale, as well as setting the standard C/POSIX locales.
However, I'm in a situation where the decimal separator is file-specified, so I want my program to temporarily change the decimal separator.
LC_NUMERIC expects a locale name... but I don't want to give it a locale name, but the separator character directly.
How can this be done?
Well, I'm afraid you won't like the solution :)
First of all, since you're operating with setlocale, you have to supply a locale name. So a locale whose LC_NUMERIC category has the properties you want must exist at the time the program runs, which means you need to define a new locale. You can define it with localedef. You may use this doc as a guide for making and using a new locale, and this site to get source files which you can use as a template for your custom locale definition.
When we invoke system call in linux like 'open' or stdio function like 'fopen' we must provide a 'const char * filename'. My question is what is the encoding used here? It's utf-8 or ascii or iso8859-x? Does it depend on the system or environment setting?
I know in MS Windows there is a _wopen which accept utf-16.
It's a byte string, the interpretation is up to the particular filesystem.
Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).
This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.
It depends on the system locale. Look at the output of the "locale" command. If the variables end in UTF-8, then your locale is UTF-8. Most modern Linux distributions use UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.
The filename is the byte string; regardless of locale or any other conventions you're using about how filenames should be encoded, the string you must pass to fopen and to all functions taking filenames/pathnames is the exact byte string for how the file is named. For example if you have a file named ö.txt in UTF-8 in NFC, and your locale is UTF-8 encoded and uses NFC, you can just write the name as ö.txt and pass that to fopen. If your locale is Latin-1 based, though, you can't pass the Latin-1 form of ö.txt ("\xf6.txt") to fopen and expect it to succeed; that's a different byte string and thus a different filename. You would need to pass "\xc3\xb6.txt" ("ö.txt" if you interpret that as Latin-1), the same byte string as the actual name.
This situation is very different from Windows, which you seem to be familiar with, where the filename is a sequence of 16-bit units interpreted as UTF-16 (although AFAIK they need not actually be valid UTF-16) and filenames passed to fopen, etc. are interpreted according to the current locale as Unicode characters which are then used to open/access the file based on its UTF-16 name.
As already mentioned above, this will be a byte string and the interpretation will be open to the underlying system. More specifically, imagine two C functions, one in user space and one in kernel space, which take char * as their parameter. The encoding in user space will depend upon the execution character set of the user program (e.g. specified by -fexec-charset=charset in gcc). The encoding expected by the kernel function depends upon the execution charset used during kernel compilation (not sure where to get that information).
I did some further inquiries on this topic and came to the conclusion that there are two different ways how filename encoding can be handled by unixoid file systems.
File names are encoded in the "system locale", which usually is, but need not be, the same as the current environment locale reported by the locale command (it may instead be a preset in a global configuration file).
File names are encoded in UTF-8, independent from any locale settings.
GTK+ solves this mess by assuming UTF-8 and allowing it to be overridden by either the current locale encoding or a user-supplied encoding.
Qt solves it by assuming locale encoding (and that system locale is reflected in the current locale) and allowing it to be overridden with a user-supplied conversion function.
So the bottom line is: Use either UTF-8 or what LC_ALL or LANG tell you by default, and provide an override setting at least for the other alternative.