How are locale files initially read by the C library - c

I am writing my own Posix C library from scratch and I have hit a stumbling block when it comes to internationalization and ctype's. I see in the POSIX standard several functions for the end user programs to set and access locales in the locale.h header but not how to initially store the locale information from the locale file for the libraries use.
Is this just some nonstandard library internal custom to each implimentation?

POSIX specifies the optional localedef utility and a locale source format it can read and convert to whatever data format your implementation uses internally. If you opt to support localedef, then the source structure for locales is data in the localedef format, but you can design whatever intermediary format you like for easy/efficient/whatever access at runtime.
Otherwise, if you're not supporting localedef, how you implement locale is completely up to you. POSIX specifies how various interfaces behave, but not how you achieve those features, nor what degrees of freedom locales might vary by. It's possible for a conforming implementation to have nothing but the C/POSIX locale.

Related

Which components use the locale variables?

I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I want to know who actually uses these locale variables.
Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?
Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.
By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:
Character handling
Collating
Date/time formatting
Numeric editing
Monetary formatting
Messaging
POSIX setlocale documentation contains an incomplete list of locale-dependent functions affected by it:
catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb
E.g.:
printf("%'d\n", 1000000000);
printf("Setting LC_ALL to %s\n", getenv("LANG"));
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000);
Outputs:
1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
I have read that every process has a set of locale variables associated with it.
That's not really true, or at least it is highly over-simplified.
Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)
Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:
setlocale, which modifies the library's locale configuration, and
localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).
The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library calls these "facets", which is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)
The standard categories are:
LC_CTYPE: Character classification and case conversion.
LC_COLLATE: Collation order.
LC_MONETARY: Monetary formatting.
LC_NUMERIC: Numeric, non-monetary formatting.
LC_TIME: Date and time formats.
and the Posix extension is:
LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.
Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, there is no way to query any specific configuration.
Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:
The C standard defines the locale name C.
Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.
The details for locale-naming are (or should be) detailed in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)
The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).
The category names shown in the list of categories above are macros defined in <locale.h>. An additional macro, LC_ALL, is also defined by that header file: LC_ALL. One of these macros must be used as the first argument to setlocale.
The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.
But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.
(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)
Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.
On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:
First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.
Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)
Finally, some system-wide default is used.
Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)
As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.
Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned int) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.
Many programs set the locale, and use it at least for internationalization. Some specific examples:
LANG="en_GB.UTF-8"
This is the locale for any category you didn’t specifically set to something else. It allows the system to add new locale variables in a backward-compatible way.
LC_COLLATE="en_GB.UTF-8"
This selects which language’s sorting order is used on strings. For example, Ch is considered a letter in Spanish and would come after Cz. One C library function that uses it is strcoll(), and POSIX commands that do include ls (when you sort files by name) and sort.
LC_CTYPE="en_GB.UTF-8"
This determines the current character encoding. In C11, you can set this and then use wide-character input and output, such as wprintf(). The library will transparently convert between wide characters and the character set used by the outside world. This still doesn’t quite work on Windows, unless you do some extra magic, but elsewhere, UTF-8 has become the standard. An increasing number of programs, such as clang (as of version 7), no longer support anything but UTF-8.
LC_MESSAGES="en_GB.UTF-8"
This determines what language and character set you see localized messages in. In C on Unix/Linux, these would typically be loaded from a .po file by the gettext library.
LC_MONETARY="en_GB.UTF-8"
This affects how strfmon() formats monetary quantities.
LC_NUMERIC="en_GB.UTF-8"
This determines the formatting of numbers that aren’t amounts of money.
LC_TIME="en_GB.UTF-8"
This affects the formatting of time. Try LC_TIME=fr_FR.UTF-8 date in the shell to see an example. (Or use locale -a | grep UTF to select some suitably-exotic locale.) Also a good test of whether your timezone and ntpd are working properly.
LC_ALL=
Use LANG instead of this. It sets every locale category at once, but it overrides the values in all the other locale variables. It exists for backward compatibility.
For example, I use LANG=en_US.utf8 on my Linux box, but I override LC_TIME=en_GB.utf8 to get 24-hour time in English. This would not be possible to do if LC_ALL were set.
LANG also allows your defaults to carry over into whatever other locale information your system supports, such as LC_ADDRESS, LC_IDENTIFICATION, LC_RESPONSE, LC_MEASUREMENT and LC_TELEPHONE.

<search.h> header file not available

I can't find the search.h header file mentioned in spell.c, and hence the compiler can't find hcreate(), hsearch() and ENTRY.
Ref:
http://marcelotoledo.com/how-to-write-a-spelling-corrector/
https://github.com/marcelotoledo/spelling_corrector/blob/master/spell.c
The <search.h> header is a POSIX-standard header — and the library functions it declares include:
hash search (hsearch())
linear search (lsearch())
tree search (tsearch())
Those pages each list the set of relevant functions for a particular search. Note that binary search, aka bsearch(), is defined by the C standard rather than POSIX.
The functions were part of Unix SVR4 (and possibly other System V versions), and made it into the Single Unix Specification and hence POSIX too.
If your system doesn't support the header, then it isn't strictly POSIX compliant. You can certainly find implementations of the functions on the web (BSD, Linux — and probably other places too). You may be able to find a version to download for your system. (Macs have it already; I'd expect to find AIX, HP-UX, Solaris include it by default, too.)

How to get DateSeparator in C

I wish to get the date separator accordingly to the system's format's settings.
In Delphi I'm using System.SysUtils.TFormatSettings.DateSeparator, is there something like this in C?
The C language and standard library do not provide any such information. This information can be obtained from whatever system you are targeting. How you do that is dependent on which system you target.
While it is true as stated by the other answers that the C standard does not provide this information, posix does. So on any posix compliant system you can use the nl_langinfo api to get locale information.
You will not get the date separator though, but the date format that can be used by strftime to display the correctly formatted date for the locale. You can always parse this to get the date separator if that's what you really need, but for most uses the actual date format will probably be more useful.
Posix support for Windows is flakey, so you probably have to treat it differently, but for pretty much any other platform this should help you.
The C language and standard library do not provide any such
information. This information can be obtained from whatever system you
are targeting.
Thanks for clarification, targets could be both linux and windows. In those cases, how to do what I asked?
You need different functions for different OS.
e.g.
GetDataSeparator()
{
#ifdef _WIN32
Here goes implementation for Windows
#else
May be more preprocessor commands and implementations for Linux/Mac
#endif
}
Not sure if I understood everything correctly.
For Windows you can parse values found in HKCU\Control Panel\International. For Linux this link might be helpful.

How to deal with Unicode paths in a cross-platfrom C library?

I'm contributing to a C library. It has a function that takes a char* parameter for a file path name. The authors are mostly UNIX developers, and this works fine on unixes where char* mostly means UTF-8. (At least in GCC, the character set is configurable and UTF-8 is the default.)
However, char* means ANSI on Windows, which implies that it is currently impossible to use Unicode path names with this library on Windows, where wchar_t* should be used and only UTF-16 is supported. (A quick search on StackOverflow reveals that the ANSI Windows API functions can not be used with UTF-8.)
The question is, what is the right way to deal with this? We've come up with various ways to do it, but neither of us are Windows experts, so we can't really decide how to do it properly. Our goal is that the users of the library should be able to write cross-platform code that would work on unixes as well as windows.
Under the hood, the library has #ifdefs in place to differentiate between operating systems so that it can use POSIX functions on UNIXes and Win32 APIs on Windows.
So far, we've come up with the following possibilities:
Offer a separate windows-only function that accepts a wchar_t*.
Require UTF-16 on Windows and #ifdef the library header in such a way that the function would accept wchar_t* on Windows.
Add a flag that would tell the function to cast the given char* to wchar_t* and call the widechar Windows APIs.
Create a variant of the function that takes a file descriptor (or file handle on Windows) instead of a file path.
Always require UTF-8 (even on Windows), and then inside the function, convert UTF-8 to UTF-16 and call the widechar Windows APIs.
The problem with options 1-4 is that they would require the user to consciously take care of portability themselves. Option 5 sounds good, but I'm not sure if this is the right way to go.
I'm also open to other suggestions or ideas that can solve this. :)
Since portability is an important goal for you, I think it is imperative for your function semantics to be precisely defined. Among other things, that means that the arguments' types and meanings don't vary across platforms. So, if you have a function that accepts regular char based paths then it should accept such paths on all systems, and the encoding expected of those paths should be well-defined (which does not necessarily mean "the same"). That rules out options (2) and (3).
Moreover, portability requires the same functions to be usable across all platforms; that rules out (1). Option (4) could be ok if a stream- and/or file descriptor-based approach were the only one provided by your library, but it yields portability only with respect to those functions, not with respect to the path-based ones. (And note that stream (FILE *) APIs are defined by C, whereas file descriptors are a POSIX concept, not native to C. In principle, therefore, streams are more portable than file descriptors.)
(5) could work, but it places stronger constraints than you actually need. It is not essential for the function to define the encoding expected (though it can); it suffices for it to define how that encoding is determined.
Additionally, you could add wchar_t-based functions that work everywhere (as opposed to Windows-only). Those might be more convenient for Windows users. Similar to alternative (4), however, that provides portability only with respect to those functions. Supposing that you don't want to drop the char-based ones, you would need to pair this alternative with some variation on (5).

Adding unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in utf8 encoded. Based on feedback it seems windows usually provides 3 function versions.
fooA() ANSI encoded strings
fooW() Unicode encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.
I would like to keep support for utf8 strings as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
For the public interface of your library, stick to exactly one encoding, either UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems. I think the choice between the three Unicode encodings depends on which OS you expect the library will be used most: If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions. On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit-wide character unit), but see the note above regarding libraries. OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: The words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance to what the international standard say. When Microsoft speaks of "Unicode" or "wide characters", they mean "UTF-16" or (historically) the BMP subset thereof (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, then it is already fully Unicode enabled, as UTF-8 is a loss-less Unicode encoding. If you are wanting to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, then it could simply encode to, and decode from, UTF-8 when talking with your library. Otherwise, your library would have to expose extra UTF-16/32 functions that do those encoding/decoding operations internally.

Resources