I'm new to programming, and when I try to write a C program that contains Turkish characters, they are not displayed correctly in the terminal. Using setlocale(LC_ALL, "Turkish"); works, but do I have to write this call every time I write a new program? Is there a way to "force" this by default so I can skip this step?
The C Standard §7.11.1.1 The setlocale function ¶3-4 says:
A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the locale-specific native environment. Other implementation-defined strings may be passed as the second argument to setlocale.
At program startup, the equivalent of
setlocale(LC_ALL, "C");
is executed.
Consequently, if you want a different locale to be in use, you have to override this with an explicit call to setlocale().
The POSIX specification of setlocale() conforms to the C standard but also extends it, specifying that the second argument can be:
"POSIX"
[CX] ⌦ Specifies the minimal environment for C-language translation called the POSIX locale. The POSIX locale is the default global locale at entry to main(). ⌫
"C"
Equivalent to "POSIX".
""
Specifies an implementation-defined native environment. [CX] ⌦ The determination of the name of the new locale for the specified category depends on the value of the associated environment variables, LC_* and LANG; see XBD Locale and Environment Variables. ⌫
A null pointer
Directs setlocale() to query the current global locale setting and return the name of the locale if category is not LC_ALL, or a string which encodes the locale name(s) for all of the individual categories if category is LC_ALL.
This means that if you write setlocale(LC_ALL, ""); at the start of main(), all users of the program stand the maximum chance of being able to use their preferred locale.
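So the call still has to appear once per program, but it does not need to name Turkish explicitly. A minimal sketch, assuming a hosted C implementation; the helper name use_native_locale is made up for this illustration:

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch from the startup "C" locale to the user's
 * native locale. Returns the name reported by setlocale(), falling back
 * to "C" if the environment names a locale that is not available. */
const char *use_native_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL) {
        /* The request failed and the locale is unchanged; make that explicit. */
        name = setlocale(LC_ALL, "C");
    }
    return name;
}
```

Calling use_native_locale() as the first statement of main() would pick up a Turkish locale whenever the user's environment (for example LANG) selects one, without hard-coding its name.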
Related
I need to read data from the command line and store it in UTF-8. To do that, my approach is to determine which charset the command-line shell is using by retrieving the current locale. (Of course, if you see a better approach, please share your thoughts!)
What values should be expected when trying to detect the LC_CTYPE value for the active locale?
I am using the function below, which expects to get either a string like 'POSIX' or 'C', or something like 'en_US.UTF-8'.
Does anyone know if there are other possible situations (i.e. possible values)?
(My concern being to make sure I handle all cases)
#include <locale.h>
#include <string.h>

/* Retrieve the current charset using the setlocale function.
 * @return A string holding the name of the current charset, or NULL on error.
 */
char *get_charset(void) {
    /* read the environment locale for the LC_CTYPE category */
    if (setlocale(LC_CTYPE, "") == NULL) {
        return NULL; /* the environment names a locale that is not available */
    }
    char *locale = setlocale(LC_CTYPE, NULL);
    if (locale == NULL) {
        return NULL;
    }
    /* return the codeset (last block of chars preceded by a dot) */
    char *dot = strrchr(locale, '.');
    return dot != NULL ? dot + 1 : locale;
}
Actually, POSIX defines a "Portable character set" which is a subset of ASCII, and that is supposed to be part of any standard-compliant character set.
As for the setlocale() function, the official GNU documentation states that when XPG syntax is not used by the platform (i.e. OS), "C" is the fallback value and means "POSIX compliant".
Besides, the returned value is a char pointer (char*), so the result should always be either a string or NULL.
So, here are the answers to the question:
Yes. The given snippet should cover all situations.
If the idea is to store the result in UTF-8, no conversion is required when get_charset() returns 'C', since that means the charset in use is compatible with ASCII, which in turn is compatible with UTF-8.
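One caveat about parsing the locale name: in the XPG syntax a name may also carry a modifier after the codeset (for example "de_DE.ISO-8859-15@euro"), which plain string splitting would return as part of the charset. On POSIX systems, a possible alternative is to ask the C library directly with nl_langinfo(CODESET). A minimal sketch, assuming a POSIX platform; the helper name current_codeset is invented for this example:

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Sketch of an alternative to parsing the locale name: ask the C library
 * for the codeset of the current LC_CTYPE locale. POSIX-only (langinfo.h). */
const char *current_codeset(void)
{
    /* Adopt the locale from the environment; if this fails, the previous
     * locale stays in effect and is what nl_langinfo() reports on. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The exact strings are platform-dependent (glibc, for instance, reports "ANSI_X3.4-1968" for the C locale and "UTF-8" for UTF-8 locales), so the result is best compared against a small set of known aliases.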
Chapter 8 of the POSIX standard defines a list of commonly used environment variables "that are frequently exported by widely used command interpreters and applications".
However, I cannot find any C header providing their names on any of my Unix-like systems.
I'm looking for something like:
#define ENV_PATH "PATH"
#define ENV_USER "USER"
#define ENV_IFS "IFS"
...
Where can I find such a header? Any OS-specific header would work: I just don't want to invent names for the constants myself.
edit
If you are used to only mainstream operating systems, you might ask: why do you want to use constants here? $PATH is always $PATH everywhere!
This is not actually true.
In Plan 9 from Bell Labs, environment variables are usually lowercase (apparently due to aesthetics).
In Jehanne, a new operating system derived from Plan 9, I'm reconsidering this design choice to ease the integration of POSIX tools. However, since I like the lowercase environment variables, I'd like to be able to easily switch back to lowercase names when Jehanne becomes "the one true operating system" :-D
As stated in the comments, there is no header file that provides any POSIX-specified list of environment variables used by applications and utilities.
A list of "certain variables that are frequently exported by widely used command interpreters and applications" can be found at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08. The list is reproduced here:
It is unwise to conflict with certain variables that are frequently
exported by widely used command interpreters and applications:
ARFLAGS   IFS          MAILPATH    PS1
CC        LANG         MAILRC      PS2
CDPATH    LC_ALL       MAKEFLAGS   PS3
CFLAGS    LC_COLLATE   MAKESHELL   PS4
CHARSET   LC_CTYPE     MANPATH     PWD
COLUMNS   LC_MESSAGES  MBOX        RANDOM
DATEMSK   LC_MONETARY  MORE        SECONDS
DEAD      LC_NUMERIC   MSGVERB     SHELL
EDITOR    LC_TIME      NLSPATH     TERM
ENV       LDFLAGS      NPROC       TERMCAP
EXINIT    LEX          OLDPWD      TERMINFO
FC        LFLAGS       OPTARG      TMPDIR
FCEDIT    LINENO       OPTERR      TZ
FFLAGS    LINES        OPTIND      USER
GET       LISTER       PAGER       VISUAL
GFLAGS    LOGNAME      PATH        YACC
HISTFILE  LPDEST       PPID        YFLAGS
HISTORY   MAIL         PRINTER
HISTSIZE  MAILCHECK    PROCLANG
HOME      MAILER       PROJECTDIR
To access the value of an environment variable, use the getenv() function.
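Since no system header supplies the names, the constants have to be defined by the application. A minimal sketch: the ENV_* macros are the ones proposed in the question (not from any real header), and env_or_default is a hypothetical helper added for illustration:

```c
#include <stdlib.h>

/* Names taken from the question's wish list; they do not come from any
 * system header and could be switched to lowercase on another OS. */
#define ENV_PATH "PATH"
#define ENV_USER "USER"

/* Hypothetical helper: look up a variable, returning fallback if it is unset. */
const char *env_or_default(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}
```

A call such as env_or_default(ENV_USER, "unknown") then keeps the variable's spelling in one place.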
The exec() function documentation specifies the char **environ variable:
In addition, the following variable, which must be declared by the
user if it is to be used directly:
extern char **environ;
is initialized as a pointer to an array of character pointers to the
environment strings. The argv and environ arrays are each terminated
by a null pointer. The null pointer terminating the argv array is not counted in argc.
Applications can change the entire environment in a single operation
by assigning the environ variable to point to an array of character
pointers to the new environment strings. After assigning a new value
to environ, applications should not rely on the new environment
strings remaining part of the environment, as a call to getenv(),
putenv(), setenv(), unsetenv(), or
any function that is dependent on an environment variable may, on
noticing that environ has changed, copy the environment strings to a
new array and assign environ to point to it.
Any application that directly modifies the pointers to which the
environ variable points has undefined behavior.
Conforming multi-threaded applications shall not use the environ
variable to access or modify any environment variable while any other
thread is concurrently modifying any environment variable. A call to
any function dependent on any environment variable shall be considered
a use of the environ variable to access that environment variable.
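Reading through environ without modifying it might look like the following sketch. It assumes a POSIX system; count_env_with_prefix is a name invented here:

```c
#include <stddef.h>
#include <string.h>

/* POSIX requires the program to declare environ itself. */
extern char **environ;

/* Hypothetical helper: count the entries whose "NAME=value" string starts
 * with prefix, reading the environment without modifying it. */
size_t count_env_with_prefix(const char *prefix)
{
    size_t n = strlen(prefix);
    size_t count = 0;
    for (char **entry = environ; entry != NULL && *entry != NULL; entry++) {
        if (strncmp(*entry, prefix, n) == 0) {
            count++;
        }
    }
    return count;
}
```

For example, count_env_with_prefix("LC_") reports how many LC_* variables are currently exported.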
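You can do something like the following and modify whatever you want in the get_env_variables function. Use something like strncmp to check whether a given variable is one you want to modify. Note that the three-argument form of main is a widespread extension rather than standard C, and that the passage quoted above makes directly modifying the pointers in the environment array undefined behavior, so prefer setenv() in portable code.
#include <stddef.h>

char *get_env_variables(char *str);

int main(int ac, char **av, char **env) {
    (void)ac;
    (void)av;
    for (int i = 0; env[i] != NULL; i++) {
        env[i] = get_env_variables(env[i]);
    }
    return 0;
}

char *get_env_variables(char *str) {
    // PUT SOME CODE HERE, e.g. compare the "NAME=" prefix with strncmp()
    return str; // return the (possibly new) entry
}
EDIT: don't forget to return the new env[i].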
I am using the C JSON library under Ubuntu (json-c/json.h). I need to parse JSON strings on multiple POSIX threads. I am currently using the json_tokener_parse() function. Is it thread-safe, or do I need to use something else?
Thanks
I looked through the code: https://github.com/json-c/json-c/blob/master/json_tokener.c
It appears to be thread-safe with one exception:
#ifdef HAVE_SETLOCALE
char *oldlocale=NULL, *tmplocale;
tmplocale = setlocale(LC_NUMERIC, NULL);
if (tmplocale) oldlocale = strdup(tmplocale);
setlocale(LC_NUMERIC, "C");
#endif
So if HAVE_SETLOCALE is defined (and it probably will be), setlocale() will be called and it will set the process-wide LC_NUMERIC to "C". And of course it undoes this at the end. This will cause problems if your LC_NUMERIC is not "C" or a compatible locale at the beginning, because one thread will "restore" your locale while another one may still be parsing and expecting the "C" locale to be in effect.
Fortunately it is guaranteed that the locale will be "C" on program start, so you just need to make sure that neither you nor any other library you're using sets LC_NUMERIC (or LC_ALL of course) to a locale incompatible with "C". You could then rebuild the library with HAVE_SETLOCALE undefined if you want, but this probably doesn't matter, as its calls to setlocale() will have no real effect.
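For comparison, POSIX.1-2008 offers per-thread locales through newlocale() and uselocale(), which avoid touching the process-wide setting at all. The sketch below is not json-c code; it only illustrates that alternative for a single number, assumes a glibc-style system where these functions are declared in locale.h, and uses an invented helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Sketch: parse a number under the "C" numeric rules by switching only the
 * calling thread's locale, instead of the process-wide locale that
 * setlocale() changes. Returns 0 on success, -1 on failure. */
int parse_double_c_locale(const char *text, double *out)
{
    locale_t c_locale = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0) {
        return -1;
    }
    locale_t previous = uselocale(c_locale); /* affects this thread only */
    char *end = NULL;
    *out = strtod(text, &end);
    uselocale(previous);                     /* restore before freeing */
    freelocale(c_locale);
    return end != text ? 0 : -1;
}
```

Because uselocale() only changes the calling thread, one thread restoring its locale cannot disturb another thread that is still parsing.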
I have an issue with fgetws and wprintf.
NULL is returned when a special character is found in the file opened earlier. I don't have this problem with fgets.
I tried to use setlocale, as recommended here: fgetws fails to get the exact wide char string from FILE*
but it doesn't change anything.
Moreover, wprintf(L"éé"); prints ?? in the terminal (on Ubuntu 12), and I don't have this problem with printf either. What can be done to avoid this?
Edit: as requested in the comments, here is the very simple code:
#include "sys.h"
#define MAX_LINE_LENGTH 1024
int main (void){
    FILE *File = fopen("D.txt", "r");
    wchar_t line[MAX_LINE_LENGTH];
    if (File == NULL)
        return 1;
    while (fgetws(line, MAX_LINE_LENGTH, File))
        wprintf(L"%ls", line);
    fclose(File);
    return 0;
}
By default, when a program starts, it is running in the C locale, which is not guaranteed to support any characters except those needed for translating C programs. (It can contain more as an implementation detail, but you cannot rely on this.) In order to use wchar_t to store other characters and process them with the wide character conversion functions or wide stdio functions, you need to set a locale in which those characters are supported.
The locales available, and how they are named, vary by system, so you should not attempt to set a locale by name. Instead, pass "" to setlocale to request the "default" locale for the user or the system. On POSIX-like systems, this uses the LANG and LC_* environment variables to determine the preferred locale. As long as the characters you're trying to use exist in the user's locale, your wprintf should work.
The call to setlocale should look like:
setlocale(LC_CTYPE, "");
or:
setlocale(LC_ALL, "");
The former only applies the locale settings to character encoding/character type functions (things that process wchar_t). The latter also causes locale to be set for, and affect, a number of other things like message language, formatting of numbers and time, ...
One detail to note is that wide stdio functions bind the character encoding of the locale that's in use at the time the stream "becomes wide-oriented", i.e. on the first wide operation that's performed on it. So you need to call setlocale before using wprintf.
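Putting these points together, a corrected version of the loop could look like the sketch below. The helper name copy_wide_lines and its return convention are assumptions made for this example:

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAX_LINE_LENGTH 1024

/* Hypothetical helper: copy a text file to out using wide stdio.
 * Returns the number of successful fgetws() reads, or -1 if the file
 * cannot be opened or a read/conversion error occurs. */
long copy_wide_lines(const char *path, FILE *out)
{
    /* Must run before the first wide operation on either stream. */
    setlocale(LC_CTYPE, "");

    FILE *in = fopen(path, "r");
    if (in == NULL) {
        return -1;
    }
    wchar_t line[MAX_LINE_LENGTH];
    long reads = 0;
    while (fgetws(line, MAX_LINE_LENGTH, in) != NULL) {
        fwprintf(out, L"%ls", line);
        reads++;
    }
    /* fgetws() also returns NULL on an encoding error, not only at EOF. */
    int failed = ferror(in);
    fclose(in);
    return failed ? -1 : reads;
}
```

Calling copy_wide_lines("D.txt", stdout) corresponds to the program in the question. Checking ferror() afterwards separates a real end of file from a byte sequence that is invalid in the current locale.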
What is the meaning of setlocale()'s default setting? setlocale() defaults to "C" ("POSIX"). But what does that mean exactly? What are its default charset and language? Is it "en_US.utf8"?
From N1570:
7.11.1.1 The setlocale function
3 A value of "C" for locale specifies the minimal environment for C translation; a value
of "" for locale specifies the locale-specific native environment. Other
implementation-defined strings may be passed as the second argument to setlocale.
Also, from footnote 222:
222) ISO/IEC 9945−2 specifies locale and charmap formats that may be used to specify locales for C.
This gives you an idea (since a footnote is strictly not part of the normative part of the standard) what "C" means in this context.
The charset for locale "C" is required to contain all of the 7-bit ASCII characters, with the collating sequence based only on ASCII character codes. No characters outside ASCII are required. If the text being processed includes any characters outside that limited set, the behavior is undefined. As for language, all the standard definitions in http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html correspond to US English.
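The collation rule can be observed with strcoll(). A small sketch, assuming a POSIX system where "C" and "POSIX" name the same locale; the helper name is invented for this example, and the call changes the process-wide LC_COLLATE setting:

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: compare two strings under the "C" collation rules.
 * In the POSIX ("C") locale, strcoll() orders strings by character code,
 * so the result is expected to agree in sign with strcmp(). */
int collate_in_c_locale(const char *a, const char *b)
{
    setlocale(LC_COLLATE, "C"); /* process-wide change */
    return strcoll(a, b);
}
```

Under these rules "B" (code 66) sorts before "a" (code 97), whereas many natural-language locales would place "a" first.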