Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). The result is empty: only the letter 'я' misbehaves.
Other letters are classified as lowercase or uppercase just fine.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = 'я';
    int i;
    char z;

    for (i = 7; i >= 0; i--) {
        z = 1 << i;
        if ((z & c) == z) printf("1"); else printf("0");
    }
    printf("\n");
    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}
Why does neither islower() nor isupper() work on the letter 'я'?
The answer is that the CP1251 encoding of the lowercase version of that character is decimal 255, and because plain char is signed on your implementation, that value reaches islower() and isupper() as -1, a value commonly interpreted as EOF, which they do not classify.
You need to track down the source code for the runtime library to see what it does and why.
The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
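For example, a minimal wrapper might look like this (a sketch; the helper names are mine, not from any library). The cast ensures values such as 0xFF reach the classifier as 255 rather than as -1:

#include <ctype.h>

/* Sketch of safe wrappers: the cast prevents a negative char (e.g. 0xFF
   sign-extended to -1, the usual value of EOF) from reaching islower()/isupper(). */
static int is_lower_ch(char c) { return islower((unsigned char)c); }
static int is_upper_ch(char c) { return isupper((unsigned char)c); }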
Igor, if your file is UTF-8, it makes no sense to use code page 1251, which has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other hand, the -fexec-charset=cp1251 option only affects the characters stored in the executable; you have not specified the input charset to assume for string literals in your source code. The compiler is probably determining that from the environment (which you have set via your LANG or LC_CTYPE environment variables).
Only once you control exactly which encoding is used at each stage will you get coherent results.
The main reason an effort is being made to switch all countries to a common character set (Unicode, usually as UTF-8) is precisely to avoid dealing with all these locale settings at each stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer; but when you receive a document encoded in UTF-8, you'll have to convert it to be able to view it correctly.
I strongly recommend you switch to UTF-8, as it is an encoding that supports every country's character set, but at this moment that decision is only yours.
NOTE
On debian linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>

#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)

int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
Execution with the ru_RU.CP1251 locale:
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So glibc is not at fault; the fault is in your code.
Jonathan Leffler's first comment on the question is right: the isxxx() (and iswxxx()) functions are required to handle an EOF (WEOF) argument (probably to be fool-proof).
This is why int was chosen as the argument type. When we pass an argument of type char, or a character literal, it is promoted to int, preserving the sign. And because char and character literals are signed by default in gcc, 0xFF becomes -1, which by unhappy coincidence is the value of EOF.
Therefore, always cast explicitly when passing values of type char (and character literals with code 0xFF) to functions that take an int argument; don't count on char being unsigned, because its signedness is implementation-defined. The cast can be either to (unsigned char) or to (uint8_t), which is less to type (you must include stdint.h).
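Applied to the program from the question, the fix is just the cast at the call sites (a minimal sketch; I spell the character as '\xff' so the snippet does not depend on the source charset):

#include <locale.h>
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = '\xff'; /* 'я' in CP1251 */
    /* Cast so 0xFF reaches islower()/isupper() as 255, not as -1 (EOF). */
    if (islower((unsigned char)c))
        printf("lowercase\n");
    if (isupper((unsigned char)c))
        printf("uppercase\n");
    return 0;
}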
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
Related
I am trying to draw a square with a given width and height, using the box-drawing characters from Unicode. I am using this code:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"

#define VERTICAL_PIPE L"║"
#define HORIZONTAL_PIPE L"═"
#define UP_RIGHT_CORNER L"╗"
#define UP_LEFT_CORNER L"╔"
#define DOWN_RIGHT_CORNER L"╝"
#define DOWN_LEFT_CORNER L"╚"

// Function to print the top line
void DrawUpLine(int w){
    setlocale(LC_ALL, "");
    wprintf(UP_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(UP_RIGHT_CORNER);
}

// Function to print the sides
void DrawSides(int w, int h){
    setlocale(LC_ALL, "");
    for (int i = 0; i < h; i++)
    {
        wprintf(VERTICAL_PIPE);
        for (int j = 0; j < w; j++)
        {
            putchar(' ');
        }
        wprintf(VERTICAL_PIPE);
        putchar('\n');
    }
}

// Function to print the bottom line
void DrawDownLine(int w){
    setlocale(LC_ALL, "");
    wprintf(DOWN_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(DOWN_RIGHT_CORNER);
}

void DrawFrame(int w, int h){
    DrawUpLine(w);
    putchar('\n');
    DrawSides(w, h);
    putchar('\n');
    DrawDownLine(w);
}
But when I run this code with some int values, I get output with seemingly random spaces and newlines (although the pipes seem to be in the correct order).
It is being called from main.c via the header, like so:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"

int main(){
    DrawFrame(10, 20); // Calling the function
    return 0;
}
Also, as you can see, I don't understand the correct use of setlocale: do you need to call it only once, or more than once?
Any help appreciated, thanks in advance!
Also, as you can see, I don't understand the correct use of setlocale: do you need to call it only once, or more than once?
Locale changes applied via setlocale() persist for the calling process, so you do not need to call the function more than once unless you want to make multiple changes. But you do need to pass it a locale name that serves your intended purpose, or, if you call it with an empty string, then you (or the program's user) must ensure that the environment variables defining the various locale categories are set to values that suit the purpose.
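In your program, for instance, you could drop the setlocale() calls from the drawing functions and make one call at the top of main(), before any output (a sketch reusing the names from your question):

#include <locale.h>
#include "string_prints.h"

int main(void){
    setlocale(LC_ALL, ""); // one call, before any locale-dependent output
    DrawFrame(10, 20);
    return 0;
}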
But when I am running this code with some int values I get an output with seemingly random spaces and newlines.
That sounds like the result of a character-encoding mismatch, or even two (but see also below):
there can be a runtime mismatch because the locale you tell the program to use for output does not match the one expected by the output device (e.g. a terminal) with which the program's output is displayed, and
there can also be a compile time mismatch between the actual character encoding of your source file and the encoding the compiler interprets it as having.
Additionally, use of wide string literal syntax notwithstanding, it is implementation-dependent which characters other than C's basic set may appear in your source code. The wide syntax specifies mostly the form of the storage for the literal (elements of type wchar_t), not so much what character values are valid or how they are interpreted.
Note also that the width of wchar_t is implementation-dependent, and it can be as small as eight bits. It is not necessarily the case that a wchar_t can represent arbitrary Unicode characters; in fact, it is pretty common for wchar_t to be 16 bits wide, which isn't wide enough for the majority of characters in Unicode's 21-bit code space. You might get an internal representation of wider characters in a two-unit form, such as a UTF-16 surrogate pair, but you also might not; a great deal of this is left to individual implementations.
Among those things, what encoding the compiler expects, under what circumstances, and how you can influence that are all implementation-dependent. For GCC, for instance, the default source ("input") character set is UTF-8, and you can define a different one via its -finput-charset option. You can also specify both a standard and a wide execution character set via the -fexec-charset and -fwide-exec-charset options, if you wish to do so. GCC relies on iconv for conversions, both at compile time (source charset to execution charset) and at runtime (from execution charset to locale charset). Other implementations have other options (or none), with their own semantics.
So what should you do? In the first place, I suggest taking the source character set out of the equation by using UTF-8 string literals expressed with only the basic character set (requires C2011):
#define VERTICAL_PIPE u8"\xe2\x95\x91"
#define HORIZONTAL_PIPE u8"\xe2\x95\x90"
#define UP_RIGHT_CORNER u8"\xe2\x95\x97"
#define UP_LEFT_CORNER u8"\xe2\x95\x94"
#define DOWN_RIGHT_CORNER u8"\xe2\x95\x9d"
#define DOWN_LEFT_CORNER u8"\xe2\x95\x9a"
Note well that the resulting strings are normal, not wide, so you should not use the wide-oriented output functions with them. Instead, use the normal printf, putchar, etc.
And that brings us to another issue with your code: you must not mix wide-oriented and byte-oriented functions writing to the same stream without taking explicit measures to switch (freopen or fwide; see paragraph 7.21.2/4 of the standard). In practice, mixing the two can quite plausibly produce mangled results.
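Putting both points together, the drawing functions can use byte-oriented output exclusively. For example, DrawUpLine might look like this (a sketch using the u8 macros above, with the setlocale() call hoisted to main() as discussed):

#include <stdio.h>

// Byte-oriented version: plain printf with the UTF-8 (u8) literals.
void DrawUpLine(int w){
    printf(UP_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        printf(HORIZONTAL_PIPE);
    }
    printf(UP_RIGHT_CORNER);
}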
Then also ensure that your locale environment variables are set correctly for your actual environment. Chances are good that they already are, but it's worth a check.
I have to save the letter ñ in a char[] and I'm not able to do it. I tried doing this:
char example[1];
example[0] = 'ñ';
When compiling I get this:
$ gcc example.c
error: character too large for enclosing character literal type
    example[0] = 'ñ';
Does anyone know how to do this?
If you're using High Sierra, you are presumably using a Mac running macOS 10.13.3 (High Sierra), the same as me.
This comes down to code sets and locales, and it can get tricky. Mac terminals use UTF-8 by default, and ñ is Unicode character U+00F1, which requires two bytes, 0xC3 and 0xB1, to represent in UTF-8. The compiler is letting you know that one byte isn't big enough to hold two bytes of data. (In single-byte code sets such as ISO 8859-1 or 8859-15, ñ has character code 0xF1; the resemblance between 0xF1 and U+00F1 is not a coincidence, since Unicode code points U+0000 to U+00FF are the same as in ISO 8859-1. ISO 8859-15 is a more modern variant of 8859-1, with the Euro symbol € and 7 other variations from 8859-1.)
Another option is to change the character set that your terminal works with; you need to adapt your code to suit the code set that the terminal uses.
You can work around this by using wchar_t:
#include <wchar.h>

void function(void);

void function(void)
{
    wchar_t example[1];
    example[0] = L'ñ';
    putwchar(example[0]);
    putwchar(L'\n');
}

#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    function();
    return 0;
}
This compiles. If you omit the call to setlocale(LC_ALL, "");, it doesn't work as intended: it generates just the octal byte \361 (aka 0xF1) and a newline, which shows as ? on the terminal. With setlocale(), it generates two bytes (\303\261 in octal, aka 0xC3 and 0xB1) and you see ñ in the console output.
You can use "extended ascii". This chart shows that 'ñ' can be represented in extended ascii as 164.
example[0] = (char)164;
You can print this character just like any other character
putchar(example[0]);
As noted in the comments above, this will depend on your environment. It might work on your machine but not another one.
The better answer is to use Unicode, for example:
wchar_t example = L'\u00F1';
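Note that printing it still requires a locale and a wide-output call; a minimal complete sketch:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");       /* adopt the terminal's encoding */
    wchar_t example = L'\u00F1'; /* U+00F1 is ñ */
    wprintf(L"%lc\n", example);
    return 0;
}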
This really depends on which character set / locale you will be using. If you want to hardcode this as a Latin-1 character, this example program does that:
#include <stdio.h>

int main(void) {
    char example[2] = {'\xF1'};
    printf("%s", example);
    return 0;
}
This, however, results in this output on my system that uses UTF-8:
$ ./a.out
�
So if you want to use non-ASCII strings, I'd recommend not representing them as char arrays directly. If you really need to use char directly, the UTF-8 sequence for ñ is two chars wide and can be written as follows (again with a terminating '\0' for good measure):
char s[3] = {"\xC3\xB1"};
I have a problem trying to read extended ASCII chars in ncurses.
I have this program:
#include <ncurses.h>

int main(void) {
    initscr();
    int d = getch();
    mvprintw(0, 0, "letter: %c.", d);
    refresh();
    getch();
    endwin();
    return 0;
}
I build it with: gcc -lncursesw a.c
If I type a character in the 7bit ascii, like the 'e' char, I get:
letter: e.
And then I have to type another for the program to end.
If I type a character in the extended ascii, like the 'á' char, I get:
letter: .
and the program ends.
It's like the second byte is read as another character.
How can I get the correct char 'á'?
Thanks!
The characters that you want to type require the program to setup the locale. As described in the manual:
Initialization
The library uses the locale which the calling program has initialized. That is normally done with setlocale:
setlocale(LC_ALL, "");
If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been set up.
Past that, it is likely that your locale uses UTF-8. To work with UTF-8, you should compile and link against the ncursesw library.
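For example, something like this (with the library named after the source file, so the linker resolves the curses symbols):
gcc a.c -o a -lncursesw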
Further, the getch function only returns values for single-byte encodings, such as ISO-8859-1, which some people confuse with Windows cp1252, and thence to "Extended ASCII" (which says something about two fallacies not cancelling out). UTF-8 is a multibyte encoding. If you use getch to read that, you will get the first byte of the character.
Instead, to read UTF-8, you should use get_wch (unless you want to decode the UTF-8 yourself). Here is a revised program which does that:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>

int
main(void)
{
    wint_t value;

    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x.", value);
    refresh();
    getch();
    endwin();
    return 0;
}
I printed the result as a number, because printw does not know about Unicode values. printw uses the same C runtime support as printf, so you may be able to print the value directly. For instance, I see that POSIX printf has a formatting option for handling wint_t:
c
The int argument shall be converted to an unsigned char, and the resulting byte shall be written.
If an l (ell) qualifier is present, the wint_t argument shall be converted as if by an ls conversion specification with no precision and an argument that points to a two-element array of type wchar_t, the first element of which contains the wint_t argument to the ls conversion specification and the second element contains a null wide character.
Since ncurses works on many platforms, not all of those actually support the feature. But you can probably assume it works with the GNU C library: most distributions routinely provide workable locale configurations.
Doing that, the example is more interesting:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>

int
main(void)
{
    wint_t value;

    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x (%lc).", value, value);
    refresh();
    getch();
    endwin();
    return 0;
}
I would like to convert (transliterate) UTF-8 characters to be closest match in ASCII in C. Characters like ú is transliterated to u. I can do that with iconv, with iconv -f utf-8 -t ascii//TRANSLIT, on the command line.
In C, there is a function towctrans to do that, but I only found documentation about two possible transliterations: to lower case and to upper case (see man wctrans). According to the documentation, wctrans depends on LC_CTYPE. But what other functions (besides "tolower" and "toupper") are available for a specific LC_CTYPE value?
A simple example with towctrans and the basic toupper transliteration:
#include <stdio.h>
#include <wchar.h>
#include <wctype.h> /* needed for towctrans() and wctrans() */
#include <locale.h>

int main() {
    wchar_t frase[] = L"Amélia";
    int i;

    setlocale(LC_ALL, "");
    for (i = 0; i < wcslen(frase); i++) {
        printf("%lc --> %lc\n", frase[i], towctrans(frase[i], wctrans("toupper")));
    }
}
I know I can do this conversion with libiconv, but I was trying to find out possible already defined wctrans functions.
While the standard allows for implementation-defined or locale-defined transformations via wctrans, I'm not aware of any existing implementations that offer such a feature, and it's certainly not widespread. The iconv approach of //TRANSLIT is also non-standard and in fact conflicts with the standard: POSIX requires a charset name containing a slash character to be interpreted as a pathname to a charmap file, so use of slash for specifying translit-mode is non-conforming.
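For completeness, if you do use iconv from C, glibc's iconv(3) accepts the same //TRANSLIT suffix as the command-line tool. A minimal sketch (relying on glibc-specific behavior, with error handling mostly omitted):

#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, ""); /* transliteration rules come from the locale */
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "Amélia";
    char out[64] = ""; /* zero-initialized, so the result stays NUL-terminated */
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv(cd, &inp, &inleft, &outp, &outleft); /* "Amélia" -> "Amelia" */
    printf("%s\n", out);
    iconv_close(cd);
    return 0;
}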
Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However, chars like ä and ö take up twice the space and use two elements.
This is where my problem lies: I don't know how to access those special chars.
In my example the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char* hmm = "äö";
    printf("%c\n", hmm[0]); // i want to print "ä"
    printf("%i\n", strlen(hmm));
    return 0;
}
Thanks, I tried to run my attached code in Eclipse, and there it works. I assume it's because it uses 64 bits and the "ä" has enough space to fit; strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char* hmm = "äüö";
    printf("%c\n", hmm[0]);
    printf("%c\n", hmm[1]);
    printf("%c\n", hmm[2]);
    return 0;
}
A char always uses one byte.
In your case you think that "ä" is one char: wrong.
Open your .c source code with a hexadecimal viewer and you will see that ä uses 2 chars, because the file is encoded in UTF-8.
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    const wchar_t hmm[] = L"äö";

    setlocale(LC_ALL, "");
    wprintf(L"%ls\n", hmm);
    wprintf(L"%lc\n", hmm[0]);
    wprintf(L"%zu\n", wcslen(hmm)); /* wcslen() returns size_t, so %zu */
    return 0;
}
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <string.h>
#include <stdlib.h> /* needed for mblen() */
#include <locale.h>

int main(void)
{
    char* hmm = "äö";
    int off = 0;
    int len;
    int max = strlen(hmm);

    setlocale(LC_ALL, "");
    printf("<<%s>>\n", hmm);
    printf("%zu\n", strlen(hmm));
    while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
    {
        printf("<<%.*s>>\n", len, &hmm[off]);
        off += len;
    }
    return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
Sorry to drag this on, but I think it's important to highlight some issues. As I understand it, OS X can have UTF-8 as its default OS code page, so this answer is mostly about Windows, which uses UTF-16 under the hood and whose default ANSI code page (ACP) depends on the configured OS region.
Firstly, you can open Character Map and find that both
äö
reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer of using setlocale(LC_ALL, "") does not give insight into why äö was rendered incorrectly in the command prompt window.
Command Prompt uses its own code pages, namely the OEM code pages; Microsoft documents the available OEM code pages together with their character maps.
Typing the chcp command in a command prompt will reveal the OEM code page it is currently using.
Microsoft's documentation for setlocale(LC_ALL, "") describes the following behavior:
setlocale( LC_ALL, "" );
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can do this manually by using chcp and passing your required code page, then running your application; it should then output the text perfectly fine.
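For example (assuming your program was built for the Windows-1252 ANSI code page; the executable name here is just a placeholder):
chcp 1252
myprogram.exe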
If it were a multi-byte character set problem, there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, given that the length was 4 instead of 2, I would say the file has been saved as UTF-8. (It could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later.) You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each of those Unicode code points as two bytes. Your compiler opens the file assuming your default OS code page or ANSI C and, when parsing your string, interprets it as an ANSI string with 1 byte = 1 character.
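You can confirm this by dumping the raw bytes of the literal (a quick check, assuming the source file is saved as UTF-8):

#include <stdio.h>

int main(void)
{
    const char *hmm = "äö"; /* stored as UTF-8: C3 A4 C3 B6 */
    for (const char *p = hmm; *p != '\0'; p++)
        printf("%02X ", (unsigned char)*p); /* prints: C3 A4 C3 B6 */
    printf("\n");
    return 0;
}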
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support in the ASCII/MBCS stdio functions.
For Mac OS X, which has UTF-8 as its default OS code page, I would recommend following Jonathan Leffler's solution, because it is more elegant. If you port the program to Windows later, though, you will find you need to convert the string from UTF-8 to UTF-16 using the example below.
In either solution you will still need to change the command prompt's code page to your operating system's code page in order to print the characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <locale>

// File saved as UTF-8, with characters outside the ASCII range
int main()
{
    // Set the C locale to the user-default ANSI code page
    setlocale(LC_ALL, "");

    // äö reside outside the ASCII range, in the Western Latin-1 Unicode block;
    // thus each code point takes two bytes when saved as UTF-8
    char* hmm = "äö";
    printf("UTF-8 file string using Windows 1252 code page read as: %s\n", hmm);
    printf("Length: %zu\n", strlen(hmm));

    // Convert the UTF-8 string to wide characters (first call sizes the buffer)
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    LPWSTR lpszW = new WCHAR[nLen];
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);

    // Print it
    wprintf(L"wprintf wide character of UTF-8 string: %s\n", lpszW);

    // Free the memory
    delete[] lpszW;

    int c = getchar();
    return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your command prompt font/code page to make sure that it can display your OS's single-byte encoding. Note that the command prompt has its own code page, which differs from your text editor's.