I have a problem trying to read extended ASCII chars in NCURSES.
I have this program:
#include <ncurses.h>
int main () {
    initscr();
    int d = getch();
    mvprintw(0, 0, "letter: %c.", d);
    refresh();
    getch();
    endwin();
    return 0;
}
I build it with: gcc -lncursesw a.c
If I type a 7-bit ASCII character, such as 'e', I get:
letter: e.
And then I have to type another for the program to end.
If I type an extended ASCII character, such as 'á', I get:
letter: .
and the program ends.
It's as if the second byte is read as another character.
How can I get the correct char 'á'?
Thanks!
The characters that you want to type require the program to set up the locale. As described in the manual:
Initialization
The library uses the locale which the calling program has
initialized. That is normally done with setlocale:
setlocale(LC_ALL, "");
If the locale is not initialized, the library assumes that
characters are printable as in ISO-8859-1, to work with
certain legacy programs. You should initialize the locale
and not rely on specific details of the library when the
locale has not been setup.
Past that, it is likely that your locale uses UTF-8. To work with UTF-8, you should compile and link against the ncursesw library.
Further, the getch function only returns values for single-byte encodings, such as ISO-8859-1, which some people confuse with Windows cp1252 and, from there, with "Extended ASCII" (two fallacies that do not cancel out). UTF-8 is a multibyte encoding. If you use getch to read that, you will get only the first byte of the character; the remaining bytes are returned by later calls, which is why your second getch returned immediately.
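To make "the first byte of the character" concrete, here is a minimal sketch of how the bytes of one UTF-8 sequence combine into a code point. The function name utf8_first_code_point is hypothetical (it is not part of ncurses), and the sketch is simplified: it does not reject overlong forms or surrogates.

```c
/* Hypothetical helper, not an ncurses function: decode the first UTF-8
 * sequence of a NUL-terminated string. Returns the code point, or -1 if
 * the sequence is malformed or cut short.
 * Simplification: overlong forms and surrogates are not rejected. */
long utf8_first_code_point(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    unsigned char lead = p[0];
    int extra;   /* number of continuation bytes expected */
    long cp;     /* payload bits collected so far */

    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else
        return -1;   /* stray continuation byte or invalid lead byte */

    for (int i = 1; i <= extra; i++) {
        /* A terminating NUL also fails this test, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}
```

For 'á' the two bytes 0xC3 0xA1 decode to 0xE1 (U+00E1); reading only 0xC3, as a single getch does, is not enough to know which character was typed.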
Instead, to read UTF-8, you should use get_wch (unless you want to decode the UTF-8 yourself). Here is a revised program which does that:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>
int
main(void)
{
    wint_t value;
    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x.", value);
    refresh();
    getch();
    endwin();
    return 0;
}
I printed the result as a number, because printw does not know about Unicode values. printw uses the same C runtime support as printf, so you may be able to print the value directly. For instance, I see that POSIX printf has a formatting option for handling wint_t:
c
The int argument shall be converted to an unsigned char, and the resulting byte shall be written.
If an l (ell) qualifier is present, the wint_t argument shall be converted as if by an ls conversion specification with no precision and an argument that points to a two-element array of type wchar_t, the first element of which contains the wint_t argument to the ls conversion specification and the second element contains a null wide character.
ncurses runs on many platforms, and not all of them support this feature. But you can probably assume that it works with the GNU C library: most distributions routinely provide workable locale configurations.
Doing that, the example is more interesting:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>
int
main(void)
{
    wint_t value;
    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x (%lc).", value, value);
    refresh();
    getch();
    endwin();
    return 0;
}
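Because %lc support depends on the C library and the active locale, one defensive variant is to format into a buffer first and fall back to the number alone when the conversion fails. This is only a sketch: describe_letter is a hypothetical helper of my own, not an ncurses function, and it assumes that snprintf reports a failed %lc conversion with a negative return value, as the GNU C library does.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not part of ncurses): build the same text as the
 * example above, but drop the "(%lc)" part when the C library cannot
 * convert the wide character in the current locale. */
const char *describe_letter(wint_t value)
{
    static char text[64];
    int n = snprintf(text, sizeof text, "letter: %#x (%lc).",
                     (unsigned int)value, value);

    if (n < 0 || (size_t)n >= sizeof text) {
        /* Conversion failed (assumed: negative return) or was truncated. */
        snprintf(text, sizeof text, "letter: %#x.", (unsigned int)value);
    }
    return text;
}
```

In the program above, the result could then be shown with mvprintw(0, 0, "%s", describe_letter(value)).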
Related
I would like to store the result of reading in a wide string using swscanf into a narrow string, but I would like to read all input until a new-line character. I have the following simple example code to demonstrate what I mean:
#include <stdio.h>
#include <wchar.h>
int main()
{
    const wchar_t* source_string = L"Hello world\n";
    char new_string[100];
    swscanf(source_string, L"%h[^\n]", new_string);
    wprintf(L"%hs", new_string);
    return 0;
}
This works fine with gcc, but produces garbage output with MSVC, with a warning:
warning C4475: 'swscanf' : length modifier 'h' cannot be used with type field character ']' in format specifier
Note that wprintf above is just to test and see the output.
Is there a way around this? Is this just MSVC not complying with the actual ANSI C printf/scanf standards?
In standard C, the scan-set %[...] stores narrow characters by default, even for swscanf; only %l[...] stores wide characters.
The h prefix is not defined for %[...] in standard C, so the conforming spelling is %[^\n] rather than %h[^\n].
Also note that the h prefix for %s (i.e. %hs) is an MSVC extension; it is not defined in standard C.
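As a minimal sketch of the standard spelling, the helper below reads up to the first new-line with a plain %[^\n]. The name first_line_narrow is hypothetical, and the sketch assumes a conforming swscanf such as the one in glibc; I have not checked it against MSVC, whose defaults for wide format strings differ. Note that the field width counts wide characters, so the narrow buffer is sized for the worst-case multibyte expansion.

```c
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: copy the first line of a wide string into a narrow
 * (multibyte) buffer. Returns "" when the input starts with a new-line or
 * the conversion fails. */
const char *first_line_narrow(const wchar_t *source)
{
    /* The width 99 limits wide characters read; each one may expand to
     * as many as MB_LEN_MAX bytes when converted. */
    static char line[99 * MB_LEN_MAX + 1];

    if (swscanf(source, L"%99[^\n]", line) != 1)
        line[0] = '\0';
    return line;
}
```

With the string from the question, first_line_narrow(L"Hello world\n") yields "Hello world", which can be printed with an ordinary %s.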
I have to store the letter ñ in a char[] and I have not been able to do it. I tried this:
char example[1];
example[0] = 'ñ';
When compiling I get this:
$ gcc example.c
error: character too large for enclosing character literal type
example[0] = 'ñ';
Does anyone know how to do this?
If you're using High Sierra, you are presumably using a Mac running macOS 10.13.3 (High Sierra), the same as me.
This comes down to code sets and locales — and can get tricky. Mac terminals use UTF-8 by default and ñ is Unicode character U+00F1, which requires two bytes, 0xC3 and 0xB1, to represent it in UTF-8. And the compiler is letting you know that one byte isn't big enough to hold two bytes of data. (In the single-byte code sets such as ISO 8859-1 or 8859-15, ñ has character code 0xF1 — 0xF1 and U+00F1 are similar, and this is not a coincidence; Unicode code points U+0000 to U+00FF are the same as in ISO 8859-1. ISO 8859-15 is a more modern variant of 8859-1, with the Euro symbol € and 7 other variations from 8859-1.)
Another option is to change the character set that your terminal works with; you need to adapt your code to suit the code set that the terminal uses.
You can work around this by using wchar_t:
#include <wchar.h>
void function(void);
void function(void)
{
    wchar_t example[1];
    example[0] = L'ñ';
    putwchar(example[0]);
    putwchar(L'\n');
}
#include <locale.h>
int main(void)
{
    setlocale(LC_ALL, "");
    function();
    return 0;
}
This compiles. If you omit the call to setlocale(LC_ALL, ""), it doesn't work as I want: it writes just the byte \361 in octal (0xF1) and a newline, which shows up as a ? on the terminal. With setlocale(), it writes two bytes (\303\261 in octal, 0xC3 and 0xB1) and you see ñ on the console output.
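The two-byte claim can be checked without any locale machinery. The sketch below encodes a code point as UTF-8 by hand; utf8_encode is a hypothetical name, the static buffer keeps the example short at the cost of not being re-entrant, and surrogates are not rejected.

```c
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as a NUL-terminated
 * UTF-8 string in a static buffer. Returns "" for values above U+10FFFF
 * (and, as a side effect of NUL termination, for code point 0).
 * Simplification: surrogates are not rejected. */
const char *utf8_encode(unsigned long cp)
{
    static char out[5];
    size_t n = 0;

    if (cp < 0x80) {
        out[n++] = (char)cp;
    } else if (cp < 0x800) {
        out[n++] = (char)(0xC0 | (cp >> 6));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out[n++] = (char)(0xE0 | (cp >> 12));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out[n++] = (char)(0xF0 | (cp >> 18));
        out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    }
    out[n] = '\0';
    return out;
}
```

utf8_encode(0xF1) produces the bytes 0xC3 0xB1, so a buffer for ñ needs two chars plus one for the terminating NUL.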
You can use "extended ASCII". This chart shows that 'ñ' can be represented in extended ASCII as 164 (that is its value in IBM code pages 437 and 850; in ISO 8859-1 it is 241).
example[0] = (char)164;
You can print this character just like any other character
putchar(example[0]);
As noted in the comments above, this will depend on your environment. It might work on your machine but not another one.
The better answer is to use Unicode with a wide character literal, for example:
wchar_t example = L'\u00F1';
This really depends on which character set / locale you will be using. If you want to hardcode this as a Latin-1 character, this example program does that:
#include <stdio.h>
int main() {
    char example[2] = {'\xF1'};   /* second element is zero: the terminator */
    printf("%s", example);
    return 0;
}
This, however, results in this output on my system that uses UTF-8:
$ ./a.out
�
So if you want to use non-ascii strings, I'd recommend not representing them as char arrays directly. If you really need to use char directly, the UTF-8 sequence for ñ is two chars wide, and can be written as such (again with a terminating '\0' for good measure):
char s[3] = {"\xC3\xB1"};
My main language is Portuguese, so we have some accented words (with characters such as á é í ó ú). I'm trying to read and store those characters in a variable, but it just doesn't work. If I set the value in the code it works, but if I ask the user for input it doesn't. Example code:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main(int argc, char *argv[]) {
    setlocale(LC_ALL, "Portuguese");
    char test, test2; //The same still happens using unsigned char
    test = 'í';
    printf("Character: %c\n", test);
    scanf(" %c", &test2); //The same still happens using fgets in case of a string
    printf("Character: %c\n", test2);
    system("pause");
    return 0;
}
When compiled and executed the code shows:
Character: í
(wait for input, example:) í
Character: ¡
if input is 'á' it prints ' '(space), 'é' prints ', ó prints '¢' and ú prints '£'.
I'm new to programming and to Stack Overflow, so sorry for any mistakes I made. Any help is appreciated, thank you.
Also, I'm using Dev-C++ to compile, if that makes any difference.
You need to recognize that a char in C is a numeric type of size 1 byte. It is not really intended to hold a single character of a language (sometimes called a code point).
You have two options to deal with this situation:
1. Use a character encoding that is single byte (e.g. the proper
   member of the ISO 8859 family, ISO 8859-1 in your case). This
   will ensure that every character fits into a single byte.
2. Deal with your input using proper mechanisms for multibyte
   characters. You might look at the char16_t or char32_t types, or
   turn to wchar_t and its related library routines.
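For the second option, here is a minimal sketch of stepping through a multibyte string one character at a time with mbrtowc. The function name count_characters is hypothetical. It assumes that the program has called setlocale(LC_ALL, "") and that the locale's encoding really matches the bytes it receives, which is exactly what has to be verified on the asker's console.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters (not bytes) in a multibyte
 * string according to the current locale. Returns -1 if the bytes are
 * not valid in that locale's encoding. */
long count_characters(const char *text)
{
    mbstate_t state;
    size_t left = strlen(text);
    long count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    while (left > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, text, left, &state);

        if (used == (size_t)-1 || used == (size_t)-2)
            return -1;                 /* invalid or incomplete sequence */
        if (used == 0)
            break;                     /* NUL; cannot occur within strlen bytes */
        text += used;
        left -= used;
        count++;
    }
    return count;
}
```

Each wc obtained this way holds one whole character, so an accented letter is never split across two variables the way it can be with a single char.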
Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). Neither "lowercase" nor "uppercase" is printed. Only the letter 'я' goes wrong.
Other letters are classified as lowercase or uppercase just fine.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = 'я';
    int i;
    char z;
    for (i = 7; i >= 0; i--) {
        z = 1 << i;
        if ((z & c) == z) printf("1"); else printf("0");
    }
    printf("\n");
    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}
Why do neither islower() nor isupper() work on the letter я?
The answer is that the encoding for the lower case version of that character in CP 1251 is decimal 255. Stored in a plain char that is signed, it reaches islower() and isupper() as -1, which is commonly the value of EOF, so neither function reports it as a letter.
You need to track down the source code for the runtime library to see what it does and why.
The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
Igor, if your file is UTF-8 it makes no sense to try to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other hand, -fexec-charset=cp1251 only affects the characters stored in the executable; you have not specified the input charset used for the string literals in your source code (GCC has -finput-charset for that). Probably, the compiler is determining that from the environment (which you have set in your LANG or LC_CTYPE environment variables).
Only once you control exactly which locale is used at each stage will you get coherent results.
The main reason an effort is being made to move every country to a common character set (Unicode, typically encoded as UTF-8) is precisely to avoid having to deal with all these locale settings at each stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and when you receive a document encoded in UTF-8 you'll have to convert it to see it correctly.
I strongly recommend switching to UTF-8, as it is an encoding that supports every country's character set, but in the end that decision is yours.
NOTE
On debian linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>
#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)
int main()
{
setlocale(LC_ALL, "");
Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
execution with ru_RU.CP1251 locale
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So, glibc is not faulty, the fault is in your code.
Jonathan Leffler's first comment on the question is correct. The isxxx() (and iswxxx()) functions are required to accept EOF (WEOF) as an argument (probably to be fool-proof).
This is why int was chosen as the argument type. When we pass an argument of type char or a character literal, it is promoted to int (preserving the sign). Because the char type and character literals are signed by default with gcc on x86, 0xFF becomes -1, which by unhappy coincidence is the value of EOF.
Therefore always use an explicit cast when passing a char (or a character literal with code 0xFF) to functions that take an int argument; don't count on char being unsigned, because that is implementation-defined. The cast can be written either as (unsigned char) or as (uint8_t), which is less to type (you must include stdint.h).
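The cast described above can be kept in one place with small wrappers. This is a minimal sketch; the names is_lower_char and is_upper_char are hypothetical and not part of any library.

```c
#include <ctype.h>

/* Hypothetical wrappers: classify a plain char safely. The cast to
 * unsigned char keeps byte 0xFF as 255 instead of letting a signed
 * char promote it to -1, which would collide with EOF. */
int is_lower_char(char c)
{
    return islower((unsigned char)c) != 0;
}

int is_upper_char(char c)
{
    return isupper((unsigned char)c) != 0;
}
```

In the program from the question, calling is_lower_char(c) instead of islower(c) passes 255 to the library, matching the Q(0xff) test shown earlier; the answer still depends on the locale selected with setlocale.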
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
I would like to convert (transliterate) UTF-8 characters to their closest match in ASCII in C. For example, ú is transliterated to u. I can do that on the command line with iconv -f utf-8 -t ascii//TRANSLIT.
In C, there is a function towctrans to do that, but I only found documentation for two possible mappings: to lower case and to upper case (see man wctrans). According to the documentation, wctrans depends on LC_CTYPE. But what other mappings (other than "tolower" and "toupper") are available for a specific LC_CTYPE value?
A simple example with towctrans and the basic toupper transliteration:
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>   /* declares wctrans and towctrans */
#include <locale.h>
int main() {
    wchar_t frase[] = L"Amélia";
    size_t i;
    setlocale(LC_ALL, "");
    for (i = 0; i < wcslen(frase); i++) {
        printf("%lc --> %lc\n", (wint_t)frase[i], towctrans(frase[i], wctrans("toupper")));
    }
    return 0;
}
I know I can do this conversion with libiconv, but I was trying to find out possible already defined wctrans functions.
While the standard allows for implementation-defined or locale-defined transformations via wctrans, I'm not aware of any existing implementations that offer such a feature, and it's certainly not widespread. The iconv approach of //TRANSLIT is also non-standard and in fact conflicts with the standard: POSIX requires a charset name containing a slash character to be interpreted as a pathname to a charmap file, so use of slash for specifying translit-mode is non-conforming.
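Since the set of mapping names is left to the implementation, a program can at least probe for a name at run time: wctrans returns zero for a name the current locale does not define. The helper names below are hypothetical, and "no-such-mapping" in the usage note is just a made-up name used to show the failure path.

```c
#include <wctype.h>

/* Hypothetical helper: report whether the current LC_CTYPE locale
 * defines a mapping with this name. */
int has_wide_mapping(const char *name)
{
    return wctrans(name) != (wctrans_t)0;
}

/* Hypothetical helper: apply the named mapping if it exists; otherwise
 * return the character unchanged rather than calling towctrans with an
 * invalid descriptor. */
wint_t map_or_keep(wint_t wc, const char *name)
{
    wctrans_t mapping = wctrans(name);

    return mapping != (wctrans_t)0 ? towctrans(wc, mapping) : wc;
}
```

For example, map_or_keep(L'a', "toupper") gives L'A', while map_or_keep(L'a', "no-such-mapping") gives back L'a'. This only discovers whether a given name exists; it does not provide transliteration on its own.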