Read / Write special characters (like tilde, ñ, ...) in a C console application

I'm trying to make a C console application read (from the keyboard) special Spanish characters such as accents, 'ñ', etc. with scanf or gets, and then print them with printf.
I have managed to display these characters correctly (stored in a variable or directly from printf) thanks to the locale.h header. Here is an example:
#include <stdio.h>
// Add language support
#include <locale.h>

int main(void)
{
    char string[254];
    // Set language to Spanish
    setlocale(LC_ALL, "spanish");
    // Show Spanish special chars correctly
    printf("¡Success! Special chars like 'ñ' or 'á' are shown.\n\n\n");
    // Get special chars from the keyboard
    printf("Input Spanish special chars (such as 'ñ'): ");
    gets(string);
    printf("Your string is: %s", string);
    return 0;
}
but I have not yet managed to read them correctly with the functions mentioned above.
Does anyone know how to do it?
Thank you.
EDIT 1:
In testing, I observed that:
setlocale(LC_ALL, "spanish"); displays the Spanish special characters correctly, but does not read them correctly from the keyboard.
setlocale(LC_ALL, "es_ES"); reads the Spanish special characters correctly from the keyboard, but does not display them correctly.
EDIT 2:
I have also tried setlocale(LC_ALL, "");, setlocale(LC_ALL, "es_ES.UTF-8"); and setlocale(LC_ALL, "es_ES.ISO_8859-15"); with the same results as EDIT 1 (either the characters are read correctly from the keyboard or they are shown correctly in the console, but never both at the same time).

Microsoft's C runtime library (CRT) does not support UTF-8 as the locale encoding. It only supports Windows codepages. Also, "es_ES" isn't a valid CRT locale string, so setlocale would fail, leaving you in the default C locale. Newer versions of Microsoft's CRT support Windows locale names such as "es-ES" (hyphen, not underscore). Otherwise the CRT uses the full names or the old 3-letter abbreviations, e.g. "spanish_spain", "esp_esp" or "esp_esp.1252".
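To see this failure mode directly, here is a minimal sketch (the exact strings accepted depend on your CRT version) that checks the return value of setlocale to find out whether a locale was actually applied:
#include <locale.h>
#include <stdio.h>

int main(void)
{
    // "es_ES" is not a valid CRT locale string, so this returns NULL
    // and the program silently stays in the default "C" locale.
    if (!setlocale(LC_ALL, "es_ES"))
        printf("setlocale(\"es_ES\") failed\n");

    // Forms the Microsoft CRT does accept:
    const char *loc = setlocale(LC_ALL, "spanish_spain.1252");
    printf("full name: %s\n", loc ? loc : "(failed)");

    loc = setlocale(LC_ALL, "es-ES"); // Windows locale name, newer CRTs only
    printf("locale name: %s\n", loc ? loc : "(failed)");
    return 0;
}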
But that's not the end of the story. When reading from and writing to the console using legacy text encodings instead of Unicode, there's another layer of translation in the console itself. To avoid mojibake, you have to set the console input and output codepages (i.e. SetConsoleCP and SetConsoleOutputCP) to match the locale codepage. If you're limited to Spanish or Latin-1, then it should work to set the locale to "spanish" and set the console codepages via SetConsoleCP(1252) and SetConsoleOutputCP(1252). More generally you could look up the ANSI codepage for a given locale name, set the console codepages, and save them in order to reset the console at exit. For example:
#include <windows.h>
#include <locale.h>
#include <stdlib.h>

static UINT gPrevConsoleCP;
static UINT gPrevConsoleOutputCP;

static void reset_console(void)
{
    // Restore the codepages the console had before we changed them.
    SetConsoleCP(gPrevConsoleCP);
    SetConsoleOutputCP(gPrevConsoleOutputCP);
}

// ... at startup:
wchar_t *locale_name = L"es-ES";
if (_wsetlocale(LC_ALL, locale_name)) {
    int codepage;
    gPrevConsoleCP = GetConsoleCP();
    if (gPrevConsoleCP) { // The process is attached to a console.
        gPrevConsoleOutputCP = GetConsoleOutputCP();
        if (GetLocaleInfoEx(locale_name,
                            LOCALE_IDEFAULTANSICODEPAGE |
                            LOCALE_RETURN_NUMBER,
                            (LPWSTR)&codepage,
                            sizeof(codepage) / sizeof(wchar_t))) {
            if (!codepage) { // The locale doesn't have an ANSI codepage.
                codepage = GetACP();
            }
            SetConsoleCP(codepage);
            SetConsoleOutputCP(codepage);
            atexit(reset_console);
        }
    }
}
That said, when working with the console you will be better off in general if you set stdin and stdout to use _O_U16TEXT mode and use wide-character functions such as fgetws and wprintf. Ultimately, if supported by the C runtime library, this should use the wide-character console I/O functions ReadConsoleW and WriteConsoleW. The downside of using UTF-16 wide-character mode is that it would entail a complete rewrite of your code to use wchar_t strings and wide-character functions and also would require implementing adapters for libraries that work with multibyte encoded strings (preferably UTF-8).
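Applied to the original question, a minimal sketch of that wide-character approach could look like this (assuming the Microsoft CRT; fgetws replaces gets and everything stays in UTF-16):
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    wchar_t string[254];
    // Put both console streams into UTF-16 mode.
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"Input Spanish special chars (such as 'ñ'): ");
    if (fgetws(string, 254, stdin)) {
        wprintf(L"Your string is: %ls", string);
    }
    return 0;
}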

Related

What locale LC_CTYPE is used for Windows unicode console app?

While converting a multi-byte console application to Unicode, I ran up against a weird problem where _tcprintf and WriteConsole worked fine but _tprintf was printing the wrong characters...
I've traced it back to using setlocale(LC_ALL, "C"), whose LC_CTYPE assumes 1-byte characters, per the MS docs:
The C locale assumes that all char data types are 1 byte and that their value is always less than 256.
However, I want to keep "C" for everything except the LC_CTYPE but I don't know what to use?
I thought the whole point of using UTF-16 was that all the characters are available and things would print properly no matter the code page or locale.
Although it also appears that setting the console output codepage to UTF-8 (65001) (SetConsoleOutputCP, which of course is separate from the locale) in a Unicode app and outputting UTF-16 also has problems displaying the correct characters.
Anyway, does anyone know what value I should use for LC_CTYPE for UTF-16 in a Windows Unicode console application? Maybe it's as easy as setlocale(LC_CTYPE, ""); ? TIA!!
Use _setmode() to set the file translation mode to _O_U16TEXT:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"ελληνικά\n");
}

scanf/printf calls mess up non-ASCII chars after locale is set

I wonder why locale-aware non-ASCII input/output fails:
setlocale(LC_ALL,"");
scanf("%s",buffer); // I type "příšerně"
printf("%s",buffer); // I get "pýˇçernŘ"
The locale is Czech_Czech Republic.1250 and all the non-ASCII chars (říšě) are in CP1250. Why does it fail? The reference says
In <cstdio> (<stdio.h>), formatted input/output operations are
affected by character transformation rules.
Using the default "C" locale gives correct output. How can I fix it? On Windows I can't use UTF-8 in setlocale:
If you provide a code page value of UTF-7 or UTF-8, setlocale will
fail, returning NULL.
In my project I use setlocale to read UTF8 text file and to display it on console using WinAPI MultiByteToWideChar function, but that requires system default locale, so I need to set the locale.
edit: I just found the input is in CP852, which is the default in the "C" locale. I suppose I could use iconv, but I'd rather convince scanf not to stick with CP852.
After 3 hours of testing, I finally got a working solution. It might not work for everybody, since there is still a little mystery behind it. So this helped:
setlocale(LC_CTYPE,"Czech_Czech Republic.852");
CP852 has been the default console codepage for Central Europe since DOS times. There is also the chcp DOS command and the SetConsoleCP and SetConsoleOutputCP WinAPI functions. For some reason, this still messes up the output:
setlocale(LC_CTYPE,"Czech_Czech Republic.1250");
SetConsoleCP(1250);
SetConsoleOutputCP(1250);
...but this is OK
setlocale(LC_CTYPE,"Czech_Czech Republic.852");
SetConsoleCP(852); // default in CE win console
SetConsoleOutputCP(852); // default in CE win console
Note that UTF-8 can't be set via setlocale; see the original question.
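Putting the working combination together as a complete program (a sketch; it assumes the Czech locale and codepage 852 are installed on the machine):
#include <locale.h>
#include <stdio.h>
#include <windows.h>

int main(void)
{
    char buffer[256];
    setlocale(LC_CTYPE, "Czech_Czech Republic.852");
    SetConsoleCP(852);          // default in CE win console
    SetConsoleOutputCP(852);    // default in CE win console
    printf("Type a word: ");
    if (scanf("%255s", buffer) == 1) // e.g. "příšerně"
        printf("You typed: %s\n", buffer);
    return 0;
}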

strange extended characters in ncurses

I'm writing an ncurses application, and there's a strange issue with how special characters are printed to the screen. Here's an example:
#include <ncurses.h>

int main(int argc, char *argv[])
{
    initscr();
    noecho();
    keypad(stdscr, TRUE);
    cbreak();
    curs_set(0);
    addch(ACS_LARROW);
    addch(' ');
    addch(ACS_UARROW);
    addch(' ');
    addch(ACS_DARROW);
    addch(' ');
    addch(ACS_RARROW);
    addch(' ');
    refresh();
    getch();
    endwin();
    return 0;
}
So, when I run this on a tty, the characters are correctly printed as arrows (←, ↑, ↓, →), but when I try to run this in a terminal emulator (I've tried gnome-terminal and LXTerminal), this is the output:
< ^ v >
Is there any reason for this difference? I thought it might be font-related, but I'm really out of my depth here, and my googling didn't help.
Any suggestion on how to force lxterminal (or any other terminal) to output the same characters as the tty?
ncurses decides which character to output for ACS_LARROW and friends based on your termtype. It's likely that in a tty your termtype is set to 'linux', whereas in gnome-terminal, etc it'll most likely be set to 'xterm'. Although I'm not certain, it's quite possible that the xterm termtype doesn't support these characters.
You could try running your application like so:
env TERM=linux ./a.out
Other termtypes to try are gnome, rxvt and vt102. These will output extended ASCII characters and your terminal emulator should support them. You could also try rxvt-unicode if you have it installed, that should output the correct unicode codepoints for these special symbols.
Terminal emulations are sometimes coded in 7-bit ASCII, and there is no corresponding value for the arrows (with line and wedge) in that code page, so the terminal displays the closest thing it has (that is: only the wedge).
On a tty you have all the capabilities of your computer (UTF-8, color encoding, ...), so the terminal can draw the arrow.
The setting of TERM is largely irrelevant. What matters is the locale, the choice of library, and the particular terminal emulator.
The usual reason for this is that the program is built with and linked to the ncurses library rather than ncursesw:
If you use the former (ncurses) in a locale using UTF-8, ncurses will use ASCII graphics simply because it cannot draw UTF-8; ncursesw can draw using UTF-8 characters.
Using ncursesw, the program will use UTF-8 as needed (since 2002 it can use either the terminfo information or a built-in table of Unicode values). The Linux console and screen are known special cases where the terminfo cannot describe the behavior of the terminal. For others (terminals which do not support VT100 line-drawing in UTF-8 mode), you would set NCURSES_NO_UTF8_ACS.
The terminal database does in fact have an extended capability addressing the lack of VT100 compatibility vs. UTF-8 (see the U8 note in the terminal database). However, most people set TERM to xterm or linux and often equate the two (see the note in the xterm FAQ).
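A minimal sketch of the ncursesw route (an assumption on my part, not code from the answer: it presumes a UTF-8 locale, and header/library names vary by distribution):
#include <locale.h>
#include <ncurses.h>

int main(void)
{
    setlocale(LC_ALL, "");  // required so ncursesw picks up the UTF-8 locale
    initscr();
    noecho();
    cbreak();
    addch(ACS_LARROW);
    addch(' ');
    addch(ACS_RARROW);
    refresh();
    getch();
    endwin();
    return 0;
}
Build with gcc main.c -lncursesw; with the locale set, the ACS_* symbols should come out as Unicode arrows on terminals that support them.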

wprintf with UNICODE (Hebrew) characters

I have a wchar_t array with English and Hebrew characters, and when I print it with wprintf() only the English characters reach the console. When I use _wsetlocale( LC_ALL, L"Hebrew" ) the Hebrew characters come out as "????".
The machine I'm working on supports Hebrew of course.
BTW - using c:\windows\system32\cmd.exe and 'dir' on a directory with Hebrew characters, also shows "???" instead of Hebrew.
Any idea?
Have you confirmed that your console font can handle Unicode characters? Most don't. You might try the Consolas font.
When I've run into this before, I've found this article by Michael Kaplan to be extremely helpful.
Basically, Microsoft's C runtime library isn't implemented well enough to allow this.
You can call _setmode(_fileno(stdout), _O_U16TEXT); and then writing with wcout or wprintf will work. However, trying to use cout or printf, or anything else that doesn't write UTF-16, will then cause the program to crash.
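One way around the crash (a sketch based on the behavior described above, not on the original answer) is to restore the previous translation mode before any byte-oriented output:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    // _setmode returns the previous mode, so it can be restored later.
    int prev = _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"English and עברית together\n");
    _setmode(_fileno(stdout), prev); // back to byte mode before printf
    printf("plain printf works again\n");
    return 0;
}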

Print Arabic text in Vala

I tried
print ("السلام عليكم\n");
it outputs
?????? ?????
After looking at the generated C code
...
g_print ("السلام عليكم\n");
...
it appears that it uses g_print(), which does not behave the same as printf() in C, which works perfectly fine with Arabic.
So, is there any way to print Arabic text in Vala?
Just add this to the start of your code:
Intl.setlocale (LocaleCategory.ALL, "");
By passing an empty string as the second parameter you load the locale that the current user has set (which is likely to be a UTF-8 one on modern Linux systems).
Windows is a different story here ...
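In the generated C, the fix corresponds to roughly the following (a sketch; it assumes GLib is available and the user's locale is UTF-8):
#include <locale.h>
#include <glib.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* what Intl.setlocale (LocaleCategory.ALL, "") becomes */
    g_print("السلام عليكم\n");
    return 0;
}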
See also:
https://valadoc.org/glib-2.0/GLib.Intl.setlocale.html
printing utf8 in glib
https://en.wikipedia.org/wiki/C_localization_functions
http://en.cppreference.com/w/c/locale/setlocale