scanf/printf calls mess up non-ASCII chars after locale is set - c

I wonder why locale-dependent input/output of non-ASCII characters fails:
setlocale(LC_ALL,"");
scanf("%s",buffer); // I type "příšerně"
printf("%s",buffer); // I get "pýˇçernŘ"
The locale is Czech_Czech Republic.1250 and all the non-ASCII chars (říšě) are in CP1250. Why does it fail? The reference says:
In <cstdio> (<stdio.h>), formatted input/output operations are
affected by character transformation rules.
Using the default "C" locale gives correct output. How can I fix it? On Windows I can't use UTF-8 in setlocale:
If you provide a code page value of UTF-7 or UTF-8, setlocale will
fail, returning NULL.
In my project I use setlocale to read a UTF-8 text file and display it on the console using the WinAPI MultiByteToWideChar function, but that requires the system default locale, so I need to set the locale.
edit: I just found that the input is in CP852, which is the default in the "C" locale. I suppose I could use iconv, but I'd rather convince scanf not to stick with CP852.

After 3 hours of testing, I finally got a working solution. It might not work for everybody, since there is still a little mystery behind it. This is what helped:
setlocale(LC_CTYPE,"Czech_Czech Republic.852");
CP852 has been the default console codepage for Central Europe since DOS times. There is also the chcp DOS command and the SetConsoleCP and SetConsoleOutputCP WinAPI functions. For some reason, this still messes up the output:
setlocale(LC_CTYPE,"Czech_Czech Republic.1250");
SetConsoleCP(1250);
SetConsoleOutputCP(1250);
...but this is OK
setlocale(LC_CTYPE,"Czech_Czech Republic.852");
SetConsoleCP(852); // default in CE win console
SetConsoleOutputCP(852); // default in CE win console
Note that UTF-8 can't be set in setlocale, see the original question.

Related

What locale LC_CTYPE is used for Windows unicode console app?

While converting a multi-byte console application to Unicode, I ran up against a weird problem where _tcprintf and WriteConsole worked fine but _tprintf was printing the wrong characters...
I've traced it back to using setlocale(LC_ALL, "C"), which uses an LC_CTYPE where char is 1 byte, based on the MS docs:
The C locale assumes that all char data types are 1 byte and that their value is always less than 256.
However, I want to keep "C" for everything except the LC_CTYPE but I don't know what to use?
I thought the whole point of using UTF-16 was that all the characters are available and things would print properly no matter the code page or locale.
It also appears that setting the console output to UTF-8 (65001) via SetConsoleCP (which of course is separate from the locale) in a Unicode app and outputting UTF-16 has problems displaying the correct characters.
Anyway, does anyone know what value I should use for LC_CTYPE with UTF-16 in a Windows Unicode console application? Maybe it's as easy as setlocale(LC_CTYPE, ""); ? TIA!
Use _setmode() to set the file translation mode to _O_U16TEXT:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"ελληνικά\n");
}

Read / Write special characters (like tilde, ñ,...) in a console application C

I'm trying to make a C console application read (from the keyboard) special Spanish characters such as accents, 'ñ', etc. with scanf or gets, and then print them with printf.
I have managed to show these characters correctly (stored in a variable or directly from printf) thanks to the locale.h header. Here is an example:
#include <stdio.h>
// Add the locale header
#include <locale.h>

int main(void)
{
    char string[254];

    // Set language to Spanish
    setlocale(LC_ALL, "spanish");

    // Show Spanish special chars correctly
    printf("¡Success! Special chars like 'ñ' or 'á' are shown.\n\n\n");

    // Get special chars from the keyboard
    printf("Input Spanish special chars (such as 'ñ'): ");
    fgets(string, sizeof string, stdin); // gets() is unsafe and was removed in C11
    printf("Your string is: %s", string);
    return 0;
}
but I have not yet managed to read them correctly with the functions mentioned above.
Does anyone know how to do it?
Thank you.
EDIT 1:
In testing, I observed that:
setlocale(LC_ALL, "spanish"); shows the Spanish special characters correctly, but does not read them correctly from the keyboard.
setlocale(LC_ALL, "es_ES"); reads the Spanish special characters correctly from the keyboard, but does not show them well.
EDIT 2:
I have also tried setlocale(LC_ALL, "");, setlocale(LC_ALL, "es_ES.UTF-8"); and setlocale(LC_ALL, "es_ES.ISO_8859-15");, with the same results as in EDIT 1 (characters are either read correctly from the keyboard or shown correctly in the console, but never both at the same time).
Microsoft's C runtime library (CRT) does not support UTF-8 as the locale encoding. It only supports Windows codepages. Also, "es_ES" isn't a valid CRT locale string, so setlocale would fail, leaving you in the default C locale. Newer versions of Microsoft's CRT support Windows locale names such as "es-ES" (hyphen, not underscore). Otherwise the CRT uses the full names or the old 3-letter abbreviations, e.g. "spanish_spain", "esp_esp" or "esp_esp.1252".
But that's not the end of the story. When reading from and writing to the console using legacy text encodings instead of Unicode, there's another layer of translation in the console itself. To avoid mojibake, you have to set the console input and output codepages (i.e. SetConsoleCP and SetConsoleOutputCP) to match the locale codepage. If you're limited to Spanish or Latin-1, then it should work to set the locale to "spanish" and set the console codepages via SetConsoleCP(1252) and SetConsoleOutputCP(1252). More generally you could look up the ANSI codepage for a given locale name, set the console codepages, and save them in order to reset the console at exit. For example:
#include <locale.h>
#include <stdlib.h>
#include <windows.h>

static UINT gPrevConsoleCP;
static UINT gPrevConsoleOutputCP;

static void reset_console(void)
{
    SetConsoleCP(gPrevConsoleCP);
    SetConsoleOutputCP(gPrevConsoleOutputCP);
}

// ... in main() or an init function:
wchar_t *locale_name = L"es-ES";
if (_wsetlocale(LC_ALL, locale_name)) {
    int codepage;
    gPrevConsoleCP = GetConsoleCP();
    if (gPrevConsoleCP) { // The process is attached to a console.
        gPrevConsoleOutputCP = GetConsoleOutputCP();
        if (GetLocaleInfoEx(locale_name,
                            LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                            (LPWSTR)&codepage,
                            sizeof(codepage) / sizeof(wchar_t))) {
            if (!codepage) { // The locale doesn't have an ANSI codepage.
                codepage = GetACP();
            }
            SetConsoleCP(codepage);
            SetConsoleOutputCP(codepage);
            atexit(reset_console);
        }
    }
}
That said, when working with the console you will generally be better off setting stdin and stdout to _O_U16TEXT mode and using wide-character functions such as fgetws and wprintf. Ultimately, if supported by the C runtime library, this uses the wide-character console I/O functions ReadConsoleW and WriteConsoleW. The downside of UTF-16 wide-character mode is that it entails a complete rewrite of your code to use wchar_t strings and wide-character functions, and it also requires implementing adapters for libraries that work with multibyte-encoded strings (preferably UTF-8).

C Programming - ASCII for Windows "unknown" characters

I'm programming on Windows, but in my C console some characters (like é, à, ã) are not recognizable. I would like to know how I can make Windows interpret those chars correctly, using Unicode or UTF-8 in the console.
I would be glad for some enlightening.
Thank you very much
By console do you mean cmd.exe? It doesn't handle Unicode well, but you can get it to display "ANSI" characters by changing the display font to Lucida Console and changing the code page from "OEM" to "ANSI." By the choice of characters you seem to be Western European, so try giving this command before running your application:
chcp 1252
If you want to try your luck with UTF-8 output use chcp 65001 instead.
Although I completely agree with Joni's answer, I think a detail can be added:
Since Telmo Vaz asked how to solve this problem for C programs, we can consider the alternative of adding a system command inside the code:
#include <stdlib.h> // For system()
#include <stdio.h>

int main(void) {
    system("CHCP 1252");
    printf("Now accents are right: áéíüñÇ\n");
    return 0;
}
EDIT: It is a good idea to experiment with codepages. Check the following table for information (under Windows):
Windows Codepages

wprintf with UNICODE (Hebrew) characters

I have a wchar_t array with English and Hebrew characters, and when I print it with wprintf() only the English characters are printed to the console. When I use _wsetlocale(LC_ALL, L"Hebrew") I get the Hebrew characters as "????".
The machine I'm working on supports Hebrew of course.
BTW - using c:\windows\system32\cmd.exe and running 'dir' on a directory with Hebrew filenames also shows "???" instead of Hebrew.
Any idea?
Have you confirmed that your console font can handle Unicode characters? Most don't. You might try the Consolas font.
When I've run into this before, I've found this article by Michael Kaplan to be extremely helpful.
Basically Microsoft's C runtime library isn't implemented very well to allow this.
You can do _setmode(_fileno(stdout), _O_U16TEXT); and then writing with wcout or wprintf will work. However, trying to use cout or printf, or anything else that doesn't write UTF-16, will then cause the program to crash.

Print Arabic text in Vala

I tried
print ("السلام عليكم\n");
it outputs
?????? ?????
After looking at the generated C code
...
g_print ("السلام عليكم\n");
...
it appears that Vala uses g_print(), which does not behave the same as printf() in C; printf works perfectly fine with Arabic.
So, is there any way to print Arabic text in Vala?
Just add this to the start of your code:
Intl.setlocale (LocaleCategory.ALL, "");
By leaving the second parameter as an empty string you load the locale that the current user has set (which is likely to be a UTF-8 based one on modern Linux systems).
Windows is a different story here ...
See also:
https://valadoc.org/glib-2.0/GLib.Intl.setlocale.html
printing utf8 in glib
https://en.wikipedia.org/wiki/C_localization_functions
http://en.cppreference.com/w/c/locale/setlocale
