Why does printing this wide character string crash on windows?

I stumbled upon a problem while going through some unit tests, and I am not entirely sure why the following simple example crashes on the line with sprintf (Using Windows with Visual Studio 2019).
#include <stdio.h>
#include <locale.h>
int main()
{
setlocale(LC_ALL, "en_US.utf8");
char output[255];
sprintf(output, "simple %ls text", L"\u00df\U0001d10b");
return 0;
}
Is there something wrong with the code?

char is 8 bits wide and, on Windows, wchar_t is 16 bits wide (UTF-16). To convert between the two on Windows, use functions such as MultiByteToWideChar and WideCharToMultiByte.
Passing a wide (Unicode) string to a multi-byte function such as sprintf forces a conversion that can go wrong, for example by overflowing the buffer, which might be the cause of your crash.
Try using swprintf_s instead.
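One way to sidestep the wide-to-multibyte conversion entirely is to format into a wide buffer. The sketch below is only an illustration: format_wide is a hypothetical helper name, and it uses the standard C99 swprintf (which takes an explicit buffer size) rather than the Microsoft-specific swprintf_s suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats "simple <arg> text" into a wide buffer.
 * Returns the number of wide characters written (excluding the
 * terminator), or a negative value if the result does not fit. */
int format_wide(wchar_t *out, size_t cap, const wchar_t *arg)
{
    assert(out != NULL && cap > 0 && arg != NULL);
    /* Wide format string, wide argument, wide destination: no
     * wchar_t -> char conversion is requested here. */
    return swprintf(out, cap, L"simple %ls text", arg);
}
```

Calling it as format_wide(buf, 64, L"\u00df\U0001d10b") with a wchar_t buf[64] would be the wide counterpart of the sprintf call in the question.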

Related

Multi-platform Unicode handling based on char* in C without using 3rd party libraries?

The following are bare minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:
Linux:
#include <stdio.h>
int main() {
char* str = "Rölf";
printf("%s\n", str);
}
Windows:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
wprintf(L"%s\n", str);
}
Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".
And that would be great - have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this would actually work, it would be amazing. But it doesn't:
char* str = u8"Rölf";
int len = mbstowcs(NULL, str, 0) + 1;
wchar_t wstr[len];
mbstowcs(wstr, str, len);
wprintf(L"%s\n", wstr);
I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue - the following doesn't work either (well - the L"" literal does. The converted u8 literal doesn't):
MessageBoxW(NULL, wstr, L"Rölf", MB_OK);
Am I misunderstanding the conversion process? Is there a way to make this work? (Without using e.g. ICU)

The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding is UTF-8). On Windows 10 builds from the April 2018 beta onward, you can actually configure Windows to use UTF-8 as the encoding for plain char[] strings, either as a global setting or presumably by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
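As a rough sketch of what a locale-based conversion with error checking could look like (locale_to_wide is a hypothetical helper name, not part of any library): mbstowcs returns (size_t)-1 when the bytes are not valid in the current locale's encoding. Storing that result in an int and adding 1, as in the question, turns the error into a length of 0 and hides it.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a string in the *current locale's*
 * encoding to a newly allocated wide string. Returns NULL on an invalid
 * byte sequence or allocation failure. The caller frees the result. */
wchar_t *locale_to_wide(const char *s)
{
    assert(s != NULL);
    size_t n = mbstowcs(NULL, s, 0);   /* length query, excludes terminator */
    if (n == (size_t)-1)
        return NULL;                   /* bytes are not valid in this locale */
    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    mbstowcs(w, s, n + 1);             /* n + 1 leaves room for the terminator */
    return w;
}
```

This only yields the expected text for UTF-8 input if the locale selected with setlocale actually uses UTF-8.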
Anyway, your second version of the code, which you called "Windows", would work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: they have the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio on the same stream (doing so invokes undefined behavior). So better would be:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
printf("%ls\n", str);
}
and this version should work correctly on Windows and standards-conforming C implementations, since for the byte-oriented printf functions, MSVC doesn't have the meaning of %s and %ls reversed.
If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to either:
Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode codepoints with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t (C explicitly forbids "multi-wchar_t characters" because the mb/wc APIs are inherently incompatible with them), not Unicode codepoint values (equivalent to UTF-32).
Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to a multibyte char[] string in the locale's encoding, which mbstowcs can then convert to wchar_t[].
Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
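For the "your own UTF-8 decoder" part of the first two options, a minimal sketch might look like the following. utf8_decode is a hypothetical name, and the output type is uint32_t rather than char32_t only to keep the example free of <uchar.h>; this is an illustration, not a hardened decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes NUL-terminated UTF-8 into code points.
 * Writes at most cap values to out and returns how many were written,
 * or (size_t)-1 on malformed input or if out is too small. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    while (*p != 0) {
        uint32_t cp, min;
        int extra;
        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        ++p;
        for (; extra > 0; --extra, ++p) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;    /* truncated sequence (also catches NUL) */
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;        /* overlong form, out of range, or surrogate */
        if (count == cap)
            return (size_t)-1;        /* output buffer too small */
        out[count++] = cp;
    }
    return count;
}
```

With the UTF-8 bytes of "Rölf" as input, this yields four code points, the second being U+00F6.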
UTF8 option for Windows 10, version 1803+

Thanks to Barmak Shemirani making me aware of MultiByteToWideChar, I've found a solution to this that even conforms to C99 (and works on Windows 7, by the way).
Note that setlocale() is only necessary for console output to render correctly. I didn't use it to highlight that it doesn't seem to be needed for GUI-related API calls.
#define UNICODE
#define _UNICODE
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <windows.h>
//#include <locale.h>
wchar_t* toWide(char* str) {
int wchars_num = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
wchar_t* wstr = (wchar_t*)malloc(sizeof(wchar_t) * wchars_num);
MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wchars_num);
return wstr;
}
int main() {
// For output in console to render correctly - as far as the font allows anyway...
//setlocale(LC_ALL, "");
// PLATFORM-AGNOSTIC DATA STRUCTURE WITH UTF-8 TEXT
// (Usually not directly next to the platform-specific API calls...)
char* str = "Rölf";
// PLATFORM-SPECIFIC TEXT HANDLING
wchar_t* wstr = toWide(str);
printf("%ls\n", wstr);
MessageBox(NULL, wstr, L"Rölf", MB_OK);
free(wstr);
}
The way I use it is that I declare a data structure to be filled by my users where all text is char* and assumed to be UTF-8. Then in my library, I use platform-specific UI APIs. And in the case of Windows, doing the above UTF-16 conversion is obviously necessary.
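MultiByteToWideChar is Windows-only, so for readers on other platforms here is a rough, portable sketch of what the UTF-16 step produces for a single code point. utf16_encode is a hypothetical helper written for this illustration; it is not what the Windows API does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                      /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* BMP: one unit, same value */
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Characters outside the Basic Multilingual Plane, such as U+1D10B, take two 16-bit units, which is why a 16-bit wchar_t cannot hold every character in one element.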

ASCII characters in C

I'm trying to save a character from the cyrillic alphabet in a char.
When I take a string from the console it saves it in the char array successfully but just initializing it doesn't seem to work. I get "programName.exe has stopped working" when trying to run it.
#include <stdio.h>
#include <conio.h>
#include <string.h>
#include <Windows.h>
#include <stdlib.h>
void test(){
char test = 'Я';
printf("%s",test);
}
void main(){
SetConsoleOutputCP(1251);
SetConsoleCP(1251);
test();
}
fgets ( books[booksCount].bookTitle, 80, stdin ); // this seems to be working ok with ascii.
I tried using wchar_t but I get the same results.

If you're using Russian Windows which uses Windows-1251 codepage by default, you can print the character encoded as a single byte using the old printf but you need to make sure that the source code uses the same cp1251 charset. Don't save as Unicode.
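But the preferred way is to use wprintf with a wide-character string (the %c and %s conversions below rely on MSVC's behavior; standard C uses %lc and %ls for wide arguments):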
void test() {
wchar_t test_char = L'Я';
wchar_t *test_string = L"АБВГ"; // or LPCWSTR test_string
wprintf(L"%c\n%s", test_char, test_string);
}
This time you need to save the file as Unicode (UTF-8 or UTF-16).
UTF-8 may be better, but it's trickier on Windows. Moreover, if you use UTF-8 you cannot use a single char to store Я because it needs more than 1 byte. You must use a char array or char* string instead.
Note that main must return int, not void, and the fgets call above must be placed inside some function.
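To make the "more than 1 byte" point concrete, here is a small sketch. The helper name utf8_char_count is hypothetical, and the letter is written as explicit bytes so the example does not depend on how the source file is saved.

```c
#include <assert.h>
#include <stddef.h>

/* U+042F (Я) as explicit UTF-8 bytes: two bytes plus the terminator. */
static const char ya_utf8[] = "\xD0\xAF";

/* Hypothetical helper: counts characters in a valid UTF-8 string by
 * skipping continuation bytes (those of the form 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;                       /* this byte starts a new character */
    }
    return n;
}
```

Here sizeof ya_utf8 is 3 while utf8_char_count(ya_utf8) is 1, so the letter cannot fit in a single char when the text is UTF-8.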

This could be solved by doing:
void test()
{
char test = 'Я';
putchar(test);
}
But there is a catch: Since 'Я' is not an ASCII character, you might need to set appropriate locale before.
Moreover, only ASCII characters 32 - 126 are guaranteed to be printable, and the same symbol, on all systems.

How to find the built-in function to deal with char16_t in C?

Please tell me what the char16_t versions of the string manipulation functions are,
such as those listed here:
http://www.tutorialspoint.com/ansi_c/c_function_references.htm
I found many reference sites, but none of them mention this.
The printing function is the most important, because it helps me verify whether the manipulation functions work.
#include <stdio.h>
#include <uchar.h>
char16_t *u=u"α";
int main(int argc, char *argv[])
{
printf("%x\n",u[0]); // output 3b1, it is UTF16
wprintf("%s\n",u); //no output
_cwprintf("%s\n",u); //incorrect output
return 0;
}

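Standard C has no char16_t versions of the printing or string functions. To print, read, or write such text, convert it first: C11's <uchar.h> provides c16rtomb and mbrtoc16 for converting between char16_t and multibyte char strings, and mbsrtowcs then takes you from multibyte to wchar_t if you need the wide functions.
For all intents and purposes, char16_t is a multi-unit representation (one character may take two char16_t values), so you need the restartable conversion functions to work with this type.
A few answers used the L"..." prefix, which is incorrect for char16_t. 16-bit strings require the u"..." prefix.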
The following code gets you everything you need to work with 8, 16, and 32-bit string representations.
#include <string.h>
#include <wchar.h>
#include <uchar.h>
You can Google the procedures found in <wchar.h> if you don't have manual pages (UNIX).
Gnome.org's GLib has some great code for you to drop-in if overhead isn't an issue.
char16_t and char32_t are ISO C11 (iso9899:2011) extensions.
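As a hedged sketch of how a char16_t string could be made printable with plain printf, the following converts it to a multibyte string with C11's c16rtomb. c16_to_mb is a hypothetical helper name, and the result depends on the locale chosen with setlocale: characters that the locale's encoding cannot represent make the conversion fail.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: converts a NUL-terminated char16_t string to a
 * multibyte string in the current locale's encoding. Returns the number
 * of bytes written (excluding the terminator), or (size_t)-1 if a
 * character cannot be converted or dst is too small. */
size_t c16_to_mb(const char16_t *src, char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL && cap > 0);
    mbstate_t st;
    memset(&st, 0, sizeof st);         /* start in the initial state */
    char tmp[MB_LEN_MAX];
    size_t used = 0;
    for (; *src != 0; ++src) {
        /* Returns 0 for the first half of a surrogate pair and emits
         * the bytes once the second half arrives. */
        size_t n = c16rtomb(tmp, *src, &st);
        if (n == (size_t)-1)
            return (size_t)-1;
        if (used + n >= cap)
            return (size_t)-1;         /* keep one byte for the terminator */
        memcpy(dst + used, tmp, n);
        used += n;
    }
    dst[used] = '\0';
    return used;
}
```

After a successful call, the buffer can be passed to printf("%s\n", buf).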

wprintf and its wchar colleagues need to have the format string in wchar too:
wprintf( L"%s\n", u);
For wchar L is used as a prefix to the string literals.
Edit:
Here's a code snippet (tested on Windows):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <wchar.h>
int main(void)
{
wchar_t* a = L"α";
fflush(stdout); //must be done before _setmode
_setmode(_fileno(stdout), _O_U16TEXT); // set console mode to unicode
wprintf(L"alpha is:\n\t%s\n", a); // works for me :)
}
By default the console doesn't work in Unicode and prints a "?" for non-ASCII characters, which is why the _setmode call is needed. _setmode, _fileno and _O_U16TEXT are Microsoft-specific; on Linux, call setlocale(LC_ALL, "") instead and format wide strings with %ls.
Note: for Windows GUI output there is already proper support, so you can use wsprintf to format Unicode strings.

Eclipse error by debugging strlen

This program crashes at the point i=(strlen(data)); with the message
No source available for "strlen() "
But Why?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main (void) {
char data[]="Hallo";
char buffer[100];
if (strlen(data)!=0)
{
size_t i=0;
i=(strlen(data));
snprintf(buffer,i,"Data: %s \n",data);
return strlen(data)+1;
}
return -1;
}

The error message you cite does not sound like a crash. More like a debugger trying to step into a system library function.

I suspect the cause of the problem is
snprintf(buffer,i,"Data: %s \n",data);
The i here is the buffer size, but i is also the length of data. So you're telling snprintf that the buffer holds only 5 bytes while the formatted string is longer. The effect is that snprintf() truncates the output, so not the entire string will be written.
In fact, "Data: " is six characters long, which is already longer than i (5). So maybe what's happening is that snprintf never gets to the %s conversion, which somehow breaks the stack?
Try replacing i with sizeof(buffer) and see whether that works better.
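A small sketch of how the truncation can be detected instead of guessed at: snprintf returns the length the full string would have had, so comparing it with the buffer size tells you whether everything fit. format_data is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats "Data: <data> \n" into buf.
 * Returns 1 if the whole string fit, 0 if it was truncated or
 * formatting failed. buf is always NUL-terminated on truncation. */
int format_data(char *buf, size_t cap, const char *data)
{
    assert(buf != NULL && cap > 0 && data != NULL);
    int needed = snprintf(buf, cap, "Data: %s \n", data);
    return needed >= 0 && (size_t)needed < cap;
}
```

With a capacity of 5, the call for "Hallo" stores just "Data" plus the terminator and reports truncation; with sizeof(buffer) as the capacity, the whole line fits.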

I just ran this program in Eclipse, and it works fine. It sounds like you are stepping through the code line-by-line and when you get to the strlen call you do a "Step-Into"(F5) instead of "Step Over"(F6). So Eclipse is trying to debug strlen.
Either way, this is an Eclipse issue and I suggest you add an Eclipse tag to the question.

C Wide characters - how to use them?

I'm able to output a single character using this code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
main(){
setlocale(LC_CTYPE, "");
wchar_t a = L'Ö';
putwchar(a);
}
How can I adapt the code to output a string?
Something like
wchar_t *a = L"ÖÜÄöüä";
wprintf("%ls", a);

wprintf(L"%ls", str)

It's a bit tricky, you have to know what your internal wchar_ts mean. (See here for a little discussion.) Basically you should communicate with the environment via mbstowcs/wcstombs, and with data with known encoding via iconv (converting from and to WCHAR_T).
(The exception here is Windows, where you can't really communicate with the environment meaningfully, but you can access it in a wide version directly with Windows API functions, and you can write wide strings directly into message boxes etc.)
That said, once you have your internal wide string, you can convert it to the environment's multibyte string with wcstombs, or you can just use printf("%ls", mywstr); which performs the conversion for you. Just don't forget to call setlocale(LC_CTYPE, "") at the very beginning of your program.
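For the wcstombs route mentioned above, a minimal sketch could look like this. wide_to_locale is a hypothetical helper name, and the output encoding is whatever the locale selected by setlocale uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide string to a newly allocated
 * multibyte string in the current locale's encoding. Returns NULL if a
 * character cannot be represented or allocation fails. The caller
 * frees the result. */
char *wide_to_locale(const wchar_t *w)
{
    assert(w != NULL);
    size_t n = wcstombs(NULL, w, 0);   /* bytes needed, excludes terminator */
    if (n == (size_t)-1)
        return NULL;                   /* not representable in this locale */
    char *s = malloc(n + 1);
    if (s == NULL)
        return NULL;
    wcstombs(s, w, n + 1);             /* n + 1 leaves room for the terminator */
    return s;
}
```

The returned string can then be written with the ordinary byte-oriented functions such as puts or printf("%s", ...).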
