I am trying to format wchar_t* with UTF-8 characters using vsnprintf and then printing the buffer using printf.
Given the following code:
/*
This code is modified version of KB sample:
https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm
The usage of `setlocale` is required by my real-world scenario,
but can be modified if that fixes the issue.
*/
#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>
#ifdef MSVC
#include <windows.h>
#endif
void vout(char *string, char *fmt, ...)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
va_list arg_ptr;
va_start(arg_ptr, fmt);
vsnprintf(string, 100, fmt, arg_ptr);
va_end(arg_ptr);
}
int main(void)
{
setlocale(LC_ALL, "");
#ifdef MSVC
SetConsoleOutputCP(65001); // with or without; no dice
#endif
char string[100];
wchar_t arr[] = { 0x0119 };
vout(string, "%ls", arr);
printf("This string should have 'ę' (e with ogonek / tail) after colon: %s\n", string);
return 0;
}
I compiled with gcc v5.4 on Ubuntu 16 to get the desired output in BASH:
gcc test.c -o test_vsn
./test_vsn
This string should have 'ę' (e with ogonek / tail) after colon: ę
However, on Windows 10 with CL v19.10.25019 (VS 2017), I get weird output in CMD:
cl test.c /Fetest_vsn /utf-8
.\test_vsn
This string should have 'T' (e with ogonek / tail) after colon: e
(the ę before colon becomes T and after the colon is e without ogonek)
Note that I used CL's new /utf-8 switch (introduced in VS 2015), which apparently has no effect with or without. Based on their blog post:
There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”.
(my source file already has BOM / utf8'ness and execution-charset is apparently not helping)
What could be the minimal amount of changes to the code / compiler switches to make the output look identical to that of gcc?
Based on #RemyLebeau's comment, I modified the code to use w variant of the printf APIs to get the output identical with msvc on Windows, matching that of gcc on Unix.
Additionally, instead of changing codepage, I have now used _setmode (FILE translation mode).
/*
This code is modified version of KB sample:
https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm
The usage of `setlocale` is required by my real-world scenario,
but can be modified if that fixes the issue.
*/
#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>
#ifdef _WIN32
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
#endif
void vout(wchar_t *string, wchar_t *fmt, ...)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
va_list arg_ptr;
va_start(arg_ptr, fmt);
vswprintf(string, 100, fmt, arg_ptr);
va_end(arg_ptr);
}
int main(void)
{
setlocale(LC_ALL, "");
#ifdef _WIN32
int oldmode = _setmode(_fileno(stdout), _O_U16TEXT);
#endif
wchar_t string[100];
wchar_t arr[] = { 0x0119, L'\0' };
vout(string, L"%ls", arr);
wprintf(L"This string should have 'ę' (e with ogonek / tail) after colon: %ls\r\n", string);
#ifdef _WIN32
_setmode(_fileno(stdout), oldmode);
#endif
return 0;
}
Alternatively, we can use fwprintf and provide stdout as first argument. To do the same with fwprintf(stderr,format,args) (or perror(format, args)), we would need to _setmode the stderr as well.
Related
The following program can be compiled using msvc or mingw. However, the mingw version cannot display unicode correctly. Why? How can I fix that?
Code:
#include <stdio.h>
#include <windows.h>
#include <io.h>
#include <fcntl.h>
int wmain(void)
{
_setmode(_fileno(stdout), _O_U16TEXT);
_putws(L"哈哈哈");
system("pause");
return 0;
}
Mingw64 Compile Command:
i686-w64-mingw32-gcc -mconsole -municode play.c
MSVC Compiled:
Mingw Compiled:
Edit:
After some testing, the problem seems not causing by mingw. If I run the program directly by double clicking the app. The unicode string cannot be displayed correct either. The code page however, is the same, 437.
It turns out the problem is related to console font instead of the compiler. See the following demo code for changing console font.
This is happening because of missing #define UNICODE & #define _UNICODE . You should try adding it along with other headers. The _UNICODE symbol is used with headers such as tchar.h to direct standard C functions such as printf() and fopen() to the Unicode versions.
Please Note - The -municode option is still required when linking if Unicode mode is used.
After doing some research, it turns out the default console font does not support chainese glyphs. One can change the console font by using SetCurrentConsoleFontEx function.
Demo Code:
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <windows.h>
#define FF_SIMHEI 54
int main(int argc, char const *argv[])
{
CONSOLE_FONT_INFOEX cfi = {0};
cfi.cbSize = sizeof(CONSOLE_FONT_INFOEX);
cfi.nFont = 0;
cfi.dwFontSize.X = 8;
cfi.dwFontSize.Y = 16;
cfi.FontFamily = FF_SIMHEI;
cfi.FontWeight = FW_NORMAL;
wcscpy(cfi.FaceName, L"SimHei");
SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &cfi);
/* UTF-8 String */
SetConsoleOutputCP(CP_UTF8); /* Thanks for Eryk Sun's notice: Remove this line if you are using windows 7 or 8 */
puts(u8"UTF-8你好");
/* UTF-16 String */
_setmode(_fileno(stdout), _O_U16TEXT);
_putws(L"UTF-16你好");
system("pause");
return 0;
}
I am still a rookie with C, and even newer to wide chars in C.
The below code should show
4 points to Smurfs
but it shows
4 points to Smurfs
In gdb I see this:
(gdb) p buffer
$1 = L" 4 points to Smurfs",
But when I copy paste from the console, the spaces are magically gone:
(gdb) p buffer
$1 = L"4 points to Smurfs",
Also, buffer[0] contains this according to gdb:
65279 L' '
Apparently the character in question  is the Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF). I retyped the code making sure I did not enter this. I don't know where this comes from. I also opened the code in notepad per https://stackoverflow.com/a/9691839/7602 and there is no extra chars there.
I wouldn't care if ncurses would stop showing this as a space.
Code (heavily cut down):
#include <time.h>
#include <stdio.h>
#include <errno.h>
#include <wchar.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <locale.h>
#define NCURSES_WIDECHAR 1
#include <ncursesw/ncurses.h>
#include "types.h"
#include "defines.h"
#include "externs.h"
WINDOW * term;
/*row column color n arguments */
void rccn(int row, int col, const wchar_t *fmt, ...)
{
wchar_t buffer[80];
int size;
va_list args;
va_start(args, fmt);
size = vswprintf(buffer, 80, fmt, args);
va_end( args );
if(size >= 80){
mvaddwstr(row, col, L"Possible hacker detected!");
}else{
mvaddwstr(row, col, buffer);
}
}
int main(void)
{
int ch;
setlocale(LC_ALL,"");
term = initscr();
rccn(1,1,L"%i points to %ls",4,L"Smurfs");
ch = getch();
return EXIT_SUCCESS;
}
The problem goes 'away' with
rccn(1,1,L"%i points to %ls",4,L"Smurfs"+1);
As if the wide encoding of the constant adds that char in front..
Found it..
I had followed a tutorial where it was advised to add this compiler flag:
-fwide-exec-charset=utf-32
My code was not running on Cygwin at all, and I read that Windows is utf-16 centered, so I removed that compiler flag and it started working on Cygwin.
Then out of curiosity I removed the compiler flag on Raspbian, and it is now working as expected there as well, no more byte order marks.
I have seen a few questions on how to print these characters but none of the methods appear to be working. I suspect it is because I making a Win32 console application based on some of the comments I read.
Here is an example of what I have tried in my code currently. It only prints question mark boxes, or if I change it around I get question marks or random symbols.
I have tried defining these at the top.
#define SPADE '\x06'
#define CLUB '\x05'
#define HEART '\x03'
#define DIAMOND '\x04'
inside function, these are some of the things I've tried. I have left S,D,H,C in case I can't figure it out.
printf("%lc", SPADE);
//printf("♠");
//printf("S");
printf("%lc", HEART);
//printf("♥");
//printf("H");
printf("%lc", DIAMOND);
//printf("♦");
//printf("D");
printf("%lc", CLUB);
//printf("♣");
//printf("C");
UTF-16 wchar_t and wide characters functions are needed in Windows.
#include <windows.h>
int main()
{
DWORD n;
HANDLE hout = GetStdHandle(STD_OUTPUT_HANDLE);
const wchar_t *buf = L"♠♥♦♣\n";
WriteConsoleW(hout, buf, wcslen(buf), &n, 0);
return 0;
}
The following code will compile with Visual Studio:
#include <stdio.h>
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"♠♥♦♣\n");
return 0;
}
After setting the mode to UTF-16, you have to call _setmode(_fileno(stdout), _O_TEXT) if you wish to use printf again.
I have a simple code and argv[1] is "Привет".
#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <locale.h>
int _tmain(int argc, TCHAR* argv[])
{
TCHAR buf[100];
_fgetts(buf, 100, stdin);
_tprintf(TEXT("\nargv[1] %s\n"), argv[1]);
_tprintf(TEXT("%s\n"), buf);
}
In the console, I write "Мир" and have this result:
If I use setlocale(LC_ALL, ""), I have this result:
What should I do to get the correct string in both cases?
Evidently your program works, except it cannot print correctly on the console window. This is because Windows console is not fully compatible with Unicode. Use _setmode for Visual Studio. This should work for Russian but there could be additional problems with some Asian languages. Use WriteConsole for other compilers.
Visual Studio Example:
#include <stdio.h>
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
int wmain(int argc, wchar_t* argv[])
{
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"%s", L"Привет\n");
return 0;
}
I was looking for ways to convert unicode codepoints to utf8.
So far, I've learned I can do it manually or use iconv.
I also thought wctomb would work, but it doesn't:
#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>
#define CENTER_UTF8 "\xf0\x9d\x8c\x86"
#define CENTER_UNICODE 0x1D306
int main(int argc, char** argv)
{
puts(CENTER_UTF8); //OK
static char buf[10];
int r;
#define WCTOMB(What) \
wctomb(NULL,0); \
r=wctomb(buf,What); \
puts(buf); \
printf("r=%d\n", r);
//Either one fails with -1
WCTOMB(CENTER_UNICODE);
WCTOMB(htonl(CENTER_UNICODE));
}
Could someone please explain to me why wctomb won't convert a unicode codepoint to utf8. I'm on Linux with a utf8 locale.
You should change program locale properly before using of wctomb():
#include <locale.h>
/* ... */
setlocale(LC_ALL, "");
This sets up program locale setting according to your environment. man setlocale
If locale is an empty string, "", each part of the locale that should
be modified is set according to the environment variables.
P.S. Actually LC_CTYPE is enough for wctomb().