I was looking for ways to convert Unicode code points to UTF-8.
So far, I've learned I can do it manually or use iconv.
I also thought wctomb would work, but it doesn't:
#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>

#define CENTER_UTF8 "\xf0\x9d\x8c\x86"
#define CENTER_UNICODE 0x1D306

int main(int argc, char** argv)
{
    puts(CENTER_UTF8); // OK
    static char buf[10];
    int r;
#define WCTOMB(What) \
    wctomb(NULL, 0); \
    r = wctomb(buf, What); \
    puts(buf); \
    printf("r=%d\n", r);

    // Either one fails with -1
    WCTOMB(CENTER_UNICODE);
    WCTOMB(htonl(CENTER_UNICODE));
}
Could someone please explain to me why wctomb won't convert a Unicode code point to UTF-8? I'm on Linux with a UTF-8 locale.
You should set the program's locale properly before using wctomb():
#include <locale.h>
/* ... */
setlocale(LC_ALL, "");
This sets up the program's locale according to your environment. From man setlocale:
If locale is an empty string, "", each part of the locale that should
be modified is set according to the environment variables.
P.S. Actually LC_CTYPE is enough for wctomb().
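For reference, here is a minimal sketch of the fix applied to the question's program, assuming a UTF-8 locale in the environment and a platform such as Linux/glibc where wchar_t holds Unicode code points directly:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h> /* MB_LEN_MAX */
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, ""); /* pick up the UTF-8 locale from the environment */
    char buf[MB_LEN_MAX];
    int r = wctomb(buf, (wchar_t)0x1D306); /* the question's CENTER_UNICODE */
    printf("r=%d\n", r);                   /* expect r=4 under a UTF-8 locale */
    if (r > 0)
        fwrite(buf, 1, r, stdout);
    return 0;
}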
I want to use Unicode blocks in my C program to display them in the console, like ▇, ░ and so on. However, whenever I try to use the escape sequence for a Unicode character, I only get weird letters like:
printf("\u259A"); // 259A is the Unicode for ▚
Output: ÔûÜ
I looked up how to include Unicode characters and then tried to use wchar_t:
#include <locale.h>
#include <stdio.h>

int main(int argc, char const *argv[]) {
    setlocale(LC_ALL, "");
    wchar_t c = "\u259A";
    printf("%c", c);
    return 0;
}
but that only gave me ☺ as the output instead of ▚. Removing setlocale() gave me blank output. I don't know what to do from this point on. The only thing I saw was using printf("\xB2");, which gave ▓. But I don't understand where the B2 comes from or what it stands for.
So what worked for me was the following:
#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main(int argc, char const *argv[]) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x2590 \x2554 \x258c \x2592"); // Output: ▐ ╔ ▌ ▒
    return 0;
}
The function _setmode() sets the console stream to UTF-16 text mode. wprintf() allows you to print wide characters (Unicode as well). The L"" prefix before the string tells the compiler that the following string is a wide (Unicode) string. (As for the earlier \xB2: that worked because 0xB2 is the code for ▓ in the console's OEM code page, e.g. CP437, not a Unicode escape.) Thanks to everyone for their time and answers!
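For completeness, a minimal sketch of the same approach applied to the ▚ (U+259A) character from the question; this assumes a Windows console, since _setmode and _O_U16TEXT are Microsoft CRT features:
#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT); /* put stdout in UTF-16 text mode */
    wprintf(L"\u259A\n");                  /* U+259A QUADRANT UPPER LEFT AND LOWER RIGHT: ▚ */
    return 0;
}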
The following program can be compiled with MSVC or MinGW. However, the MinGW version cannot display the Unicode string correctly. Why? How can I fix that?
Code:
#include <stdio.h>
#include <stdlib.h> // for system()
#include <windows.h>
#include <io.h>
#include <fcntl.h>

int wmain(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    _putws(L"哈哈哈");
    system("pause");
    return 0;
}
Mingw64 Compile Command:
i686-w64-mingw32-gcc -mconsole -municode play.c
MSVC build (screenshot): the string displays correctly.
MinGW build (screenshot): the string is garbled.
Edit:
After some testing, the problem does not seem to be caused by MinGW. If I run the program directly by double-clicking the app, the Unicode string is not displayed correctly either. The code page, however, is the same: 437.
It turns out the problem is related to the console font rather than the compiler. See the demo code below for changing the console font.
This is happening because of the missing #define UNICODE and #define _UNICODE. You should try adding them along with the other headers. The _UNICODE symbol is used by headers such as tchar.h to direct standard C functions such as printf() and fopen() to their Unicode versions.
Please note: the -municode option is still required when linking if Unicode mode is used.
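For example, the MinGW compile command from the question would then become (the added defines are the only change):
i686-w64-mingw32-gcc -mconsole -municode -DUNICODE -D_UNICODE play.c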
After doing some research, it turns out the default console font does not support Chinese glyphs. One can change the console font using the SetCurrentConsoleFontEx function.
Demo Code:
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif
#include <stdio.h>
#include <stdlib.h> /* system */
#include <io.h>
#include <fcntl.h>
#include <windows.h>

#define FF_SIMHEI 54

int main(int argc, char const *argv[])
{
    CONSOLE_FONT_INFOEX cfi = {0};
    cfi.cbSize = sizeof(CONSOLE_FONT_INFOEX);
    cfi.nFont = 0;
    cfi.dwFontSize.X = 8;
    cfi.dwFontSize.Y = 16;
    cfi.FontFamily = FF_SIMHEI;
    cfi.FontWeight = FW_NORMAL;
    wcscpy(cfi.FaceName, L"SimHei");
    SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &cfi);

    /* UTF-8 string */
    SetConsoleOutputCP(CP_UTF8); /* Thanks to Eryk Sun's notice: remove this line if you are using Windows 7 or 8 */
    puts(u8"UTF-8你好");

    /* UTF-16 string */
    _setmode(_fileno(stdout), _O_U16TEXT);
    _putws(L"UTF-16你好");

    system("pause");
    return 0;
}
I have seen a few questions on how to print these characters, but none of the methods appear to be working. Based on some of the comments I read, I suspect it is because I am making a Win32 console application.
Here is an example of what I have tried in my code currently. It only prints question-mark boxes, or if I change it around, I get question marks or random symbols.
I have tried defining these at the top.
#define SPADE '\x06'
#define CLUB '\x05'
#define HEART '\x03'
#define DIAMOND '\x04'
Inside the function, these are some of the things I've tried. I have left S, D, H, C in as a fallback in case I can't figure it out.
printf("%lc", SPADE);
//printf("♠");
//printf("S");
printf("%lc", HEART);
//printf("♥");
//printf("H");
printf("%lc", DIAMOND);
//printf("♦");
//printf("D");
printf("%lc", CLUB);
//printf("♣");
//printf("C");
UTF-16 wchar_t and the wide-character functions are needed on Windows. WriteConsoleW writes the UTF-16 text directly to the console, bypassing the C runtime's code-page translation:
#include <windows.h>
#include <wchar.h> /* wcslen */

int main(void)
{
    DWORD n;
    HANDLE hout = GetStdHandle(STD_OUTPUT_HANDLE);
    const wchar_t *buf = L"♠♥♦♣\n";
    WriteConsoleW(hout, buf, wcslen(buf), &n, 0);
    return 0;
}
The following code will compile with Visual Studio:
#include <stdio.h>
#include <io.h>    // for _setmode
#include <fcntl.h> // for _O_U16TEXT

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"♠♥♦♣\n");
    return 0;
}
After setting the mode to UTF-16, you have to call _setmode(_fileno(stdout), _O_TEXT) if you wish to use printf again.
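A minimal sketch of that round trip, saving the mode that _setmode returns and restoring it before calling printf (the printed strings are just placeholders):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
    int oldmode = _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"♠♥♦♣\n");
    fflush(stdout);                     /* flush before switching modes */
    _setmode(_fileno(stdout), oldmode); /* back to the previous (text) mode */
    printf("plain printf works again\n");
    return 0;
}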
I am trying to format a wchar_t* containing non-ASCII characters using vsnprintf and then print the buffer using printf.
Given the following code:
/*
  This code is a modified version of the KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm
  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/
#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>
#ifdef MSVC
#include <windows.h>
#endif

void vout(char *string, char *fmt, ...)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    va_list arg_ptr;
    va_start(arg_ptr, fmt);
    vsnprintf(string, 100, fmt, arg_ptr);
    va_end(arg_ptr);
}

int main(void)
{
    setlocale(LC_ALL, "");
#ifdef MSVC
    SetConsoleOutputCP(65001); // with or without; no dice
#endif
    char string[100];
    wchar_t arr[] = { 0x0119 };
    vout(string, "%ls", arr);
    printf("This string should have 'ę' (e with ogonek / tail) after colon: %s\n", string);
    return 0;
}
I compiled with gcc v5.4 on Ubuntu 16 and got the desired output in bash:
gcc test.c -o test_vsn
./test_vsn
This string should have 'ę' (e with ogonek / tail) after colon: ę
However, on Windows 10 with CL v19.10.25019 (VS 2017), I get weird output in CMD:
cl test.c /Fetest_vsn /utf-8
.\test_vsn
This string should have 'T' (e with ogonek / tail) after colon: e
(the ę before the colon becomes T, and after the colon there is an e without the ogonek)
Note that I used CL's new /utf-8 switch (introduced in VS 2015), which apparently has no effect, with or without it. Based on their blog post:
There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”.
(my source file already has a BOM / UTF-8'ness, and execution-charset is apparently not helping)
What would be the minimal set of changes to the code or compiler switches to make the output identical to that of gcc?
Based on #RemyLebeau's comment, I modified the code to use the w variants of the printf APIs, which makes the MSVC output on Windows identical to that of gcc on Unix.
Additionally, instead of changing the code page, I now use _setmode (FILE translation mode).
/*
  This code is a modified version of the KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm
  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/
#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>
#ifdef _WIN32
#include <io.h>    // for _setmode
#include <fcntl.h> // for _O_U16TEXT
#endif

void vout(wchar_t *string, wchar_t *fmt, ...)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    va_list arg_ptr;
    va_start(arg_ptr, fmt);
    vswprintf(string, 100, fmt, arg_ptr);
    va_end(arg_ptr);
}

int main(void)
{
    setlocale(LC_ALL, "");
#ifdef _WIN32
    int oldmode = _setmode(_fileno(stdout), _O_U16TEXT);
#endif
    wchar_t string[100];
    wchar_t arr[] = { 0x0119, L'\0' };
    vout(string, L"%ls", arr);
    wprintf(L"This string should have 'ę' (e with ogonek / tail) after colon: %ls\r\n", string);
#ifdef _WIN32
    _setmode(_fileno(stdout), oldmode);
#endif
    return 0;
}
Alternatively, we can use fwprintf and pass stdout as the first argument. To do the same with fwprintf(stderr, format, args), we would need to _setmode stderr as well.
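A minimal sketch of that stderr variant (again Windows-only, since _setmode and _O_U16TEXT are Microsoft CRT features):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
    _setmode(_fileno(stderr), _O_U16TEXT); /* stderr needs its own mode switch */
    fwprintf(stderr, L"%ls\n", L"\x0119"); /* prints ę to stderr */
    return 0;
}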
I'm trying to print this MEDIUM SHADE Unicode box character in C: ▒
(I'm doing the exercises in K&R and got sidetracked on the one about making a histogram...). I know my Unix terminal (Mac OS X) can display the box, because I saved a text file containing it, ran cat textfilewithblock, and it printed the block.
Initially I tried:
#include <stdio.h>
#include <wchar.h>

int main(void) {
    wprintf(L"▒\n");
    return 0;
}
and nothing printed
iMac-2$ ./a.out
iMac-2:clang vik$
I did a search and found this: unicode hello world for C?
It seems I still have to set a locale (even though the executing environment is UTF-8? I'm still trying to figure out why this step is necessary). But anyway, it works! (After a bit of a struggle, I finally realized that the proper string was en_US.UTF-8 rather than the en_US.utf8 I had read somewhere.)
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "en_US.UTF-8");
    wprintf(L"▒\n");
    return 0;
}
Output is as follows:
iMac-2$ ./a.out
▒
iMac-2$
But when I try the following code, putting in the UTF-8 hex bytes for the box (0xe29692, taken from http://www.utf8-chartable.de/unicode-utf8-table.pl?start=9472&unicodeinhtml=dec) rather than pasting in the box itself, it doesn't work again.
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t box = 0xe29692;
    wprintf(L"%lc\n", box);
    return 0;
}
I'm clearly missing something but can't quite figure out what it is.
The Unicode value of the MEDIUM SHADE code point is not 0xe29692, it is 0x2592. <E2><96><92> is the 3-byte encoding of this code point in UTF-8.
You can print this thing either using the wide char APIs:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t box = 0x2592;
    wprintf(L"%lc\n", box); // or simply printf("%lc\n", box);
    return 0;
}
Or simply by printing the UTF-8 encoding directly:
#include <stdio.h>

int main(void) {
    printf("\xE2\x96\x92\n");
    return 0;
}
Or if your text editor encodes the source file in UTF-8:
#include <stdio.h>

int main(void) {
    printf("▒\n");
    return 0;
}
But be aware that putchar('▒'); will not work: putchar() writes a single byte, while '▒' is a multi-byte character constant whose value is implementation-defined.
Also, for full Unicode support and a few more goodies, I recommend using iTerm2 on macOS.
The box character is U+2592, which translates to 0xE2 0x96 0x92 in UTF-8. This adaptation of your third program mostly works for me:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t box = 0xe29692;
    wprintf(L"%lc\n", box);
    wprintf(L"\n\nX\n\n");
    box = L'\u2592'; // 0xE2 0x96 0x92 = U+2592
    wprintf(L"%lc\n", box);
    wprintf(L"\n\n0x%.8X\n\n", box);
    box = 0x2592;
    wprintf(L"%lc\n", box);
    return 0;
}
The output I get is:
X
▒
0x00002592
▒
The first print operation produces nothing of use, because 0xe29692 is far outside the range of valid code points and is not a valid wide character, so the %lc conversion fails; the others work.
Testing on Mac OS X 10.10.5. I happen to be compiling with GCC 5.3.0 (which I built myself), but I got the same output with Xcode 7.0.2 and clang.