C Wide characters - how to use them? - c

I'm able to output a single character using this code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main(void){
setlocale(LC_CTYPE, "");
wchar_t a = L'Ö';
putwchar(a);
}
How can I adapt the code to output a string?
Something like
wchar_t *a = L"ÖÜÄöüä";
wprintf("%ls", a);

wprintf(L"%ls", str)

It's a bit tricky, you have to know what your internal wchar_ts mean. (See here for a little discussion.) Basically you should communicate with the environment via mbstowcs/wcstombs, and with data with known encoding via iconv (converting from and to WCHAR_T).
(The exception here is Windows, where you can't really communicate with the environment meaningfully, but you can access it in a wide version directly with Windows API functions, and you can write wide strings directly into message boxes etc.)
That said, once you have your internal wide string, you can convert it to the environment's multibyte string with wcstombs, or you can just use printf("%ls", mywstr); which performs the conversion for you. Just don't forget to call setlocale(LC_CTYPE, "") at the very beginning of your program.
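As a rough sketch of the wcstombs route mentioned above (the helper name wide_to_multibyte is made up for illustration, and it relies on the POSIX/MSVC behaviour that a NULL destination makes wcstombs report the required length):

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated multibyte
 * string in the encoding of the current LC_CTYPE locale.
 * Returns NULL if a character cannot be represented or allocation fails.
 * The caller frees the result. */
char *wide_to_multibyte(const wchar_t *wide)
{
    /* Assumption: with a NULL destination, wcstombs only reports the
     * required length (POSIX and MSVC behave this way). */
    size_t needed = wcstombs(NULL, wide, 0);
    if (needed == (size_t)-1)
        return NULL;

    char *out = malloc(needed + 1);
    if (out == NULL)
        return NULL;

    /* Second pass performs the conversion and writes the terminator. */
    wcstombs(out, wide, needed + 1);
    return out;
}
```

Call setlocale(LC_CTYPE, "") first; in the default "C" locale a string such as L"ÖÜÄöüä" will typically fail to convert and the helper returns NULL.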

Related

Why does printing this wide character string crash on windows?

I stumbled upon a problem while going through some unit tests, and I am not entirely sure why the following simple example crashes on the line with sprintf (Using Windows with Visual Studio 2019).
#include <stdio.h>
#include <locale.h>
int main()
{
setlocale(LC_ALL, "en_US.utf8");
char output[255];
sprintf(output, "simple %ls text", L"\u00df\U0001d10b");
return 0;
}
Is there something wrong with the code?
On Windows, char is 8 bits and wchar_t is 16 bits. To convert between the two, use functions such as MultiByteToWideChar (char to wchar_t) and WideCharToMultiByte (wchar_t to char).
Passing a wide string to a multibyte function can overflow the destination buffer, which might be the cause of your crash.
Try using swprintf_s instead.
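Whatever the actual cause of the crash turns out to be, a bounded call with a checked return value at least rules out the overflow. This is only a sketch using standard snprintf (the wrapper name format_wide is invented here), not a confirmed fix for the Visual Studio behaviour:

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical wrapper: format a wide string into a byte buffer.
 * Returns the number of bytes written (excluding the terminator), or -1 if
 * the buffer is too small or the wide string cannot be converted to the
 * current locale's multibyte encoding. */
int format_wide(char *out, size_t cap, const wchar_t *text)
{
    /* snprintf never writes more than cap bytes, terminator included. */
    int n = snprintf(out, cap, "simple %ls text", text);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

A negative result from snprintf with %ls usually means a wide character had no representation in the current locale, so it is worth checking rather than ignoring.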

Multi-platform Unicode handling based on char* in C without using 3rd party libraries?

The following are bare minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:
Linux:
#include <stdio.h>
int main() {
char* str = "Rölf";
printf("%s\n", str);
}
Windows:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
wprintf(L"%s\n", str);
}
Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".
And that would be great - have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this would actually work, it would be amazing. But it doesn't:
char* str = u8"Rölf";
int len = mbstowcs(NULL, str, 0) + 1;
wchar_t wstr[len];
mbstowcs(wstr, str, len);
wprintf(L"%s\n", wstr);
I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue - the following doesn't work either (well - the L"" literal does. The converted u8 literal doesn't):
MessageBoxW(NULL, wstr, L"Rölf", MB_OK);
Am I misunderstanding the conversion process? Is there a way to make to this work? (Without using e.g. ICU)
The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding is UTF-8). On Windows 10 builds from the April 2018 beta onward, you can actually set Windows to use UTF-8 as the encoding for plain char[] strings, either as a global setting or presumably by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
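Note also that the snippet in the question stores the result of mbstowcs in an int and adds 1, so a failed conversion, which returns (size_t)-1, silently becomes a length of 0. A more defensive version might look like this (multibyte_to_wide is a hypothetical name, and the NULL-destination length query is POSIX/MSVC behaviour rather than something ISO C promises):

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a string in the current locale's multibyte
 * encoding to a newly allocated wide string. Returns NULL when the input
 * is not valid in that encoding or allocation fails. Caller frees. */
wchar_t *multibyte_to_wide(const char *mb)
{
    /* Assumption: a NULL destination makes mbstowcs report the length. */
    size_t needed = mbstowcs(NULL, mb, 0);
    if (needed == (size_t)-1)
        return NULL;            /* invalid sequence for this locale */

    wchar_t *out = malloc((needed + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    mbstowcs(out, mb, needed + 1);
    return out;
}
```

This still only decodes UTF-8 correctly when the locale's encoding is UTF-8; the check just makes the failure visible instead of producing garbage.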
Anyway, your second version of the code, which you called "Windows", should work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: they have the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio (doing so invokes undefined behavior). So better would be:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
printf("%ls\n", str);
}
and this version should work correctly on Windows and standards-conforming C implementations, since for the byte-oriented printf functions, MSVC doesn't have the meaning of %s and %ls reversed.
If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to either:
Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode codepoints with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t (C explicitly forbids "multi-wchar_t characters" because the mb/wc APIs are inherently incompatible with them), not Unicode codepoint values (equivalent to UTF-32).
Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to the locale's multibyte encoding, and mbstowcs to get from there to wchar_t[].
Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
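For the "your own UTF-8 decoder" part of the first two options, a minimal sketch could look like the following. The function name utf8_decode is made up, and uint32_t is used in place of char32_t so that it compiles even where <uchar.h> is missing:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: read one UTF-8 sequence from the NUL-terminated
 * string s. Stores the code point in *cp and returns the number of bytes
 * consumed, or 0 if the sequence is malformed, overlong, encodes a
 * surrogate, or lies above U+10FFFF. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c;
    size_t len, i;

    if (p[0] < 0x80)                { c = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else return 0;   /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* A terminating NUL fails this check too, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Calling it in a loop and advancing by the return value walks the string one code point at a time; what you then do with each code point depends on which of the options above you pick.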
UTF8 option for Windows 10, version 1803+
Thanks to Barmak Shemirani making me aware of MultiByteToWideChar, I've found a solution to this that even conforms to C99 (and works on Windows 7, by the way).
Note that setlocale() is only necessary for console output to render correctly. I left it out to highlight that it doesn't seem to be needed for GUI-related API calls.
#define UNICODE
#define _UNICODE
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <windows.h>
//#include <locale.h>
wchar_t* toWide(char* str) {
int wchars_num = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
wchar_t* wstr = (wchar_t*)malloc(sizeof(wchar_t) * wchars_num);
MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wchars_num);
return wstr;
}
int main() {
// For output in console to render correctly - as far as the font allows anyway...
//setlocale(LC_ALL, "");
// PLATFORM-AGNOSTIC DATA STRUCTURE WITH UTF-8 TEXT
// (Usually not directly next to the platform-specific API calls...)
char* str = "Rölf";
// PLATFORM-SPECIFIC TEXT HANDLING
wchar_t* wstr = toWide(str);
printf("%ls\n", wstr);
MessageBox(NULL, wstr, L"Rölf", MB_OK);
free(wstr);
}
The way I use it is that I declare a data structure to be filled by my users where all text is char* and assumed to be UTF-8. Then in my library, I use platform-specific UI APIs. And in the case of Windows, doing the above UTF-16 conversion is obviously necessary.
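To illustrate what "UTF-16 conversion" means at the level of a single code point, here is a small portable sketch. It is not how MultiByteToWideChar is implemented, just the standard surrogate-pair arithmetic; the name utf16_encode is invented for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: write one Unicode code point as UTF-16.
 * Stores one or two 16-bit units in out and returns how many were written,
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {          /* Basic Multilingual Plane: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;               /* 20 bits remain, split 10/10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

Characters like ö fit in a single unit, which is why the simple "Rölf" case hides the fact that wchar_t on Windows can need two elements per character.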

wide characters transliteration C with towctrans

I would like to convert (transliterate) UTF-8 characters to be closest match in ASCII in C. Characters like ú is transliterated to u. I can do that with iconv, with iconv -f utf-8 -t ascii//TRANSLIT, on the command line.
In C, there is a function towctrans to do that, but I only found documentation about two possible transliterations: to lower case and to upper case (see man wctrans). On the documentation, wctrans depends on LC_CTYPE. But what other function (other than "tolower" and "toupper") are available for a specific LC_CTYPE value?
A simple example with towctrans and the basic toupper transliteration:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
wchar_t frase[] = L"Amélia"; int i;
setlocale(LC_ALL, "");
for (i=0; i < wcslen(frase); i++) {
printf("%lc --> %lc\n", frase[i], towctrans(frase[i], wctrans("toupper")));
}
}
I know I can do this conversion with libiconv, but I was trying to find out possible already defined wctrans functions.
While the standard allows for implementation-defined or locale-defined transformations via wctrans, I'm not aware of any existing implementations that offer such a feature, and it's certainly not widespread. The iconv approach of //TRANSLIT is also non-standard and in fact conflicts with the standard: POSIX requires a charset name containing a slash character to be interpreted as a pathname to a charmap file, so use of slash for specifying translit-mode is non-conforming.
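If you want to probe what a given locale offers, wctrans returns zero for names it does not know, so you can test for a mapping before using it. A small sketch (map_or_keep and the mapping name passed by the caller are just illustrations):

```c
#include <wctype.h>

/* Hypothetical helper: apply a named character mapping, leaving c
 * unchanged when the current locale has no mapping with that name. */
wint_t map_or_keep(wint_t c, const char *mapping_name)
{
    /* wctrans yields zero when the name is not valid in this locale. */
    wctrans_t mapping = wctrans(mapping_name);
    if (mapping == (wctrans_t)0)
        return c;
    return towctrans(c, mapping);
}
```

Only "tolower" and "toupper" are guaranteed by the standard; any other name that works is specific to your C library and locale.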

How to find the built-in function to deal with char16_t in C?

Please tell me what the char16_t versions of the string manipulation functions are,
such as those listed at:
http://www.tutorialspoint.com/ansi_c/c_function_references.htm
I found many reference sites, but none of them mention this.
The printing functions are the most important, because they help me verify whether the manipulation functions work.
#include <stdio.h>
#include <uchar.h>
char16_t *u=u"α";
int main(int argc, char *argv[])
{
printf("%x\n",u[0]); // output 3b1, it is UTF16
wprintf("%s\n",u); //no ouput
_cwprintf("%s\n",u); //incorrect output
return 0;
}
To print, read, or write char16_t text, you need to convert it first: <uchar.h> provides c16rtomb and mbrtoc16 to convert between char16_t and the locale's multibyte encoding, and mbsrtowcs then converts a multibyte string to wide characters.
For all intents and purposes, char16_t is a multi-unit representation, so you need the restartable ("r") conversion functions to work with this integral type.
A few answers used the L prefix, which is incorrect for this type. 16-bit strings require the u prefix.
The following code gets you everything you need to work with 8, 16, and 32-bit string representations.
#include <string.h>
#include <wchar.h>
#include <uchar.h>
You can Google the procedures found in <wchar.h> if you don't have manual pages (UNIX).
Gnome.org's GLib has some great code for you to drop-in if overhead isn't an issue.
char16_t and char32_t were added in ISO C11 (ISO/IEC 9899:2011).
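As a hedged sketch of the conversion step, this turns a char16_t string into the locale's multibyte encoding with c16rtomb, after which plain printf("%s") can print it. The helper name utf16_to_multibyte is made up, and the code assumes the implementation's char16_t is UTF-16 and that <uchar.h> is available:

```c
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated char16_t string to a newly
 * allocated multibyte string in the current locale's encoding.
 * Returns NULL on conversion or allocation failure. Caller frees. */
char *utf16_to_multibyte(const char16_t *s)
{
    size_t units = 0;
    while (s[units] != 0)
        units++;

    /* Each 16-bit unit produces at most MB_CUR_MAX bytes. */
    char *out = malloc(units * MB_CUR_MAX + 1);
    if (out == NULL)
        return NULL;

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    size_t used = 0;
    for (size_t i = 0; i < units; i++) {
        size_t n = c16rtomb(out + used, s[i], &state);
        if (n == (size_t)-1) {         /* not representable in this locale */
            free(out);
            return NULL;
        }
        used += n;   /* n is 0 after the first half of a surrogate pair */
    }
    out[used] = '\0';
    return out;
}
```

As with the wide-character functions, call setlocale(LC_ALL, "") beforehand, otherwise non-ASCII input such as u"α" will generally be rejected.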
wprintf and its wchar colleagues need to have the format string as a wide string too:
wprintf( L"%s\n", u);
For wchar L is used as a prefix to the string literals.
Edit:
Here's a code snippet (tested on Windows):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <wchar.h>
int main(void)
{
wchar_t* a = L"α";
fflush(stdout); //must be done before _setmode
_setmode(_fileno(stdout), _O_U16TEXT); // set console mode to unicode
wprintf(L"alpha is:\n\t%s\n", a); // works for me :)
}
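Without the _setmode call, the console doesn't work in Unicode and prints a "?" for non-ASCII chars. Note that _setmode, _fileno and _O_U16TEXT are Microsoft-specific; on Linux, fileno exists without the underscore prefix, but the usual route there is setlocale(LC_ALL, "") with a UTF-8 locale rather than a mode switch.
Note: for Windows GUI output there is already proper support, so you can use wsprintf to format Unicode strings.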

iterating through a char array with non standard chars

Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However chars like ä,ö take up twice the space and use two elements.
This is where my problem lies, I don't know how to access those special chars.
In my example the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char* hmm = "äö";
printf("%c\n", hmm[0]); //i want to print "ä"
printf("%zu\n", strlen(hmm));
return 0;
}
Thanks. I tried to run my attached code in Eclipse, and there it works. I assume that's because it uses 64 bits and the "ä" has enough space to fit. strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>
int main()
{
char* hmm = "äüö";
printf("%c\n", hmm[0]);
printf("%c\n", hmm[1]);
printf("%c\n", hmm[2]);
return 0;
}
A char always uses one byte.
You think that "ä" is one char, but that is wrong.
Open your .c source file with a hexadecimal viewer and you will see that ä takes 2 chars, because the file is encoded in UTF-8.
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
int main()
{
const wchar_t hmm[] = L"äö";
setlocale(LC_ALL, "");
wprintf(L"%ls\n", hmm);
wprintf(L"%lc\n", hmm[0]);
wprintf(L"%zu\n", wcslen(hmm));
return 0;
}
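If you don't have a hex viewer at hand, you can also check the byte-level claim from inside a program. This is only an illustrative helper (bytes_to_hex is a made-up name) and it needs nothing beyond stdio.h:

```c
#include <stdio.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
 * two-digit hex values. Returns how many bytes of s were dumped; stops
 * early if out (of size cap) is too small. */
size_t bytes_to_hex(const char *s, char *out, size_t cap)
{
    size_t count = 0, used = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (; s[count] != '\0'; count++) {
        int n = snprintf(out + used, cap - used, "%s%02x",
                         count ? " " : "",
                         (unsigned)(unsigned char)s[count]);
        if (n < 0 || (size_t)n >= cap - used) {
            out[used] = '\0';   /* drop the partially written value */
            break;
        }
        used += (size_t)n;
    }
    return count;
}
```

With a UTF-8 encoded source file, dumping "äö" shows four bytes, two per character, which matches what strlen reports.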
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
int main(void)
{
char* hmm = "äö";
int off = 0;
int len;
int max = strlen(hmm);
setlocale(LC_ALL, "");
printf("<<%s>>\n", hmm);
printf("%zu\n", strlen(hmm));
while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
{
printf("<<%.*s>>\n", len, &hmm[off]);
off += len;
}
return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
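If the data is known to be UTF-8 and you really are limited to stdio.h and stdlib.h, you can also split the string by looking at each lead byte yourself instead of relying on the locale. A minimal sketch under that assumption (the names utf8_seq_len and utf8_count are invented here, and it checks structure only, not overlong encodings):

```c
#include <stdlib.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with this lead byte, or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;                              /* continuation or invalid byte */
}

/* Hypothetical helper: count the characters in a UTF-8 string.
 * Returns (size_t)-1 if the byte structure is malformed. */
size_t utf8_count(const char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            return (size_t)-1;
        /* Every following byte must be a continuation byte (10xxxxxx);
         * the terminating NUL fails this test, so we never overrun. */
        for (size_t i = 1; i < len; i++)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        s += len;
        count++;
    }
    return count;
}
```

To print one character at a time, use the length from utf8_seq_len with printf("%.*s", (int)len, p), the same way the mblen loop above does.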
Sorry to drag this on, though I think it's important to highlight some issues. As I understand it, OS X can have UTF-8 as the default OS code page, so this answer is mostly about Windows, which uses UTF-16 under the hood and whose default ANSI code page (ACP) depends on the configured OS region.
Firstly you can open Character Map, and find that
äö
Both reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer, of using
setlocale( LC_ALL, "" )
does not give insight into why äö was rendered incorrectly in the Command Prompt window.
Command Prompt uses its own code pages, namely OEM code pages. Here is a reference to the available OEM code pages with their character maps.
Typing the chcp command at the command prompt reveals the OEM code page that it is currently using.
Microsoft's documentation for setlocale(LC_ALL, "") details the following behavior:
setlocale( LC_ALL, "" );
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can do this manually, by using chcp and passing your required code page, then run your application and it should output the text perfectly fine.
If it were a multibyte character set problem, there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, and given that the length was 4 instead of 2, I would say that the file has been saved as UTF-8 (it could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later). You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each of these Unicode code points as two bytes. Your compiler opens the file and assumes it is in your default OS (ANSI) code page. When parsing your string, it interprets it as an ANSI string where 1 byte = 1 character.
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support for the ASCII/MBCS stdio functions.
For Mac OS X, which has UTF-8 as the default OS code page, I would recommend following Jonathan Leffler's solution, because it is more elegant. If you port it to Windows later, though, you will need to convert the string from UTF-8 to UTF-16 using the example below.
In either solution you will still need to change the command prompt code page to your operating system code page to print the characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <clocale> // setlocale (this example is C++: it uses new/delete)
// File saved as UTF-8, with characters outside the ASCII range
int main()
{
// Set the OEM code page to be the default OS code page
setlocale(LC_ALL, "");
// äö reside outside of the ASCII range and in the Unicode code point Western Latin 1
// Thus, requires a lead byte per unicode code point when saving as UTF-8
char* hmm = "äö";
printf("UTF-8 file string using Windows 1252 code page read as:%s\n",hmm);
printf("Length:%zu\n", strlen(hmm));
// Convert the UTF-8 String to a wide character
int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
LPWSTR lpszW = new WCHAR[nLen];
MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);
// Print it
wprintf(L"wprintf wide character of UTF-8 string: %s\n", lpszW);
// Free the memory
delete[] lpszW;
int c = getchar();
return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your command prompt font/code page to make sure that it can display your OS single-byte encoding. Note that the command prompt has its own code page, which differs from your text editor's.
