How are UTF string literals interpreted in C?

Say I want to print (i.e. through stdout) a UTF-32-encoded string, zß水🍌, taking the examples found on cppreference.
With the U prefix, it's straightforward:
printf("%ls", U"zß水🍌")
However, it won't work unless an appropriate locale is set before printing:
setlocale(LC_ALL, "XXX.UTF-8")
My simplified question is: why should we use an XXX.UTF-8 locale setting instead of something like XXX.UTF-32?
My confusion arose while testing the code below:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <uchar.h>
void test2() {
    char32_t w_str[] = U"zß水🍌";
    printf("wchar width: %d\n", __WCHAR_WIDTH__);
    if (__STDC_UTF_32__) printf("confirmed: utf32 used.\n");

    printf("(with ' C' locale) wide string is interpreted as: ");
    /* set locale */ if (setlocale(LC_ALL, "C") == NULL) perror("setlocale");
    if (printf("[%ls]", w_str) == -1) { perror(" ERROR('C' locale)"); clearerr(stdout); }
    printf("\n");

    printf("(with 'utf8' locale) wide string is interpreted as: ");
    /* set locale */ if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) perror("setlocale");
    if (printf("[%ls]", w_str) == -1) { perror(" ERROR('utf8' locale)"); clearerr(stdout); }
    printf("\n");
}
int main(){test2();}
Along with output:
# gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
wchar width: 32
confirmed: utf32 used.
ERROR('C' locale): Invalid or incomplete multibyte or wide character
(with ' C' locale) wide string is interpreted as: [
(with 'utf8' locale) wide string is interpreted as: [zß水🍌]
According to C11-6.4.4.4-9:
Prefix : Corresponding Type
U : char32_t
and C11-6.10.8.2-1:
__STDC_UTF_32__  The integer constant 1, intended to indicate that values of type char32_t are UTF-32 encoded.
My complete question is: I've already specified an exact UTF-32 string literal (and I'm 100% certain it is UTF-32 encoded, having checked the assembly output of the code listed above), so isn't it more appropriate to use a locale setting like XXX.UTF-32? If not, why is XXX.UTF-8 qualified to decode the UTF-32 byte sequence?
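For what it's worth, the conversion that a UTF-8 locale enables can be spelled out by hand: each char32_t code point has to be converted to the multibyte encoding of the current locale before it can be written to stdout, and the locale only decides what that external encoding is. A minimal sketch of doing that conversion explicitly, assuming a C11 library that provides c32rtomb() in <uchar.h> (error handling abbreviated):

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void) {
    /* The locale chooses the external multibyte encoding (here UTF-8);
       the char32_t values themselves stay UTF-32 in memory. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
        perror("setlocale");

    char32_t w_str[] = U"zß水🍌";
    mbstate_t st;
    memset(&st, 0, sizeof st);

    for (size_t i = 0; w_str[i] != 0; i++) {
        char buf[MB_LEN_MAX];
        /* one UTF-32 code point -> multibyte sequence of the current locale */
        size_t n = c32rtomb(buf, w_str[i], &st);
        if (n == (size_t)-1) { perror("c32rtomb"); break; }
        fwrite(buf, 1, n, stdout);
    }
    putchar('\n');
    return 0;
}

Under the "C" locale the same c32rtomb() calls fail for the non-ASCII code points, which is the same kind of failure the %ls test above reports.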

Related

How to use generic function GetDriveType

I have a C app that I need to compile on Windows, and I am really struggling to wrap my head around the UNICODE vs. ANSI concepts in Windows.
I want to use the GetDriveType function, and there are two variants, GetDriveTypeA and GetDriveTypeW. There is also a note here saying that GetDriveType is an alias for both and will select one based on a preprocessor define.
But how should I call this function?
This is what I am trying:
const TCHAR* path = "C:\\Users\\";
const TCHAR* trailing_slash = "\\";
size_t requiredSize = mbstowcs(NULL, path, 0);
TCHAR* win_path = (char*)malloc((requiredSize + 2) * sizeof(char));
UINT driveType = 0;
strncpy(win_path, path, requiredSize + 1);
strncat(win_path, trailing_slash, 2);
printf("Checking path: %s\n", win_path);
driveType = GetDriveType(win_path);
wprintf(L"Drive type is: %d\n", driveType);
if (driveType == DRIVE_FIXED)
printf("Success\n");
else
printf("Failure\n");
return 0;
It produces the result
Checking path: C:\Users\
Drive type is: 1
Failure
If I replace GetDriveType with GetDriveTypeA it returns the correct value 3 and succeeds.
I tried another variant too
size_t requiredSize = mbstowcs(NULL, path, 0);
uint32_t drive_type = 0;
const wchar_t *trailing_slash = L"\\";
wchar_t *win_path = (wchar_t*) malloc((requiredSize + 2) * sizeof(wchar_t));
/* Convert char* to wchar* */
size_t converted = mbstowcs(win_path, path, requiredSize+1);
/* Add a trailing backslash */
wcscat(win_path, trailing_slash);
/* Finally, check the path */
drive_type = GetDriveType(win_path);
I see this warning:
'function' : incompatible types - from 'wchar_t *' to 'LPCSTR'
So, which one should I use? How is it generic? The path I will be reading comes from an environment variable on Windows.
What are TCHAR and wchar_t, etc.? I found this post, but could not understand much.
This Microsoft post says
Depending on your preference, you can call the Unicode functions explicitly, such as SetWindowTextW, or use the macros
So is it OK to use wchar_t everywhere and call GetDriveTypeW directly?
Back in the mid-90s you had Windows 95/98/ME that did not support Unicode and NT4/2000/XP that did. You could create source code that could compile with or without Unicode support just by changing the UNICODE define.
This type of code looks like this:
UINT type = GetDriveType(TEXT("c:\\"));
There is no actual function named GetDriveType; 99% of the functions that take a string parameter in Windows come in two versions, in this case GetDriveTypeA and GetDriveTypeW.
Inside the Windows header files you have code that looks like this:
#ifdef UNICODE
#define GetDriveType GetDriveTypeW
#else
#define GetDriveType GetDriveTypeA
#endif
If UNICODE is defined before including windows.h the above code expands to:
UINT type = GetDriveTypeW(L"c:\\");
and if not:
UINT type = GetDriveTypeA("c:\\");
These days most applications should use Unicode. Whether you should use wchar_t/WCHAR and call GetDriveTypeW directly or still rely on the defines is a style question. There might be situations where you need to force the A or W function and that is OK as well.
The same applies to the C library with the _TEXT macro and the _tcs functions, except that those are controlled by the _UNICODE define.
If you get a warning about incompatible string types, then you are calling the wrong function or you have not added #define UNICODE (and _UNICODE). If you are compiling cross-platform code intended for Unix, you might have to convert from char* to a wide string in some places.
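To answer the "is it OK to call GetDriveTypeW directly" part: yes. A minimal sketch of that approach, converting the narrow path yourself (this assumes the narrow string is in the current locale's encoding, so mbstowcs() is adequate; error handling abbreviated):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    const char *path = "C:\\Users\\";           /* e.g. taken from getenv() */

    /* How many wchar_t elements are needed (terminator excluded)? */
    size_t len = mbstowcs(NULL, path, 0);
    if (len == (size_t)-1)
        return 1;

    wchar_t *win_path = malloc((len + 1) * sizeof *win_path);
    if (win_path == NULL)
        return 1;
    mbstowcs(win_path, path, len + 1);

    /* Call the wide version explicitly; no TCHAR/UNICODE defines involved. */
    UINT drive_type = GetDriveTypeW(win_path);
    wprintf(L"Drive type of %ls is %u\n", win_path, drive_type);

    free(win_path);
    return drive_type == DRIVE_FIXED ? 0 : 1;
}

If the input is known to be UTF-8 rather than in the current code page, converting with MultiByteToWideChar(CP_UTF8, ...) instead of mbstowcs() is the more direct route.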
See also:
TEXT vs. _TEXT vs. _T, and UNICODE vs. _UNICODE

mbstowcs() gives incorrect results in Windows

I am using mbstowcs() to convert a UTF-8-encoded char* string to wchar_t*, which will then be fed into _wfopen(). However, I always get a NULL pointer from _wfopen(), and I have found that the problem comes from the result of mbstowcs().
I prepared the following example and used printf for debugging...
size_t out_size;
int requiredSize;
wchar_t *wc_filename;
char *utf8_filename = "C:/Users/xxxxxxxx/Desktop/\xce\xb1\xce\xb2\xce\xb3.stdf";
wchar_t *expected_output = L"C:/Users/xxxxxxxx/Desktop/αβγ.stdf";

printf("input: %s, length: %zu\n", utf8_filename, strlen(utf8_filename));
printf("correct out length is %zu\n", wcslen(expected_output));

// conversion starts here
setlocale(LC_ALL, "C.UTF-8");
requiredSize = mbstowcs(NULL, utf8_filename, 0);
wc_filename = (wchar_t*)malloc((requiredSize + 1) * sizeof(wchar_t));
printf("requiredsize: %d\n", requiredSize);
if (!wc_filename) {
    // allocation failed
    free(wc_filename);
    return -1;
}

out_size = mbstowcs(wc_filename, utf8_filename, requiredSize + 1);
if (out_size == (size_t)(-1)) {
    // conversion failed
    free(wc_filename);
    return -1;
}

printf("out_size: %zu, wchar name: %ls\n", out_size, wc_filename);
if (wcscmp(wc_filename, expected_output) != 0) {
    printf("converted result is not correct\n");
}
free(wc_filename);
And the console output is:
input: C:/Users/xxxxxxxx/Desktop/αβγ.stdf, length: 37
correct out length is 34
requiredsize: 37
out_size: 37, wchar name: C:/Users/xxxxxxxx/Desktop/αβγ.stdf
converted result is not correct
I just don't understand why expected_output and wc_filename have the same content but different lengths. What did I do wrong here?
The problem appears to be in your choice of locale name. Replacing the following:
setlocale(LC_ALL, "C.UTF-8");
with this:
setlocale(LC_ALL, "en_US.UTF-8");
fixes the issue on my system (Windows 10, MSVC, 64-bit build) – at least, the out_size and requiredSize are both 34 and the "converted result is not correct\n" message doesn't show. Using "en_GB.UTF-8" also worked.
I'm not sure if the C Standard actually defines what locale names are, but this question/answer may be helpful: Valid Locale Names.
Note: As mentioned in the comment by Mgetz, using setlocale(LC_ALL, ".UTF-8"); also works – I guess that would be the minimal and most portable locale name to use.
Second note: You can check whether the setlocale call succeeded by comparing its return value to NULL. Your original locale name will produce an error message with the following code (but not if you remove the leading "C"):
if (setlocale(LC_ALL, "C.UTF-8") == NULL) {
printf("Error setting locale!\n");
}
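Putting the pieces together, a minimal sketch of the conversion using the ".UTF-8" locale name mentioned above (this assumes a UCRT-based toolchain where that locale is available; error handling abbreviated):

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a UTF-8 path to wchar_t, e.g. for _wfopen(); returns NULL on failure.
   The caller is responsible for free()ing the result. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    /* ".UTF-8" selects the UTF-8 code page with the default locale settings. */
    if (setlocale(LC_ALL, ".UTF-8") == NULL)
        return NULL;

    size_t len = mbstowcs(NULL, utf8, 0);      /* length, excluding L'\0' */
    if (len == (size_t)-1)
        return NULL;

    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide != NULL && mbstowcs(wide, utf8, len + 1) == (size_t)-1) {
        free(wide);
        wide = NULL;
    }
    return wide;
}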
The Universal CRT supports UTF-8, but MSVCRT.DLL does not.
When using MinGW, you need to link against the UCRT.

Change the character encoding in a PostgreSQL C-language function

I am using the 64-bit version of PostgreSQL 9.5 on Windows Server.
The character encoding of the database is set to UTF8.
I'd like to create a function that manipulates multibyte strings (e.g. cleansing, replacement, etc.).
I copied the C logic for manipulating characters from another system, and that logic assumes the character encoding is SJIS.
I do not want to change the C logic, so I want to convert from UTF-8 to SJIS inside the PostgreSQL C-language function.
Something like the convert_to function (however, since convert_to returns bytea, I want the result as text).
Please tell me how to convert from UTF-8 to SJIS in C.
Create Function Script:
CREATE FUNCTION CLEANSING_STRING(character varying)
RETURNS character varying AS
'$libdir/MyFunc/CLEANSING_STRING.dll', 'CLEANSING_STRING'
LANGUAGE c VOLATILE STRICT;
C Source:
#include <stdio.h>
#include <string.h>
#include <postgres.h>
#include <port.h>
#include <fmgr.h>
#include <stdlib.h>
#include <builtins.h>
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
extern PGDLLEXPORT Datum CLEANSING_STRING(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(CLEANSING_STRING);
Datum CLEANSING_STRING(PG_FUNCTION_ARGS)
{
    // Get Arg
    text *arg1 = (text *)PG_GETARG_TEXT_P(0);

    // Text to Char[]
    char *arg;
    arg = text_to_cstring(arg1);

    // UTF8 to Sjis
    //Char *sjisChar[] = foo(arg); // something like that..

    // Copied from the other system. (Assumes that the character code is sjis.)
    cleansingString(sjisChar);
    replaceStrimg(sjisChar);

    // Sjis to UTF8
    //arg = bar(sjisChar); // something like that..

    // Char[] to Text and Return
    PG_RETURN_TEXT_P(cstring_to_text(arg));
}
I succeeded using the approach suggested in the question comments:
#include <mb/pg_wchar.h> // Add to includes.
...
Datum CLEANSING_STRING(PG_FUNCTION_ARGS)
{
    // Get Arg
    text *arg1 = (text *)PG_GETARG_TEXT_P(0);

    // Text to Char[]
    char *arg;
    arg = text_to_cstring(arg1);

    // UTF8 to Sjis
    char *sjisChar = pg_server_to_any(arg, strlen(arg), PG_SJIS);

    // Copied from the other system. (Assumes that the character code is sjis.)
    cleansingString(sjisChar);
    replaceStrimg(sjisChar);

    // Sjis to UTF8: pg_any_to_server converts from SJIS back to the server encoding (UTF8);
    // the third argument is the encoding of the conversion source.
    arg = pg_any_to_server(sjisChar, strlen(sjisChar), PG_SJIS);

    // Char[] to Text and Return
    PG_RETURN_TEXT_P(cstring_to_text(arg));
}

iconv_open() returning EINVAL on Solaris 8

In Solaris 8, it looks like the iconv*() family of functions is broken and only supports conversions between single-byte charsets and UTF-8, which can be verified using this code example:
#include <stdio.h>
#include <errno.h>
#include <iconv.h>

#if defined(__sun) && defined(__SVR4)
#define CP1251 "ansi-1251"
#define ISO_8859_5 "ISO8859-5"
#else
#define CP1251 "CP1251"
#define ISO_8859_5 "ISO-8859-5"
#endif

void iconv_open_debug(const char *, const char *);

int main() {
    iconv_open_debug(CP1251, CP1251);
    iconv_open_debug(CP1251, ISO_8859_5);
    iconv_open_debug(CP1251, "KOI8-R");
    iconv_open_debug(CP1251, "UTF-8");
    iconv_open_debug(CP1251, "WCHAR_T");
    iconv_open_debug(ISO_8859_5, CP1251);
    iconv_open_debug(ISO_8859_5, ISO_8859_5);
    iconv_open_debug(ISO_8859_5, "KOI8-R");
    iconv_open_debug(ISO_8859_5, "UTF-8");
    iconv_open_debug(ISO_8859_5, "WCHAR_T");
    iconv_open_debug("KOI8-R", CP1251);
    iconv_open_debug("KOI8-R", ISO_8859_5);
    iconv_open_debug("KOI8-R", "KOI8-R");
    iconv_open_debug("KOI8-R", "UTF-8");
    iconv_open_debug("KOI8-R", "WCHAR_T");
    iconv_open_debug("UTF-8", CP1251);
    iconv_open_debug("UTF-8", ISO_8859_5);
    iconv_open_debug("UTF-8", "KOI8-R");
    iconv_open_debug("UTF-8", "UTF-8");
    iconv_open_debug("UTF-8", "WCHAR_T");
    iconv_open_debug("WCHAR_T", CP1251);
    iconv_open_debug("WCHAR_T", ISO_8859_5);
    iconv_open_debug("WCHAR_T", "KOI8-R");
    iconv_open_debug("WCHAR_T", "UTF-8");
    iconv_open_debug("WCHAR_T", "WCHAR_T");
    return 0;
}

void iconv_open_debug(const char *from, const char *to) {
    errno = 0;
    if (iconv_open(to, from) == (iconv_t) -1) {
        fprintf(stderr, "iconv_open(\"%s\", \"%s\") FAIL: errno = %d\n", to, from, errno);
        perror("iconv_open()");
    } else {
        fprintf(stdout, "iconv_open(\"%s\", \"%s\") PASS\n", to, from);
    }
}
which only prints
iconv_open("UTF-8", "ansi-1251") PASS
iconv_open("UTF-8", "ISO8859-5") PASS
iconv_open("UTF-8", "KOI8-R") PASS
iconv_open("ansi-1251", "UTF-8") PASS
iconv_open("ISO8859-5", "UTF-8") PASS
iconv_open("KOI8-R", "UTF-8") PASS
to stdout and fails with EINVAL for the other pairs. Note that even conversion to the same charset (e.g. UTF-8 -> UTF-8) is not supported.
Questions
Can anyone reference a document describing the limitations of the Solaris version of iconv.h?
How can I convert a wchar_t* to a single- or multibyte string w/o relying on GNU libiconv? wcstombs() would be fine except that it relies on the current locale's charset, while I want a wide string converted to a regular string using a particular charset, possibly different from the default one.
Running sdtconvtool shows that most legacy codepages are supported.
After re-running the same utility with truss -u libc::iconv_open, I learnt that conversion from one single-byte encoding to another is done in two steps, with an intermediate conversion to UTF-8.
As for conversion from "WCHAR_T", iconv(3) does support it, but "UCS-4" should be used as the source charset name, since sizeof(wchar_t) is 4 on Solaris (for both x86 and SPARC).
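A minimal sketch of that wchar_t-to-multibyte conversion via iconv(), assuming sizeof(wchar_t) == 4 so the in-memory representation can be fed to a "UCS-4" converter (error handling abbreviated; note that Solaris declares iconv()'s second parameter as const char **, so the cast below may need adjusting):

#include <iconv.h>
#include <wchar.h>

/* Convert a wide string to the given charset, e.g. "UTF-8". Returns 0 on success. */
int wide_to_charset(const wchar_t *in, const char *tocode, char *out, size_t outlen)
{
    iconv_t cd = iconv_open(tocode, "UCS-4");   /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;                   /* raw bytes of the wchar_t array */
    size_t inleft = (wcslen(in) + 1) * sizeof(wchar_t);
    char *outptr = out;
    size_t outleft = outlen;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? -1 : 0;
}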

Wrong glyphs displayed when using emWin and Korean fonts

I am using SEGGER emWin on an embedded system.
I have downloaded a Korean font (Korean True Type Font) and converted it to C data statements.
When I printed the text: 한국어 ("Korean"), nothing printed out.
The hex code for the text (UTF-8) is: \xED\x95\x9C\xEA\xB5\xAD\xEC\x96\xB4
I opened up the font in the Font Creator and noticed the glyph at offset 0xED does not match the first glyph in the text. Also, there are no glyphs at offset 0xED95 or 0x95ED.
I converted the file using 16-bit Unicode.
The hex code for the text was determined by using Google Translate, then copying the text into Notepad, saving the text as UTF-8 and then opening up the text file with a hex editor.
How do I get the hex string to print the appropriate glyphs?
Am I having a Unicode vs. UTF-8 issue?
Edit 1:
I am not calling any functions to change the encoding, as I am confused about that part.
Here's the essential code:
// alphabetize languages for display
static const Languages_t Language_map[] =
{
    {"Deutsch", ESG_LANG_German__Deutsch_},
    {"English", ESG_LANG_English},
    {"Espa\303\361ol", ESG_LANG_Spanish__Espanol_},
    {"Fran\303\247ais", ESG_LANG_French__Francais_}, /* parasoft-suppress MISRA2004-7_1 "octal sequence needed for text accents on foreign language text" */
    {"Italiano", ESG_LANG_Italian__Italiano_},
    {"Nederlands", ESG_LANG_Dutch__Nederlands_},
    {"Portugu\303\252s", ESG_LANG_Portuguese__Portugues_}, /* parasoft-suppress MISRA2004-7_1 "octal sequence needed for text accents on foreign language text" */
    {"Svenska", ESG_LANG_Swedish__Svenska_},
    {"\xED\x95\x9C\xEA\xB5\xAD\xEC\x96\xB4", ESG_LANG_Korean}, // UTF-8
    // {"\xFF\xFE\x5c\xD5\x6D\xAD\xB4\xC5", ESG_LANG_Korean}, // Unicode
};

for (index = ESG_LANG_English; index < ESG_LANG_MAX_LANG; index++)
{
    if (index == ESG_LANG_Korean)
    {
        GUI_SetFont(&Font_KTimesSSK22_12pt);
    }
    else
    {
        GUI_SetFont(&GUI_FontMyriadPro_Semibold_22pt);
    }

    if (index == language)
    {
        GUI_SetColor(ESG_WHITE);
    }
    else
    {
        GUI_SetColor(ESG_AMR_BLUE);
    }

    (void) GUI_SetTextAlign(GUI_TA_HCENTER);
    GUI_DispStringAt(Language_map[index].name,
                     (signed int)Language_position[index].x,
                     (signed int)Language_position[index].y);
}

//...

void GUI_DispStringAt(const char GUI_UNI_PTR *s, int x, int y) {
    GUI_LOCK();
    GUI_pContext->DispPosX = x;
    GUI_pContext->DispPosY = y;
    GUI_DispString(s);
    GUI_UNLOCK();
}
The GUI_UNI_PTR is not for Unicode, but for "Universal":
/* Define "universal pointer". Normally, this is not needed (define will expand to nothing)
However, on some systems (AVR - IAR compiler) it can be necessary ( -> __generic),
since a default pointer can access RAM only, not the built-in Flash
*/
#ifndef GUI_UNI_PTR
#define GUI_UNI_PTR
#define GUI_UNI_PTR_USED 0
#else
#define GUI_UNI_PTR_USED 1
#endif
emWin is performing correctly.
The system is set up for UTF-8 encoding.
The issue is finding a TrueType Unicode font that contains all the glyphs (bitmaps) for the Korean language. Many fonts claim to support Korean, but their glyphs are in the wrong places for Unicode.
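For reference, the encoding side of this is a single call in emWin: GUI_UC_SetEncodeUTF8() makes string routines such as GUI_DispStringAt() interpret char strings as UTF-8 and look up glyphs by Unicode code point. A minimal sketch, assuming the selected font actually contains the Hangul glyphs at their Unicode code points (U+D55C, U+AD6D, U+C5B4):

#include "GUI.h"

void show_korean(void)
{
    /* Interpret char strings as UTF-8 from now on. */
    GUI_UC_SetEncodeUTF8();

    /* This font must provide glyphs at the Unicode code points of the text. */
    GUI_SetFont(&Font_KTimesSSK22_12pt);

    /* "한국어" as a UTF-8 byte sequence. */
    GUI_DispStringAt("\xED\x95\x9C\xEA\xB5\xAD\xEC\x96\xB4", 10, 10);
}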
