Convert char to wchar_t using standard library? - c

I have a function that expects a wchar_t array as a parameter. I don't know of a standard library function to make a conversion from char to wchar_t, so I wrote a quick and dirty function, but I want a reliable solution free from bugs and undefined behavior. Does the standard library have a function that makes this conversion?
My code:
wchar_t *ctow(const char *buf, wchar_t *output)
{
    const char ANSI_arr[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    const wchar_t WIDE_arr[] = L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    size_t n = 0, len = strlen(ANSI_arr);

    while (*buf) {
        for (size_t x = 0; x < len; x++) {
            if (*buf == ANSI_arr[x]) {
                output[n++] = WIDE_arr[x];
                break;
            }
        }
        buf++;
    }
    output[n] = L'\0';
    return output;
}

Well, conversion functions are declared in stdlib.h (*). But you should know that for any character in the Latin-1 (aka ISO-8859-1) charset the conversion to a wide character is a mere assignment, because the Unicode code points below 256 are exactly the Latin-1 characters.
So if your initial charset is ISO-8859-1, the conversion is simply:
wchar_t *ctow(const char *buf, wchar_t *output) {
    wchar_t *cr = output;
    while (*buf) {
        *output++ = *buf++;
    }
    *output = 0;
    return cr;
}
provided the caller passed a pointer to an array big enough to store all the converted characters.
If you are using any other charset, you will have to use a well-known library like ICU, or build one by hand, which is simple for single-byte charsets (the ISO-8859-x series) and trickier for multibyte ones like UTF-8.
But without knowing the charsets you want to be able to process, I cannot say more...
BTW, plain ASCII is a subset of the ISO-8859-1 charset.
(*) From cplusplus.com
int mbtowc (wchar_t* pwc, const char* pmb, size_t max);
Convert multibyte sequence to wide character
The multibyte character pointed by pmb is converted to a value of type wchar_t and stored at the location pointed by pwc. The function returns the length in bytes of the multibyte character.
mbtowc has its own internal shift state, which is altered as necessary only by calls to this function. A call to the function with a null pointer as pmb resets the state (and returns whether multibyte characters are state-dependent).
The behavior of this function depends on the LC_CTYPE category of the selected C locale.
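For whole strings there is also mbstowcs() in stdlib.h. A minimal sketch, assuming the caller supplies the output buffer and its capacity in wchar_t (terminator included), and that the locale has been selected at program start; the name ctow_std is mine:
#include <stdlib.h>
#include <wchar.h>

/* Sketch: convert a multibyte string to a wide string using the current
 * locale's encoding. Returns output on success, NULL on an invalid sequence. */
wchar_t *ctow_std(const char *buf, wchar_t *output, size_t out_len)
{
    size_t n = mbstowcs(output, buf, out_len);
    if (n == (size_t)-1)
        return NULL;                     /* invalid multibyte sequence */
    if (n == out_len)
        output[out_len - 1] = L'\0';     /* ensure termination on truncation */
    return output;
}
Remember to call setlocale(LC_ALL, "") once at startup, otherwise the default "C" locale is used and only ASCII is guaranteed to convert.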

It does, in the header wchar.h. It is called btowc:
The btowc function returns WEOF if c has the value EOF or if (unsigned char)c
does not constitute a valid single-byte character in the initial shift state. Otherwise, it
returns the wide character representation of that character.
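A minimal sketch of the question's function rewritten on top of btowc (the name ctow_btowc is mine, and replacing unconvertible bytes with L'?' is a policy choice, not something the question asks for):
#include <wchar.h>

wchar_t *ctow_btowc(const char *buf, wchar_t *output)
{
    size_t n = 0;
    while (buf[n] != '\0') {
        wint_t wc = btowc((unsigned char)buf[n]);   /* single-byte chars only */
        output[n++] = (wc == WEOF) ? L'?' : (wchar_t)wc;
    }
    output[n] = L'\0';
    return output;
}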

That isn't a general conversion; it's a function for destroying data outside of ISO-646. No method in the C library will make that conversion for you. You can look at the ICU4C library. If you are only on Windows, you can look at the relevant functions in the Win32 API (MultiByteToWideChar, WideCharToMultiByte, etc.).
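For the char to wchar_t direction on Windows, a minimal sketch using MultiByteToWideChar, assuming the input is UTF-8 (the helper name and allocation policy are mine):
#include <windows.h>
#include <stdlib.h>

/* Sketch: convert a UTF-8 string to a newly allocated wide string.
 * Returns NULL on failure; the caller frees the result. */
wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;
    wchar_t *wide = malloc(len * sizeof *wide);
    if (wide != NULL && MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}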

Related

sscanf_s doesn't return first character of string

I'm trying to find the first string (max 4 characters) in a comma-separated list of strings inside a C char-array.
I'm trying to achieve this by using sscanf_s (under Windows) and the format-control string %[^,]:
char mystring[] = "STR1,STR2";
char temp[5];
if (sscanf_s(mystring, "%[^,]", temp, 5) != 0) {
    if (strcmp(temp, "STR1") == 0) { return 0; }
    else if (strcmp(temp, "STR2") == 0) { return 1; }
    else { return -1; }
}
After calling sscanf_s the content of temp is not STR1 but \0TR1 (\0 being a NUL byte, i.e. the character with code 0), and the value -1 is returned.
Why do I get that behavior and how do I fix my code to get the right result (return of 0)?
EDIT: changed char mystring to mystring[] (I should have made sure I typed it correctly here)
There are multiple problems in your code:
mystring is defined as a char, not a string pointer.
the argument 5 following temp in sscanf_s() should have type rsize_t, which is the same as size_t. You should specify it as sizeof(temp).
you should specify the maximum number of characters to store into the destination array in the format string, to avoid the counter-intuitive behavior of sscanf_s in case of overflow.
sscanf_s returns 1 if it can convert and store the string. Testing != 0 will also accept EOF, which is an input failure, for which the contents of temp are indeterminate.
Here is a modified version:
const char *mystring = "STR1,STR2";
char temp[5];
if (sscanf_s(mystring, "%4[^,]", temp, sizeof temp) == 1) {
if (strcmp(temp, "STR1") == 0) {
return 0;
} else
if (strcmp(temp, "STR2") == 0) {
return 1;
} else {
return -1;
}
}
UPDATE: The OP uses Microsoft Visual Studio, which seems to have a non-conforming implementation of the so-called secure stream functions. Here is a citation from their documentation page:
The sscanf_s function reads data from buffer into the location that's given by each argument. The arguments after the format string specify pointers to variables that have a type that corresponds to a type specifier in format. Unlike the less secure version sscanf, a buffer size parameter is required when you use the type field characters c, C, s, S, or string control sets that are enclosed in []. The buffer size in characters must be supplied as an additional parameter immediately after each buffer parameter that requires it. For example, if you are reading into a string, the buffer size for that string is passed as follows:
wchar_t ws[10];
swscanf_s(in_str, L"%9s", ws, (unsigned)_countof(ws)); // buffer size is 10, width specification is 9
The buffer size includes the terminating null. A width specification field may be used to ensure that the token that's read in will fit into the buffer. If no width specification field is used, and the token read in is too big to fit in the buffer, nothing is written to that buffer.
In the case of characters, a single character may be read as follows:
wchar_t wc;
swscanf_s(in_str, L"%c", &wc, 1);
This example reads a single character from the input string and then stores it in a wide-character buffer. When you read multiple characters for non-null terminated strings, unsigned integers are used as the width specification and the buffer size.
char c[4];
sscanf_s(input, "%4c", &c, (unsigned)_countof(c)); // not null terminated
This specification is incompatible with the C Standard, which specifies the type of these buffer size arguments to be rsize_t, and rsize_t to be the same type as size_t.
In conclusion, for improved portability, one should avoid these secure functions and use the standard functions correctly, with a field width to prevent buffer overruns.
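For example, the portable version of the call above needs no extra size argument at all, because the field width already limits what is stored:
char temp[5];
if (sscanf(mystring, "%4[^,]", temp) == 1) {
    /* temp now holds at most 4 characters plus the terminating '\0' */
}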
You can prevent the Visual Studio warning about deprecation of sscanf by inserting this definition before including <stdio.h>:
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS // let me use standard functions
#endif
edited per the comment from chqrlie
regarding:
if(sscanf_s(mystring, "%[^,]",temp, 5) != 0){
The input format conversion specifier %[^,] always appends a NUL byte to the characters it stores, so the conversion must be limited to one less than the buffer size. The conversion specifier should therefore be "%4[^,]". The result after the correction is:
if(sscanf_s(mystring, "%4[^,]",temp, 5) != 0){
Also, no matter how many times this code snippet is executed, once the other problems are corrected the string extracted into temp will ALWAYS be "STR1" (so the function will always return 0).
regarding the statement:
char mystring = "STR1,STR2";
This is not a valid declaration. Suggest:
char *mystring = "STR1,STR2"; // notice the '*'
--or--
char mystring[] = "STR1,STR2"; // notice the '[]'

Why isn't my 8-bit pointer updating correctly?

Directions: I'm supposed to convert the ASCII string in place, in the exact same input buffer, which in this case is pt1.
Unfortunately for me, the loop is executed only once and hence my output buffer only contains the first short value.
I'm trying to convert the ASCII string into a Unicode 16-bit string. According to the directions, pt1 is supposed to point to an ASCII string.
Expected Output is on this link.
https://i.stack.imgur.com/COpXl.jpg
void Convert(unsigned short *pt1) {
    // pt1 is a pointer to a null-terminated variable length ASCII string
    // 0x30 0x31 0x32 0x00 (sentinel value)
    unsigned char *pt2 = (unsigned char *)pt1;
    unsigned char value = *pt2;
    while (*pt2 != 0x00) {
        value = *pt2;
        *pt1 = (unsigned short)value;
        pt2++;
        pt1++;
    }
    *pt1 = 0x0000;
}
There are multiple problems:
Your conversion function does not produce anything visible for the caller: you store code point values into a local array and then return. The compiler warns you that at least pt3 is set and not used, but a more advanced compiler would optimise away all the code in this function, since it has no observable side effects.
What is the API description for Convert? You seem to receive a pointer to an ASCII string disguised as a pointer to unsigned short, and it seems the conversion should be performed in place. If this is the actual requirement, it is a very bad idea. The function should receive a pointer to a destination array with type unsigned short *, a size_t specifying the number of elements of this array and a pointer to the source string, with type const char *.
How should bytes outside the ASCII range be handled? Is the source string encoded in a given code page? Is it UTF-8 encoded? Should the function report an error?
From the EDIT, you seem to confirm the insane API requirement. Assuming there is enough space in the argument array, you should perform the conversion from the last byte to the first, thus avoiding stepping on your own toes:
void Convert(unsigned short *pt1) {
    // pt1 is a pointer to a null-terminated variable length ASCII string
    // with enough space to receive the converted value including a null terminator
    unsigned char *pt2 = (unsigned char *)pt1;
    size_t i;

    // Compute the number of bytes
    for (i = 0; pt2[i] != '\0'; i++)
        continue;

    // Convert the contents from right to left
    // Assuming ISO8859-1 encoding for bytes outside the ASCII range
    for (;;) {
        pt1[i] = (unsigned short)pt2[i];
        if (i-- == 0)
            break;
    }
}
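A hypothetical caller would have to pack the ASCII bytes into a buffer that is already large enough for the widened result, for example (this assumes the Convert() function above is visible here):
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned short buf[4];          /* room for "012" widened plus a 16-bit terminator */
    memcpy(buf, "012", 4);          /* the bytes 0x30 0x31 0x32 0x00 from the example above */
    Convert(buf);
    for (size_t i = 0; buf[i] != 0; i++)
        printf("0x%04X ", (unsigned)buf[i]);   /* prints 0x0030 0x0031 0x0032 */
    putchar('\n');
    return 0;
}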

unicode string in C extension

I am writing a C extension for Ruby, and I need to accept a string as a parameter and iterate over the characters in the string. My code below works fine for ASCII characters, but it does not handle multibyte characters, and outputs "garbage" instead. I could not find any sample code that would iterate over Unicode strings. I would appreciate any pointers.
static VALUE test_method(VALUE self, VALUE text)
{
    char *pch;
    char *pch_end = RSTRING_END(text);
    for (pch = RSTRING_PTR(text); pch < pch_end; pch++)
    {
        printf("%c\n", *pch);
    }
    ...
}
Here’s an example of one way you could iterate over the characters:
static VALUE print_single_char(VALUE s)
{
    char *pch;
    pch = StringValueCStr(s);
    // pch is now a pointer to a sequence of bytes representing the
    // character in whatever its encoding was. printf will work if the
    // console encoding is the same, otherwise you may get junk again.
    printf("%s\n", pch);
    return Qnil;
}

static VALUE test_method(VALUE self, VALUE text)
{
    rb_block_call(text, rb_intern("each_char"), 0, NULL, print_single_char, Qnil);
    return Qnil;
}
Note that once you convert any characters to C-strings you lose any associated encoding information. You might want to convert any input into a known encoding (such as UTF-8) before doing anything else:
text = rb_funcall(text, rb_intern("encode"), 1, rb_str_new_cstr("utf-8"));
char is only 1 byte in size, so if you deal with multibyte characters you would have to use wchar_t instead, together with the appropriate wide functions such as wprintf.

Converting UTF-16 to UTF-8 using libiconv

I'm trying to convert a UTF-16 string into UTF-8 and hit a little wall. The output string contains the characters, but with blank spaces between them!? The input is hi\0, and if I look at the output, it says h\0i\0 instead of hi\0.
Do you see the problem here? Many thanks!
size_t len16 = 3 * sizeof(wchar_t);
size_t len8 = 7;
wchar_t utf16[3] = { 0x0068, 0x0069, 0x0000 }, *_utf16 = utf16;
char utf8[7], *_utf8 = utf8;
iconv_t utf16_to_utf8 = iconv_open("UTF-8", "UTF-16LE");
size_t result = iconv(utf16_to_utf8, (char **)&_utf16, &len16, &_utf8, &len8);
printf("%d - %s\n", (int)result, utf8);
iconv_close(utf16_to_utf8);
The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type.
In C++11 and C11 this should be char16_t, but you can also use uint16_t:
uint16_t data[] = { 0x68, 0x69, 0 };
char const * p = (char const *)data;
To be pedantic, there's nothing in general that says that uint16_t has two bytes. However, iconv is a Posix library, and Posix mandates that CHAR_BIT == 8, so it is true on Posix.
(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
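Putting it together, a sketch of the corrected program, assuming a little-endian host (so the uint16_t values are laid out in memory as UTF-16LE) and a POSIX iconv:
#include <stdio.h>
#include <stdint.h>
#include <iconv.h>

int main(void)
{
    uint16_t utf16[] = { 0x0068, 0x0069, 0x0000 };  /* "hi" as UTF-16 code units */
    char utf8[8] = { 0 };

    char *inbuf    = (char *)utf16;
    char *outbuf   = utf8;
    size_t inleft  = sizeof utf16;       /* 6 bytes, terminator included */
    size_t outleft = sizeof utf8 - 1;    /* keep one byte spare for '\0' */

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    printf("%s\n", utf8);                /* prints "hi" */
    return 0;
}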

wchar_t to octets - in C?

I'm trying to store a wchar_t string as octets, but I'm positive I'm doing it wrong - would anybody mind validating my attempt? What's going to happen when one char consumes 4 bytes?
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i = 0; i < wcslen(wchar1); i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
}
Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the latter has a fixed size for a glyph. UTF-16 uses surrogates for codepoints in the upper planes, such a glyph uses 2 wchar_t. UTF-8 is byte oriented, it uses between 1 and 4 bytes to encode a codepoint.
UTF-8 is an excellent choice if you need to transcode the text to a byte oriented stream. A very common choice for text files and HTML encoding on the Internet. If you use Windows then you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
Be careful to avoid byte encodings that translate text to a code page, such as wcstombs(). They are lossy encodings, glyphs that don't have a corresponding character code in the code page are replaced by ?.
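On Windows, the usual two-call pattern with WideCharToMultiByte (first call to measure, second to convert) looks roughly like this; the helper name is mine:
#include <windows.h>
#include <stdlib.h>

/* Sketch: encode a wide string as UTF-8; the caller frees the result. */
char *wide_to_utf8(const wchar_t *wide)
{
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (bytes == 0)
        return NULL;
    char *utf8 = malloc(bytes);
    if (utf8 != NULL && WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, bytes, NULL, NULL) == 0) {
        free(utf8);
        return NULL;
    }
    return utf8;
}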
You can use the wcstombs() (widechar string to multibyte string) function provided in stdlib.h
The prototype is as follows:
#include <stdlib.h>
size_t wcstombs(char *dest, const wchar_t *src, size_t n);
It will correctly convert your wchar_t string provided by src into a char (a.k.a. octets) string and write it to dest with at most n bytes.
wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
int i, length;

setlocale(LC_ALL, ""); /* needs <locale.h>; without it the default "C" locale cannot convert the non-ASCII characters */
memset(mb_string, 0, 512);
length = wcstombs(mb_string, wide_string, 511);
/* mb_string will be zero terminated if it wasn't cancelled by reaching the limit
 * before being finished with converting. If the limit WAS reached, the string
 * will not be zero terminated and you must do it yourself - not happening here */
for (i = 0; i < length; i++)
    printf("Octet #%d: '%02x'\n", i, (unsigned char)mb_string[i]);
If you're trying to see the content of the memory buffer holding the string, you can do this:
size_t len = wcslen(str) * sizeof(wchar_t);
const char *ptr = (const char *)str;
for (size_t i = 0; i < len; i++) {
    printf("(%u)", (unsigned)(unsigned char)ptr[i]);
}
I don't know why printf and wprintf do not work together. Following code works.
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i = 0; i < wcslen(wchar1); i++)
{
    wprintf(L"(%d)", (wchar1[i]) & 255);
    wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}
