Converting UTF-16 to UTF-8 using libiconv - c

I'm trying to convert a UTF-16 string into UTF-8 and have hit a little wall. The output string contains the characters, but with blank spaces!? The input is hi\0, and if I look at the output it says h\0i\0 instead of hi\0.
Do you see the problem here? Many thanks!
size_t len16 = 3 * sizeof(wchar_t);
size_t len8 = 7;
wchar_t utf16[3] = { 0x0068, 0x0069, 0x0000 }, *_utf16 = utf16;
char utf8[7], *_utf8 = utf8;
iconv_t utf16_to_utf8 = iconv_open("UTF-8", "UTF-16LE");
size_t result = iconv(utf16_to_utf8, (char **)&_utf16, &len16, &_utf8, &len8);
printf("%d - %s\n", (int)result, utf8);
iconv_close(utf16_to_utf8);

The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units, so if you want to provide hard-coded input data, you need to use a two-byte-wide integral type. (wchar_t is typically four bytes on Linux and macOS, which is why your input ends up with stray zero bytes.)
In C++11 and C11 this should be char16_t, but you can also use uint16_t:
uint16_t data[] = { 0x68, 0x69, 0 };
char const * p = (char const *)data;
To be pedantic, there's nothing in general that says that uint16_t is two bytes: it is exactly 16 bits, but a byte need not be 8 bits. However, iconv is a POSIX interface, and POSIX mandates that CHAR_BIT == 8, so it does hold on POSIX.
(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
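Putting it together, here is a minimal sketch of the corrected conversion (my example, not code from the question; it assumes a little-endian host, so the uint16_t values are laid out in memory as UTF-16LE, and a POSIX system with libiconv):

#include <stdio.h>
#include <stdint.h>
#include <iconv.h>

int main(void)
{
    /* "hi" as two-byte UTF-16 code units, plus a terminator */
    uint16_t utf16[3] = { 0x0068, 0x0069, 0x0000 };
    char utf8[7] = { 0 };

    char *in = (char *)utf16;          /* iconv sees an opaque byte stream */
    char *out = utf8;
    size_t inleft = sizeof utf16;      /* 6 bytes of input */
    size_t outleft = sizeof utf8 - 1;  /* leave room for a terminator */

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    size_t result = iconv(cd, &in, &inleft, &out, &outleft);
    printf("%d - %s\n", (int)result, utf8);  /* expect: 0 - hi */
    iconv_close(cd);
    return 0;
}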


Store multi-byte into char array

The following code works:
char *text = "中文";
printf("%s", text);
Then I'm trying to print this text via its Unicode code points, which are 0x4e2d for "中" and 0x6587 for "文":
And sure enough, nothing prints out.
I'm trying to understand what happens when I store a multi-byte string into a char*, how to print a multi-byte string via its Unicode code points, and furthermore, what is meant by "Format specifier '%ls' requires 'wchar_t *' argument instead of 'wchar_t *'"?
Thanks for any help.
Edit:
I'm on macOS (High Sierra 10.13.6), with CLion
$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
wchar_t *arr = malloc(2 * sizeof(wchar_t));
arr[0] = 0x4e2d;
arr[1] = 0x6587;
First, the above string is not null-terminated. The printf function knows the beginning of the array, but it has no idea where the array ends or what size it has. You have to add a zero at the end to make a null-terminated C string.
To print this null-terminated wide string, use printf("%ls", arr); on Unix-based machines (including Mac); on Windows, use wprintf(L"%s", arr); (that's a completely different thing, it actually treats the string as UTF-16).
Make sure to add setlocale(LC_ALL, "C.UTF-8"); or setlocale(LC_ALL, ""); on Unix-based machines.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "C.UTF-8");

    //print single character:
    printf("%lc\n", 0x00004e2d);
    printf("%lc\n", 0x00006587);
    printf("%lc\n", 0x0001F310);

    wchar_t *arr = malloc((2 + 1) * sizeof(wchar_t));
    arr[0] = 0x00004e2d;
    arr[1] = 0x00006587;
    arr[2] = 0;
    printf("%ls\n", arr);
    return 0;
}
Aside:
In UTF-32, code points always need 4 bytes (for example 0x00004e2d). These can be represented with a 4-byte data type such as char32_t (or wchar_t on POSIX systems, where it is typically 4 bytes).
In UTF-8, code points need 1, 2, 3, or 4 bytes. The UTF-8 encoding of an ASCII character needs one byte, while 中 needs 3 bytes (or 3 char values). You can confirm this by running this code:
printf("A:%zu 中:%zu 🙂:%zu\n", strlen("A"), strlen("中"), strlen("🙂"));
So we can't hold such a character in a single char in UTF-8. We can use strings instead:
const char* x = u8"中";
We can use normal string functions in C, like strcpy etc., but some standard C functions don't work. For example, strchr just can't find 中, because it searches for a single byte. This is usually not a problem, because characters such as printf format specifiers are all ASCII and one byte each.
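For instance, a small sketch (my addition, assuming a C11 compiler for the u8 prefix) of finding 中 with strstr, which matches whole byte sequences where strchr cannot:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text = u8"汉语中文";           /* UTF-8 encoded string */
    const char *hit = strstr(text, u8"中");    /* matches the 3-byte UTF-8 sequence */
    if (hit != NULL)
        printf("found at byte offset %td\n", hit - text);  /* 6: two 3-byte characters precede it */
    return 0;
}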

Convert char to wchar_t using standard library?

I have a function that expects a wchar_t array as a parameter. I don't know of a standard library function to convert from char to wchar_t, so I wrote a quick and dirty function, but I want a reliable solution free from bugs and undefined behavior. Does the standard library have a function that makes this conversion?
My code:
wchar_t *ctow(const char *buf, wchar_t *output)
{
    const char ANSI_arr[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    const wchar_t WIDE_arr[] = L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    size_t n = 0, len = strlen(ANSI_arr);

    while (*buf) {
        for (size_t x = 0; x < len; x++) {
            if (*buf == ANSI_arr[x]) {
                output[n++] = WIDE_arr[x];
                break;
            }
        }
        buf++;
    }
    output[n] = L'\0';
    return output;
}
Well, conversion functions are declared in stdlib.h (*). But you should know that for any character in the Latin-1 (aka ISO-8859-1) charset, the conversion to a wide character is a mere assignment, because the characters with Unicode code points below 256 are exactly the Latin-1 characters.
So if your initial charset is ISO-8859-1, the conversion is simply:
wchar_t *ctow(const char *buf, wchar_t *output) {
    wchar_t *cr = output;   /* remember the start so we can return it */
    while (*buf) {
        /* cast so bytes 128..255 map to U+0080..U+00FF instead of negative values */
        *output++ = (unsigned char)*buf++;
    }
    *output = 0;
    return cr;
}
provided the caller passed a pointer to an array big enough to store all the converted characters plus the terminator.
If you are using any other charset, you will have to use a well-known library like ICU, or build the conversion by hand, which is simple for single-byte charsets (the ISO-8859-x series) and trickier for multibyte ones like UTF-8.
But without knowing the charsets you want to be able to process, I cannot say more...
BTW, plain ASCII is a subset of the ISO-8859-1 charset.
(*) From cplusplus.com
int mbtowc (wchar_t* pwc, const char* pmb, size_t max);
Convert multibyte sequence to wide character
The multibyte character pointed by pmb is converted to a value of type wchar_t and stored at the location pointed by pwc. The function returns the length in bytes of the multibyte character.
mbtowc has its own internal shift state, which is altered as necessary only by calls to this function. A call to the function with a null pointer as pmb resets the state (and returns whether multibyte characters are state-dependent).
The behavior of this function depends on the LC_CTYPE category of the selected C locale.
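For a whole string rather than a single character, here is a minimal sketch (my addition, not from the quoted page) using the related mbstowcs function; it assumes the environment's locale uses UTF-8:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    const char *mb = "héllo";   /* multibyte input in the locale's encoding */
    wchar_t wide[64];

    setlocale(LC_ALL, "");      /* pick up the environment's (UTF-8) locale */

    size_t n = mbstowcs(wide, mb, sizeof wide / sizeof wide[0]);
    if (n == (size_t)-1) {
        fprintf(stderr, "invalid multibyte sequence for this locale\n");
        return 1;
    }
    wprintf(L"%ls (%zu wide characters)\n", wide, n);
    return 0;
}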
It does, in the header wchar.h. It is called btowc:
The btowc function returns WEOF if c has the value EOF or if (unsigned char)c
does not constitute a valid single-byte character in the initial shift state. Otherwise, it
returns the wide character representation of that character.
That isn't a conversion from char to wchar_t; it's a function for destroying data outside of ISO 646. No method in the C library will make that conversion for you. You can look at the ICU4C library. If you are only on Windows, you can look at the relevant functions in the Win32 API (MultiByteToWideChar, WideCharToMultiByte, etc.).
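For the Windows-only route mentioned above, a hedged sketch (my addition) using MultiByteToWideChar, the Win32 call for going from a char string (here UTF-8) to wchar_t (UTF-16):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *utf8 = "hi";
    wchar_t wide[64];

    /* -1 as the source length means "include the terminating NUL" */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 64);
    if (n == 0) {
        printf("conversion failed, error %lu\n", (unsigned long)GetLastError());
        return 1;
    }
    wprintf(L"%ls (%d wide chars including the terminator)\n", wide, n);
    return 0;
}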

UTF-8 to Unicode conversion

I am having problems with converting UTF-8 to Unicode.
Below is the code:
int charset_convert(char *string, char *to_string, char *charset_from, char *charset_to)
{
    char *from_buf, *to_buf, *pointer;
    size_t inbytesleft, outbytesleft, ret;
    size_t TotalLen;
    iconv_t cd;

    if (!charset_from || !charset_to || !string) /* sanity check */
        return -1;

    if (strlen(string) < 1)
        return 0; /* we are done, nothing to convert */

    cd = iconv_open(charset_to, charset_from);
    /* Did I succeed in getting a conversion descriptor ? */
    if (cd == (iconv_t)(-1)) {
        /* I guess not */
        printf("Failed to convert string from %s to %s ",
               charset_from, charset_to);
        return -1;
    }

    from_buf = string;
    inbytesleft = strlen(string);

    /* allocate max sized buffer,
       assuming target encoding may be 4 byte unicode */
    outbytesleft = inbytesleft * 4;
    pointer = to_buf = (char *)malloc(outbytesleft);
    memset(to_buf, 0, outbytesleft);
    memset(pointer, 0, outbytesleft);

    ret = iconv(cd, &from_buf, &inbytesleft, &pointer, &outbytesleft);
    memcpy(to_string, to_buf, pointer - to_buf);
}
main():
int main()
{
    char UTF[] = {'A', 'B'};
    char Unicode[1024] = {0};
    char *ptr;
    int x = 0;
    iconv_t cd;

    charset_convert(UTF, Unicode, "UTF-8", "UNICODE");

    ptr = Unicode;
    while (*ptr != '\0')
    {
        printf("Unicode %x \n", *ptr);
        ptr++;
    }
    return 0;
}
It should give A and B, but I am getting:
ffffffff
fffffffe
41
Thanks,
Sandeep
It looks like you are getting UTF-16 out in a little endian format:
ff fe 41 00 ...
Which is U+FEFF (ZWNBSP aka byte order mark), U+0041 (latin capital letter A), ...
You then stop printing because your while loop has terminated on the first null byte. The following bytes should be: 42 00.
You should either return a length from your function or make sure that the output is terminated with a null character (U+0000) and loop until you find this.
UTF-8 is Unicode.
You do not need to convert unless you need some other type of Unicode encoding, like UTF-16 or UTF-32.
UTF is not Unicode; a UTF is an encoding of the integers in the Unicode standard. The question, as is, makes no sense. If you mean you want to convert from (any) UTF to the Unicode code point (i.e. the integer that stands for an assigned code point, roughly a character), then you need to do a bit of reading, but it involves bit-shifting the values of the 1, 2, 3 or 4 bytes in a UTF-8 byte sequence (see Wikipedia; Markus Kuhn's text is also excellent).
Unless I am missing something (nobody has pointed it out yet), "UNICODE" isn't a valid encoding name in libiconv, as it is the name of a family of encodings.
http://www.gnu.org/software/libiconv/
(edit) Actually, iconv -l shows UNICODE as a listed entry but gives no details. In the source code it's listed in the notes as an alias for UNICODE-LITTLE, but in the subnotes it mentions:
* UNICODE (big endian), UNICODEFEFF (little endian)
We DON'T implement these because they are stupid and not standardized.
In the aliases header files UNICODELITTLE (no hyphen) resolves as follows:
lib/aliases.gperf:UNICODELITTLE, ei_ucs2le
i.e. UCS2-LE (UTF-16 Little Endian), which should match Windows internal "Unicode" encoding.
http://en.wikipedia.org/wiki/UTF-16/UCS-2
However, you are clearly recommended to explicitly specify UCS2-LE or UCS2-BE unless the first bytes are a byte order mark (BOM), value 0xFEFF, indicating the byte order scheme.
=> You are seeing the BOM as the first bytes of the output because that is what the "UNICODE" encoding name means: UCS2 with a header indicating the byte order scheme.
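Rolling those suggestions together, a minimal sketch (my code, not Sandeep's) that requests "UTF-16LE" explicitly so no BOM is emitted, uses the number of bytes produced instead of scanning for a terminator, and casts to unsigned char before printing so the bytes don't sign-extend to ffffffff:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "AB";
    char out[16] = { 0 };
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");  /* explicit byte order, so no BOM */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    size_t produced = sizeof out - outleft;        /* bytes iconv actually wrote */
    for (size_t i = 0; i < produced; i++)
        printf("%02x ", (unsigned char)out[i]);    /* expect: 41 00 42 00 */
    printf("\n");
    return 0;
}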

wchar_t to octets - in C?

I'm trying to store a wchar_t string as octets, but I'm positive I'm doing it wrong - would anybody mind validating my attempt? What's going to happen when one character consumes 4 bytes?
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i = 0; i < wcslen(wchar1); i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
}
Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the latter has a fixed size per code point. UTF-16 uses surrogate pairs for code points in the upper planes; such a code point takes two wchar_t values. UTF-8 is byte-oriented; it uses between 1 and 4 bytes to encode a code point.
UTF-8 is an excellent choice if you need to transcode the text to a byte-oriented stream, and a very common choice for text files and HTML encoding on the Internet. If you use Windows then you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
Be careful to avoid byte encodings that translate text to a code page, such as wcstombs(). They are lossy: characters that don't have a corresponding character code in the code page are replaced by ?.
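To make the "1 to 4 bytes" point concrete, here is a small hand-rolled encoder (my sketch, not part of the answer above) that turns a single code point into its UTF-8 bytes:

#include <stdio.h>
#include <stdint.h>

/* Writes the UTF-8 encoding of code point cp into out (at least 4 bytes), returns the length. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {                               /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x4E2D, buf);      /* U+4E2D is 中 */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);           /* expect: E4 B8 AD */
    printf("\n");
    return 0;
}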
You can use the wcstombs() (widechar string to multibyte string) function provided in stdlib.h
The prototype is as follows:
#include <stdlib.h>
size_t wcstombs(char *dest, const wchar_t *src, size_t n);
It will correctly convert your wchar_t string provided by src into a char (a.k.a. octets) string and write it to dest with at most n bytes.
wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
size_t i, length;

setlocale(LC_ALL, "");  /* needed so the non-ASCII characters can be converted */
memset(mb_string, 0, 512);
length = wcstombs(mb_string, wide_string, 511);
/* mb_string will be zero terminated if it wasn't cancelled by reaching the limit
 * before being finished with converting. If the limit WAS reached, the string
 * will not be zero terminated and you must do it yourself - not happening here.
 * On failure, wcstombs returns (size_t)-1, which should be checked before looping. */
for (i = 0; i < length; i++)
    printf("Octet #%zu: '%02x'\n", i, (unsigned char)mb_string[i]);
If you're trying to see the content of the memory buffer holding the string, you can do this:
size_t i, len = wcslen(str) * sizeof(wchar_t);
const char *ptr = (const char *)str;
for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned char)ptr[i]);  /* cast so negative char values don't print sign-extended */
}
I don't know why printf and wprintf do not work together. The following code works.
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i = 0; i < wcslen(wchar1); i++)
{
    wprintf(L"(%d)", (wchar1[i]) & 255);
    wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}
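As an aside (my addition), the usual explanation for printf and wprintf not mixing is stream orientation: a stream becomes byte- or wide-oriented on its first output, and calling the other family on it afterwards is undefined. You can query the orientation with fwide:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wprintf(L"first output is wide\n");   /* stdout becomes wide-oriented here */
    if (fwide(stdout, 0) > 0)             /* > 0 means the stream is wide-oriented */
        fputws(L"byte-oriented printf on stdout would now be undefined\n", stdout);
    return 0;
}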

Is there an easy way to convert a number to hexadecimal ASCII chars in C?

I am working on a C firmware program for an embedded device. I want to send an array of hex char values over the serial port. Is there a simple way to convert a value to ASCII hex?
For example if the array contains 0xFF, I want to send out the ASCII string "FF", or for a hex value of 0x3B I want to send out "3B".
How is this typically done?
I already have the serial send functions in place so that I can do this...
char msg[] = "Send this message";
SendString(msg);
and the SendString function calls this function for each element in the passed array:
// This function sends out a single character over the UART
int SendU(int c)
{
    while (U1STAbits.UTXBF);   // spin while the transmit buffer is full
    U1TXREG = c;
    return c;
}
I am looking for a function that will allow me to do this...
char HexArray[5] = {0x4D, 0xFF, 0xE3, 0xAA, 0xC4};
SendHexArray(HexArray);
//Output "4D, FF, E3, AA, C4"
Write the number to a string using sprintf and then use your existing SendString function to send that over the UART. You can, of course, do this one number at a time:
char num_str[3];
sprintf( num_str, "%02X", 0xFF );
SendString(num_str);
%02X is a format string for the printf family of functions, it says pad the element with 0s until width 2 and format the element as a hexadecimal number.
The 02 part ensures that when you want to print 0x0F that you get 0F instead of just F in the output stream. If you use a lowercase x you'll get lowercase characters (e.g. ff instead of FF).
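For the SendHexArray interface the question asks for, here is a hedged sketch built on that same sprintf call (SendString and the comma-separated output format come from the question; the rest is my assumption, not part of the original answer):

#include <stdio.h>

void SendString(char *msg);                   /* assumed to exist, per the question */

void SendHexArray(const unsigned char *data, size_t len)
{
    char buf[4];                              /* room for "FF" plus a terminator */
    size_t i;

    for (i = 0; i < len; i++) {
        sprintf(buf, "%02X", data[i]);
        SendString(buf);
        if (i + 1 < len)
            SendString(", ");                 /* produces "4D, FF, E3, AA, C4" */
    }
}

Calling it with the question's data would look like SendHexArray(HexArray, sizeof HexArray), with HexArray declared as unsigned char.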
The classic trick from the era of 8-bit micros and assembly language is to break the conversion of a nybble into two segments: the values from 0 to 9 and the values from 10 to 15. Then simple arithmetic saves the 16-byte lookup table.
void SendDigit(int c) {
    c &= 0x0f;
    c += (c <= 9) ? '0' : 'A' - 10;
    SendU(c);
}

void SendArray(const unsigned char *msg, size_t len) {
    while (len--) {
        unsigned char c = *msg++;
        SendDigit(c >> 4);
        SendDigit(c);
    }
}
A couple of side notes are in order. First, this works because the digits and letters are each in contiguous spans in ASCII. If you are unfortunate enough to want EBCDIC, this still works as the letters 'A' through 'F' are contiguous there as well (but the rest of the alphabet is broken into three spans).
Second, I've changed the signature of SendArray(). My version is careful to make the buffer be unsigned, which is generally safer when planning to promote the bytes to some larger integral type to do arithmetic. If they are signed, then code like nibble[(*msg)>>4] might try to use a negative index into the array and the result won't be at all useful.
Finally, I added a length parameter. For a general binary dump, you probably don't have any byte value that makes sense to use as an end sentinel. For that, a count is much more effective.
Edit: Fixed a bug: for digits over 10, the arithmetic is c + 'A' - 10, not c + 'A'. Thanks to Brooks Moses for catching that.
I'd start with an array lookup.
char *asciihex[] = {
    "00", "01", "02", ..., "0F",
    "10", "11", "12", ..., "1F",
    ...,
    "F0", "F1", "F2", ..., "FF"
};
and then simply look it up ...
SendString(asciihex[val]);
Edit
Incorporating Dysaster's nibbles idea:
void SendString(const char *msg) {
    static const char nibble[] = {'0', '1', '2', ..., 'F'};
    while (*msg) {
        /* cast to unsigned char before bit operations (thanks RBerteig) */
        SendU(nibble[(((unsigned char)*msg) & 0xf0) >> 4]); /* mask 4 top bits too, in case CHAR_BIT > 8 */
        SendU(nibble[((unsigned char)*msg) & 0x0f]);
        msg++;
    }
}
sprintf will do it.
sprintf (dest, "%X", src);
If your compiler supports it, you can use itoa. Otherwise, I'd use sprintf as in Nathan's & Mark's answers. If itoa is supported and performance is an issue, try some testing to determine which is faster (past experience leads me to expect itoa to be faster, but YMMV).
