I want to convert a 32-bit value, which represents a Unicode code point, into the sequence of chars that is the UTF-8 encoding of the character corresponding to that code point.
For example, I want to turn the value 955 into the utf-8 encoded string "λ".
I tried to do this using iconv, but I could not get the desired result. Here is the code that I wrote:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>

int main(void)
{
    uint32_t codepoint = U'λ';
    char *input = (char *) &codepoint;
    size_t in_size = 2; // lower-case lambda is a 16-bit character (0x3BB = 955)
    char output_buffer[10];
    char *output = output_buffer;
    size_t out_size = 10;

    iconv_t cd = iconv_open("UTF-8", "UTF-32");
    iconv(cd, &input, &in_size, &output, &out_size);
    puts(output_buffer);
    return 0;
}
When I run it, only a newline is printed (puts automatically appends a newline; the first byte of output_buffer is '\0').
What is wrong with my understanding or my implementation?
As minitech said, you must use in_size = 4 for UTF-32 in a uint32_t, and you must zero the buffer beforehand so that the result is null-terminated after conversion.
This code works on Ubuntu:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t codepoint = 955;
    char *input = (char *) &codepoint;
    size_t in_size = 4; // UTF-32 code units are always 4 bytes (0x3BB = 955)
    char output_buffer[10];
    memset(output_buffer, 0, sizeof(output_buffer));
    char *output = output_buffer;
    size_t out_size = 10;

    iconv_t cd = iconv_open("UTF-8", "UTF-32");
    iconv(cd, &input, &in_size, &output, &out_size);
    puts(output_buffer);
    return 0;
}
Two problems:
Since you’re using UTF-32, you need to specify 4 bytes. The “lower-case lambda is a 16-bit character (0x3BB = 955)” comment isn’t true for a 4-byte fixed-width encoding; it’s 0x000003bb. Set size_t in_size = 4;.
iconv doesn’t add null terminators for you; it adjusts the pointers it’s given. You’ll want to add your own before calling puts.
*output = '\0';
puts(output_buffer);
Related
If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?
Similarly for the 3 and 4 byte UTF-8 characters?
I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.
I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that I have to call setlocale before using mbstowcs.
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main()
{
    char *localeInfo = setlocale(LC_ALL, "en_US.utf8");
    printf("Locale information set to %s\n", localeInfo);

    const char c1 = '\xce';
    const char c2 = '\xb8';

    int byteCount = 2;
    char *mbS = (char *) malloc(byteCount + 1);
    mbS[0] = c1;
    mbS[1] = c2;
    mbS[byteCount] = 0; // null terminator

    printf("Directly using printf: %s\n", mbS);

    int requiredSize = mbstowcs(NULL, mbS, 0);
    printf("Output size including null terminator is %d\n\n", requiredSize + 1);

    wchar_t *wideOutput = (wchar_t *) malloc((requiredSize + 1) * sizeof(wchar_t));
    int len = mbstowcs(wideOutput, mbS, requiredSize + 1);
    if (len == -1) {
        printf("Failed conversion!");
    } else {
        printf("Converted %d character(s). Result: %ls\n", len, wideOutput);
    }
    return 0;
}
Output:
Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2
Converted 1 character(s). Result: θ
For 3 or 4 byte utf8 characters, one can use a similar approach.
If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?
They already are a UTF-8 character. You would just print them.
putchar(c1);
putchar(c2);
It's up to your terminal or whatever device you are using to display the output to properly understand and render the UTF-8 encoding. This is unrelated to encoding used by your program and unrelated to wide characters.
Similarly for the 3 and 4 byte UTF-8 characters?
You would output them.
If your terminal or the device you are sending the bytes to does not understand UTF-8, then you have to convert the bytes to something the device understands. Typically, you would use an external library for that, like iconv. Alternatively, you could setlocale(LC_ALL, "C.utf-8"), convert your bytes to wchar_t, then setlocale(LC_ALL, "C.your_target_encoding") and either convert the wide characters to that encoding or output them with %ls. All %ls does (on common systems) is convert the string back to multibyte and then output it. A wide stream writing to a terminal does the same: it converts first, then outputs.
I'm trying to use iconv(3) to convert a wide-character string to UTF-8 using the code below. When I run the below, the iconv call returns E2BIG, as if there were not enough bytes of space available in the output buffer. This occurs despite the fact that (I think) I have sized the output buffer to admit the worst-case expansion for UTF-8. In fact, given that the input is a simple ASCII 'A' encoded as a wchar_t followed by a zero wchar_t terminator, the output should be exactly two bytes/chars: an 'A' followed by a '\0'.
'man utf-8' on my Linux system says that the maximum length of a UTF-8 byte sequence is 6 bytes, so I believe that for an input buffer of 2 wchar_ts (a character followed by the null terminator), making (on my system) 8 bytes total (since sizeof(wchar_t) == 4), a buffer of 12 bytes (2 * UTF8_SEQUENCE_MAXLEN) should be sufficient.
By experiment, if I increase UTF8_SEQUENCE_MAXLEN to 16, iconv's return value indicates success (15 still fails). But I cannot see any way that any wchar_t value would occupy so many bytes when encoded in UTF-8.
Have I gone wrong in my calculations? Are 16-byte UTF-8 sequences possible? What have I done wrong?
#include <stdio.h>
#include <stdlib.h>
#include <iconv.h>
#include <wchar.h>

#define UTF8_SEQUENCE_MAXLEN 6
/* #define UTF8_SEQUENCE_MAXLEN 16 */

int
main(int argc, char **argv)
{
    wchar_t *wcs = L"A";
    signed char utf8[(1 /* wcslen(wcs) */ + 1 /* L'\0' */) * UTF8_SEQUENCE_MAXLEN];
    char *iconv_in = (char *) wcs;
    char *iconv_out = (char *) &utf8[0];
    size_t iconv_in_bytes = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t iconv_out_bytes = sizeof(utf8);
    size_t ret;
    iconv_t cd;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if ((iconv_t) -1 == cd) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
    if ((size_t) -1 == ret) {
        perror("iconv");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
The arguments to iconv_open are the wrong way around.
The order of arguments is (to, from), not (from, to), as is clearly stated in the manpage.
Consequently, changing
iconv_open("WCHAR_T", "UTF-8");
to
iconv_open("UTF-8", "WCHAR_T");
causes the (otherwise unchanged) code above to work as expected.
D'oh. Need to read manpages more closely.
Is there a library function that creates a random string in the same way that mkstemp() creates a unique file name? What is it?
There's no standard function, but your OS might implement something. Have you considered searching through the manuals? Alternatively, this task is simple enough. I'd be tempted to use something like:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void rand_str(char *, size_t);

int main(void) {
    char str[] = { [41] = '\1' }; // make the last character non-zero so we can test based on it later
    rand_str(str, sizeof str - 1);
    assert(str[41] == '\0'); // test the correct insertion of string terminator
    puts(str);
}

void rand_str(char *dest, size_t length) {
    char charset[] = "0123456789"
                     "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    while (length-- > 0) {
        // Divide by RAND_MAX + 1 so the index never reaches sizeof charset - 1,
        // which would pick up the charset's terminating '\0'.
        size_t index = (double) rand() / ((double) RAND_MAX + 1) * (sizeof charset - 1);
        *dest++ = charset[index];
    }
    *dest = '\0';
}
This has the neat benefit of working correctly on EBCDIC systems, and being able to accommodate virtually any character set. I haven't added any of the following characters into the character set, because it seems clear that you want strings that could be filenames:
":;?#[\]^_`{|}"
I figured many of those characters could be invalid in filenames on various OSes.
There's no built-in API, but on *nix systems you may use /dev/urandom like:
FILE *f = fopen("/dev/urandom", "r");
if (!f) ...
fread(binary_string, 1, string_length, f);
fclose(f);
Note that this produces binary data, not string data, so you may have to filter it afterwards.
You may also use the standard pseudorandom generator rand():
#include <time.h>
#include <stdlib.h>
// In main:
srand(time(NULL));
for (int i = 0; i < string_length; ++i) {
    string[i] = '0' + rand() % 72; // characters from '0' (48) up to 'w' (119)
}
And if you need a truly random string, look into cryptographically secure random number generation, one of cryptography's difficult problems, which still has no perfect solution :)
Given a char, how do I convert it to the two hex-digit chars that represent its binary value?
For example, a char is one byte, say 01010100, which is 0x54; I need the char array "54".
Actually it would be:
char c = 84;
char result[3];
sprintf(result,"%02x",c);
This is all far too easily readable :-)
#define H(x) '0' + (x) + ((x)>9) * 7
char c = 84;
char result[3] = { H(c>>4), H(c&15) };
The following code, using snprintf() should work:
#include <stdio.h>
#include <string.h>

int main()
{
    char myChar = 'A'; // A = 0x41 = 65
    char myHex[3];     // two hex digits plus the null terminator

    snprintf(myHex, sizeof(myHex), "%02x", myChar);

    // Print the contents of myHex
    printf("myHex = %s\n", myHex);
}
snprintf() is a function that works like printf(), except that it fills a char array with maximum N characters. The syntax of snprintf() is:
int snprintf(char *str, size_t size, const char *format, ...)
Where str is the string to "sprint" to, size is the maximum number of characters to write including the null terminator (in our case, 3: two hex digits plus '\0'), and the rest is like the normal printf()
I'm trying to store a wchar_t string as octets, but I'm positive I'm doing it wrong - anybody mind to validate my attempt? What's going to happen when one char will consume 4 bytes?
unsigned int i;
const wchar_t *wchar1 = L"abc";

wprintf(L"%ls\r\n", wchar1);

for (i = 0; i < wcslen(wchar1); i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
}
Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the latter has a fixed size for a glyph. UTF-16 uses surrogates for codepoints in the upper planes, such a glyph uses 2 wchar_t. UTF-8 is byte oriented, it uses between 1 and 4 bytes to encode a codepoint.
UTF-8 is an excellent choice if you need to transcode the text to a byte oriented stream. A very common choice for text files and HTML encoding on the Internet. If you use Windows then you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
Be careful to avoid byte encodings that translate text to a code page, such as wcstombs(). They are lossy encodings, glyphs that don't have a corresponding character code in the code page are replaced by ?.
You can use the wcstombs() (widechar string to multibyte string) function provided in stdlib.h
The prototype is as follows:
#include <stdlib.h>
size_t wcstombs(char *dest, const wchar_t *src, size_t n);
It will correctly convert your wchar_t string provided by src into a char (a.k.a. octets) string and write it to dest with at most n bytes.
wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
int i, length;

memset(mb_string, 0, 512);
length = wcstombs(mb_string, wide_string, 511);

/* mb_string will be zero terminated if it wasn't cancelled by reaching the limit
 * before being finished with converting. If the limit WAS reached, the string
 * will not be zero terminated and you must do it yourself - not happening here */

for (i = 0; i < length; i++)
    printf("Octet #%d: '%02x'\n", i, (unsigned char) mb_string[i]);
If you're trying to see the content of the memory buffer holding the string, you can do this:
size_t i, len = wcslen(str) * sizeof(wchar_t);
const char *ptr = (const char *) str;

for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned char) ptr[i]);
}
Mixing printf and wprintf on the same stream doesn't work: the first I/O operation fixes the stream's orientation (byte or wide), and calls of the other kind fail afterwards. The following code works.
unsigned int i;
const wchar_t *wchar1 = L"abc";

wprintf(L"%ls\r\n", wchar1);

for (i = 0; i < wcslen(wchar1); i++) {
    wprintf(L"(%d)", (wchar1[i]) & 255);
    wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}