How to use iconv(3) to convert a wide string to UTF-8?

I'm trying to use iconv(3) to convert a wide-character string to UTF-8 with the code below. When I run it, the iconv call fails with errno set to E2BIG, as if there were not enough bytes of space available in the output buffer. This happens despite the fact that (I think) I have sized the output buffer for the worst-case UTF-8 expansion. In fact, given that the input is a simple ASCII 'A' encoded as a wchar_t followed by a zero wchar_t terminator, the output should be exactly two bytes/chars: an 'A' followed by a '\0'.
'man utf-8' on my Linux system says that the maximum length of a UTF-8 byte sequence is 6 bytes, so I believe that for an input buffer of 2 wchar_ts (a character followed by the null terminator), making (on my system) 8 bytes total (since sizeof(wchar_t) == 4), a buffer of 12 bytes (2 * UTF8_SEQUENCE_MAXLEN) should be sufficient.
By experiment, if I increase UTF8_SEQUENCE_MAXLEN to 16, iconv's return value indicates success (15 still fails). But I cannot see any way that any wchar_t value would occupy so many bytes when encoded in UTF-8.
Have I gone wrong in my calculations? Are 16-byte UTF-8 sequences possible? What have I done wrong?
#include <stdio.h>
#include <stdlib.h>
#include <iconv.h>
#include <wchar.h>

#define UTF8_SEQUENCE_MAXLEN 6
/* #define UTF8_SEQUENCE_MAXLEN 16 */

int
main(int argc, char **argv)
{
    wchar_t *wcs = L"A";
    signed char utf8[(1 /* wcslen(wcs) */ + 1 /* L'\0' */) * UTF8_SEQUENCE_MAXLEN];
    char *iconv_in = (char *) wcs;
    char *iconv_out = (char *) &utf8[0];
    size_t iconv_in_bytes = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t iconv_out_bytes = sizeof(utf8);
    size_t ret;
    iconv_t cd;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if ((iconv_t) -1 == cd) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
    if ((size_t) -1 == ret) {
        perror("iconv");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

The arguments to iconv_open are the wrong way around.
The order of arguments is (to, from), not (from, to), as is clearly stated in the manpage.
Consequently, changing
iconv_open("WCHAR_T", "UTF-8");
to
iconv_open("UTF-8", "WCHAR_T");
causes the (otherwise unchanged) code above to work as expected.
This also explains the mysterious 16: with the arguments reversed, iconv was converting UTF-8 to WCHAR_T, so the 8 input bytes ('A' plus seven NUL bytes, each a valid one-byte UTF-8 sequence) would produce 8 wchar_ts of 4 bytes each, i.e. 32 bytes of output, which is exactly what a buffer of 2 * 16 bytes provides.
D'oh. Need to read manpages more closely.
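For what it's worth, if the goal is just wide-to-multibyte in the current locale, the standard wcstombs(3) can do this without iconv. A minimal sketch, assuming a UTF-8 locale such as en_US.utf8 is installed:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (setlocale(LC_ALL, "en_US.utf8") == NULL) { /* assumed locale */
        fprintf(stderr, "locale not available\n");
        return EXIT_FAILURE;
    }

    const wchar_t *wcs = L"A";
    char utf8[16];

    /* Convert the wide string to the locale's multibyte encoding (UTF-8 here) */
    size_t n = wcstombs(utf8, wcs, sizeof utf8);
    if (n == (size_t) -1) {
        fprintf(stderr, "wcstombs failed\n");
        return EXIT_FAILURE;
    }
    printf("converted %zu bytes: %s\n", n, utf8);
    return EXIT_SUCCESS;
}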

Related

In C, how to print UTF-8 char if given its bytes in char variables?

If I have c1, c2 as char variables (such that c1c2 would be the byte sequence for a UTF-8 character), how do I create and print the UTF-8 character?
Similarly for the 3 and 4 byte UTF-8 characters?
I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.
I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that I have to call setlocale before using mbstowcs.
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main()
{
    char* localeInfo = setlocale(LC_ALL, "en_US.utf8");
    printf("Locale information set to %s\n", localeInfo);

    const char c1 = '\xce';
    const char c2 = '\xb8';
    int byteCount = 2;

    char* mbS = (char*) malloc(byteCount + 1);
    mbS[0] = c1;
    mbS[1] = c2;
    mbS[byteCount] = 0; // null terminator

    printf("Directly using printf: %s\n", mbS);

    int requiredSize = mbstowcs(NULL, mbS, 0);
    printf("Output size including null terminator is %d\n\n", requiredSize + 1);

    wchar_t *wideOutput = (wchar_t *) malloc((requiredSize + 1) * sizeof(wchar_t));
    int len = mbstowcs(wideOutput, mbS, requiredSize + 1);
    if (len == -1) {
        printf("Failed conversion!");
    } else {
        printf("Converted %d character(s). Result: %ls\n", len, wideOutput);
    }
    return 0;
}
Output:
Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2
Converted 1 character(s). Result: θ
For 3- or 4-byte UTF-8 characters, one can use a similar approach.
If I have c1, c2 as char variables (such that c1c2 would be the byte sequence for a UTF-8 character), how do I create and print the UTF-8 character?
They are already a UTF-8 character. You would just print them:
putchar(c1);
putchar(c2);
It's up to your terminal or whatever device you are using to display the output to properly understand and render the UTF-8 encoding. This is unrelated to the encoding used by your program and unrelated to wide characters.
Similarly for the 3 and 4 byte UTF-8 characters?
You would output them.
If your terminal or the device you are sending the bytes to does not understand UTF-8, then you have to convert the bytes to something the device does understand. Typically you would use an external library for that, like iconv. Alternatively, you could setlocale(LC_ALL, "C.utf-8"), convert your bytes to wchar_t, then setlocale(LC_ALL, "C.your_target_encoding") and either convert the wide characters to that encoding or output them with %ls. All %ls does (on common systems) is convert the string back to multibyte and then output it; a wide stream writing to a terminal does the same, first converting, then outputting.
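To make that concrete, here is a minimal sketch (assuming a UTF-8-capable terminal) that writes 2-, 3-, and 4-byte UTF-8 sequences directly, without any wide-character machinery:

#include <stdio.h>

int main(void)
{
    /* Raw UTF-8 byte sequences: θ (2 bytes), € (3 bytes), 𝄞 (4 bytes). */
    const char theta[] = "\xce\xb8";
    const char euro[]  = "\xe2\x82\xac";
    const char clef[]  = "\xf0\x9d\x84\x9e";

    fwrite(theta, 1, sizeof theta - 1, stdout);
    fwrite(euro,  1, sizeof euro  - 1, stdout);
    fwrite(clef,  1, sizeof clef  - 1, stdout);
    putchar('\n');
    return 0;
}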

Function is returning a different value every time?

I'm trying to convert a hexadecimal INT to a char so I could convert it into a binary to count the number of ones in it. Here's my function to convert it into char:
#include <stdio.h>
#include <stdlib.h>

#define shift(a) a=a<<5
#define parity_even(a) a = a+0x11
#define add_msb(a) a = a + 8000

void count_ones(int hex){
    char *s = malloc(2);
    sprintf(s, "0x%x", hex);
    free(s);
    printf("%x", s);
};

int main() {
    int a = 0x01B9;
    shift(a);
    parity_even(a);
    count_ones(a);
    return 0;
}
Every time I run this, I always get a different output, but the last three hex digits are always the same. Examples of outputs:
8c0ba2a0
fc3b92a0
4500a2a0
d27e82a0
c15d62a0
What exactly is happening here? I allocated 2 bytes for the char since my hex int is 2 bytes.
It's too long to write a comment so here goes:
I'm trying to convert a hexadecimal INT
An int is stored as a group of value bits, padding bits (possibly none) and a sign bit, so there is no such thing as a hexadecimal int, but you can represent (print) a given number in hexadecimal format.
convert a ... INT to a char
That would be a lossy conversion, as an int might have 4 bytes of data that you are trying to cram into 1 byte. char specifically may be signed or unsigned. You probably mean a string (generic term) or char [] (the standard way to represent a string in C).
binary to count the number of ones
That's the real issue you are trying to solve, and it is a duplicate of the following (a minimal sketch of one such approach follows the links):
How to count the number of set bits in a 32-bit integer?
count number of ones in a given integer using only << >> + | & ^ ~ ! =
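For completeness, one classic technique from those links is Brian Kernighan's trick; a minimal sketch:

/* Counts set bits by clearing the lowest set bit on each iteration
   (Brian Kernighan's method). */
unsigned count_set_bits(unsigned v) {
    unsigned count = 0;
    while (v) {
        v &= v - 1; /* clears the lowest set bit */
        count++;
    }
    return count;
}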
To address the question you ask:
You need to allocate more than 2 bytes. Specifically, you need ceil(log16(hex)) + 2 (for "0x") + 1 (for the trailing '\0') bytes.
One way to get the size is to just ask snprintf(s, 0, ...), then allocate a suitable array via malloc (see the first implementation below) or use a stack-allocated variable length array (VLA).
You can use INT_MAX instead of hex to get an upper bound: log16(INT_MAX) <= CHAR_BIT * sizeof(int) / 4, and the latter is a compile-time constant. This means you can allocate your string on the stack (see the second implementation below).
It's undefined behavior to use a variable after it's deallocated. Move free() to after the last use.
Here is one of the dynamic versions mentioned above:
void count_ones(unsigned hex) {
    char *s = NULL;
    size_t n = snprintf(s, 0, "0x%x", hex) + 1;
    s = malloc(n);
    if (!s) return; // memory could not be allocated
    snprintf(s, n, "0x%x", hex);
    printf("%s (size = %zu)", s, n);
    free(s);
}
Note, I initialized s to NULL, which would cause the first call to snprintf() to return an undefined value on SUSv2 (legacy); it's well defined in C99 and later. The output is:
0x3731 (size = 7)
And the compile-time version using a fixed upper bound:
#include <limits.h>

// compile-time upper bound: CHAR_BIT * sizeof(int) / 4 hex digits,
// plus "0x" and the trailing '\0'
void count_ones(unsigned hex) {
    char s[CHAR_BIT * sizeof(int) / 4 + 3];
    sprintf(s, "0x%x", hex);
    printf("%s (size = %zu)", s, sizeof s);
}
and the output is:
0x3731 (size = 11)
Your biggest problem is that malloc isn't allocating enough. As Barmar said, you need at least 7 bytes to store it, or you could calculate the amount needed. Another problem is that you free the buffer and then use it. The use is only one line after the free, so nothing bad will happen maybe 99.9% of the time, but you should only free memory once you know you are done using it.

Converting Unicode codepoints to UTF-8 in C using iconv

I want to convert a 32-bit value, which represents a Unicode codepoint, into a sequence of chars which is the utf-8 encoded string containing only the character corresponding to the codepoint.
For example, I want to turn the value 955 into the utf-8 encoded string "λ".
I tried to do this using iconv, but I could not get the desired result. Here is the code that I wrote:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>

int main(void)
{
    uint32_t codepoint = U'λ';
    char *input = (char *) &codepoint;
    size_t in_size = 2; // lower-case lambda is a 16-bit character (0x3BB = 955)

    char output_buffer[10];
    char *output = output_buffer;
    size_t out_size = 10;

    iconv_t cd = iconv_open("UTF-8", "UTF-32");
    iconv(cd, &input, &in_size, &output, &out_size);

    puts(output_buffer);
    return 0;
}
When I run it, only a newline is printed (puts automatically prints a newline; the first byte of output_buffer is '\0').
What is wrong with my understanding or my implementation?
As minitech said, you must use size = 4 for UTF-32 in a uint32_t, and you must zero-fill the buffer beforehand so there is a terminating null after the conversion.
This code works on Ubuntu:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
#include <memory.h>

int main(void)
{
    uint32_t codepoint = 955;
    char *input = (char *) &codepoint;
    size_t in_size = 4; // a UTF-32 code unit is 4 bytes, even for U+03BB

    char output_buffer[10];
    memset(output_buffer, 0, sizeof(output_buffer));
    char *output = output_buffer;
    size_t out_size = 10;

    iconv_t cd = iconv_open("UTF-8", "UTF-32");
    iconv(cd, &input, &in_size, &output, &out_size);

    puts(output_buffer);
    return 0;
}
Two problems:
Since you’re using UTF-32, you need to specify 4 bytes. The “lower-case lambda is a 16-bit character (0x3BB = 955)” comment isn’t true for a 4-byte fixed-width encoding; it’s 0x000003bb. Set size_t in_size = 4;.
iconv doesn’t add null terminators for you; it adjusts the pointers it’s given. You’ll want to add your own before calling puts.
*output = '\0';
puts(output_buffer);
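As an aside, for a single codepoint you don't strictly need iconv at all: the UTF-8 bit layout is simple enough to produce by hand. Below is a minimal sketch; the helper name encode_utf8 is made up for illustration:

#include <stdio.h>
#include <stdint.h>

/* Encodes one Unicode codepoint as UTF-8 into buf (at least 5 bytes),
   NUL-terminates it, and returns the number of bytes written (0 if invalid). */
static size_t encode_utf8(uint32_t cp, char *buf) {
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        buf[0] = (char) cp;
        buf[1] = '\0';
        return 1;
    } else if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (char) (0xC0 | (cp >> 6));
        buf[1] = (char) (0x80 | (cp & 0x3F));
        buf[2] = '\0';
        return 2;
    } else if (cp < 0x10000) {             /* 3 bytes */
        buf[0] = (char) (0xE0 | (cp >> 12));
        buf[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char) (0x80 | (cp & 0x3F));
        buf[3] = '\0';
        return 3;
    } else if (cp < 0x110000) {            /* 4 bytes */
        buf[0] = (char) (0xF0 | (cp >> 18));
        buf[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char) (0x80 | (cp & 0x3F));
        buf[4] = '\0';
        return 4;
    }
    return 0; /* not a valid Unicode codepoint */
}

int main(void)
{
    char buf[5];
    if (encode_utf8(955, buf)) /* U+03BB */
        puts(buf);             /* prints λ on a UTF-8 terminal */
    return 0;
}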

C Programming : how do I read and print out a byte from a binary file?

I wish to open a binary file, read the first byte of the file, and finally print the hex value (in string format) to stdout (i.e., if the first byte is 03 hex, I wish to print out 0x03, for example). The output I get does not correspond with what I know to be in my sample binary, so I am wondering if someone can help with this.
Here is the code:
#include <stdio.h>
#include <fcntl.h>

int main(int argc, char* argv[])
{
    int fd;
    char raw_buf[1], str_buf[1];

    fd = open(argv[1], O_RDONLY|O_BINARY);

    /* Position at beginning */
    lseek(fd, 0, SEEK_SET);
    /* Read one byte */
    read(fd, raw_buf, 1);

    /* Convert to string format */
    sprintf(str_buf, "0x%x", raw_buf);
    printf("str_buf= <%s>\n", str_buf);

    close(fd);
    return 0;
}
The program is compiled as follows:
gcc rd_byte.c -o rd_byte
and run as follows:
rd_byte BINFILE.bin
Knowing that the sample binary file used has 03 as its first byte, I get the output:
str_buf= <0x22cce3>
What I expect is
str_buf= <0x03>
Where is the error in my code?
Thank you for any help.
You're printing the value of the pointer raw_buf, not the memory at that location:
sprintf(str_buf,"0x%x",raw_buf[0]);
As Andreas said, str_buf is also not big enough. But there's no need for a second buffer; you could just call printf directly.
printf("0x%x",raw_buf[0]);
Less is more...
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    int fd;
    unsigned char c;

    /* needs error checking */
    fd = open(argv[1], O_RDONLY);
    read(fd, &c, sizeof(c));
    close(fd);

    printf("<0x%x>\n", c);
    return 0;
}
Seeking is not needed.
If you want to read a byte, use an unsigned char.
printf will do the formatting.
I think that you are overcomplicating things and using non-portable constructs where they aren't really necessary.
You should be able to just do:
#include <stdio.h>

int main(int argc, char** argv)
{
    if (argc < 2)
        return 1; /* TODO: better error handling */

    FILE* f = fopen(argv[1], "rb");
    /* TODO: check f is not NULL */

    /* Read one byte */
    int first = fgetc(f);
    if (first != EOF)
        printf("first byte = %x\n", (unsigned)first);
    /* TODO else read failed, empty file?? */

    fclose(f);
    return 0;
}
str_buf has a maximum size of 1 (char str_buf[1];); it should be at least 5 bytes long (4 for "0xXX" plus the '\0').
Moreover, change
sprintf(str_buf,"0x%x",raw_buf);
to
sprintf(str_buf,"0x%x",*raw_buf);
otherwise you'll print the value of the pointer raw_buf decays to, instead of the byte it points at (which you obtain by dereferencing the pointer).
Finally, make sure raw_buf is unsigned. The standard specifies that the signedness of char (where not explicitly specified) is implementation-defined, i.e., every implementation decides whether it should be signed or unsigned. In practice, on most implementations it is signed by default unless you compile with a particular flag. When dealing with bytes, always make sure they are unsigned; otherwise you'll get surprising results should you want to convert them to integers.
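To illustrate that surprise, a small sketch, assuming a platform where plain char is signed (as on most x86 ABIs):

#include <stdio.h>

int main(void)
{
    char sc = '\xe3';           /* plain char, signed on most platforms */
    unsigned char uc = '\xe3';

    printf("%x\n", sc);         /* typically prints ffffffe3 (sign-extended) */
    printf("%x\n", uc);         /* prints e3 */
    return 0;
}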
Using the information from the various responses above (thank you all!), I would like to post this piece of code, which is a trimmed-down version of what I finally used.
There is however a difference between what the following code does and what was described in my original question: this code does not read the first byte of the binary file header as described originally, but instead reads the 11th and 12th bytes (offsets 10 and 11) of the input binary file (a .DBF file). The 11th and 12th bytes contain the length of a data record (which is what I actually want to know), with the least significant byte first: for example, if the 11th and 12th bytes are 0x06 and 0x08 respectively, then the length of a data record is 0x0806 bytes, or 2054 bytes in decimal.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    int fd;
    unsigned dec;
    unsigned char c[2];
    char hex_buf[6];

    /* No error checking, etc. done here for brevity */

    /* Open the file given as the input argument */
    fd = open(argv[1], O_RDONLY);

    /* Position ourselves on the 11th byte aka offset 10 of the input file */
    lseek(fd, 10, SEEK_SET);

    /* Read 2 bytes into memory location c */
    read(fd, c, sizeof(c));

    /* Write the data at c to the buffer hex_buf in the required (reverse) byte order, formatted */
    sprintf(hex_buf, "%.2x%.2x", c[1], c[0]);
    printf("Hexadecimal value:<0x%s>\n", hex_buf);

    /* Parse the hex data in hex_buf back into dec as a decimal integer */
    sscanf(hex_buf, "%x", &dec);
    printf("Answer: Size of a data record=<%u>\n", dec);

    return 0;
}
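Incidentally, the sprintf/sscanf round trip is not needed just to combine the two bytes; since the least significant byte comes first, plain arithmetic does the same job (a sketch under the same assumptions as the code above):

unsigned record_len = (unsigned) c[0] | ((unsigned) c[1] << 8); /* LSB first */
printf("Answer: Size of a data record=<%u>\n", record_len);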

Is it possible to use a Unicode "argv"?

I'm writing a little wrapper for an application that uses files as arguments.
The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I have. Now I've run into a problem: I need to have the arguments of the program in an array of wchar_t's and in a wchar_t string.
Is it possible? I'm defining the main function as
int main(int argc, char *argv[])
Should I use wchar_t's for argv?
Thank you very much; I can't seem to find useful info on how to use Unicode properly in C.
In general, no. It will depend on the OS, but the C standard says that the arguments to main() must be main(int argc, char **argv) or equivalent, so unless char and wchar_t are the same basic type, you can't do it.
Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
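For instance, assuming a UTF-8 locale, a minimal sketch of that convert-and-carry-on approach using the standard mbstowcs(3):

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL, "");  /* adopt the environment's (e.g. UTF-8) locale */

    for (int i = 0; i < argc; i++) {
        size_t n = mbstowcs(NULL, argv[i], 0);   /* measure first */
        if (n == (size_t) -1)
            continue;                            /* invalid multibyte sequence */
        wchar_t *w = malloc((n + 1) * sizeof *w);
        if (!w)
            return 1;
        mbstowcs(w, argv[i], n + 1);             /* convert, including the '\0' */
        wprintf(L"argv[%d] = %ls\n", i, w);
        free(w);
    }
    return 0;
}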
On a Mac (10.5.8, Leopard), I got:
Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:
That's all UTF-8 encoded. (odx is a hex dump program).
See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment
Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style wchar_t[] array, even if the app is not compiled for Unicode.
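A minimal sketch of that Windows-specific route (it needs to be linked against shell32):

#include <windows.h>
#include <shellapi.h>
#include <wchar.h>

int main(void)
{
    int argcW;
    /* Re-parse the original wide command line into an argv-style array */
    wchar_t **argvW = CommandLineToArgvW(GetCommandLineW(), &argcW);
    if (argvW != NULL) {
        for (int i = 0; i < argcW; i++)
            wprintf(L"arg %d: %ls\n", i, argvW[i]);
        LocalFree(argvW);   /* the array is a single LocalAlloc'd block */
    }
    return 0;
}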
On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.
Assuming that your Linux environment uses UTF-8 encoding, the following code will prepare your program for easy Unicode treatment in C++:
#include <clocale>

int main(int argc, char * argv[]) {
    std::setlocale(LC_CTYPE, "");
    // ...
}
Next, the wchar_t type is 32-bit on Linux, which means it can hold individual Unicode code points, and you can safely use the wstring type for classical string processing in C++ (character by character). With the setlocale call above, inserting into wcout will automatically translate your output into UTF-8, and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that the argv[i] strings are still UTF-8 encoded.
You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted, it will return the properly converted characters up to the place where the UTF-8 rules were broken. You could improve it if you need more error reporting, but for argv data one can safely assume the input is correct UTF-8:
#include <string>
using std::wstring;

#define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))

wstring Convert(const char * s) {
    typedef unsigned char byte;
    struct Level {
        byte Head, Data, Null;
        Level(byte h, byte d) {
            Head = h;      // the head shifted to the right
            Data = d;      // number of data bits
            Null = h << d; // encoded byte with zero data bits
        }
        bool encoded(byte b) { return b>>Data == Head; }
    }; // struct Level
    Level lev[] = {
        Level(2, 6),   // 10xxxxxx: continuation byte
        Level(6, 5),   // 110xxxxx: 2-byte sequence head
        Level(14, 4),  // 1110xxxx: 3-byte sequence head
        Level(30, 3),  // 11110xxx: 4-byte sequence head
        Level(62, 2),
        Level(126, 1)
    };
    wchar_t wc = 0;
    const char * p = s;
    wstring result;
    while (*p != 0) {
        byte b = *p++;
        if (b>>7 == 0) { // deal with ASCII
            wc = b;
            result.push_back(wc);
            continue;
        } // ASCII
        bool found = false;
        for (int i = 1; i < ARR_LEN(lev); ++i) {
            if (lev[i].encoded(b)) {
                wc = b ^ lev[i].Null; // remove the head
                wc <<= lev[0].Data * i;
                for (int j = i; j > 0; --j) { // trailing bytes
                    if (*p == 0) return result; // unexpected
                    b = *p++;
                    if (!lev[0].encoded(b)) // encoding corrupted
                        return result;
                    wchar_t tmp = b ^ lev[0].Null;
                    wc |= tmp << lev[0].Data*(j-1);
                } // trailing bytes
                result.push_back(wc);
                found = true;
                break;
            } // lev[i]
        } // for lev
        if (!found) return result; // encoding incorrect
    } // while
    return result;
} // wstring Convert
On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR *argv[] if _UNICODE is defined, and char *argv[] if not.
If you want to have your main method work cross platform, you can define your own macros to the same effect.
TCHAR.h contains a number of convenience macros for conversion between wchar and char.
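For illustration only, such macros might look like the sketch below; the names PROGRAM_MAIN and arg_char are invented, and the Windows side relies on the MSVC _tmain convention:

#ifdef _WIN32
  #include <tchar.h>
  #define PROGRAM_MAIN _tmain        /* expands to wmain or main under MSVC */
  typedef TCHAR arg_char;
#else
  #define PROGRAM_MAIN main
  typedef char arg_char;
#endif

int PROGRAM_MAIN(int argc, arg_char *argv[])
{
    /* argv is wide on Unicode Windows builds, narrow elsewhere */
    (void)argc; (void)argv;
    return 0;
}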
