How to dump utf8 from octal ISO-8859-1 in C

I'm trying to output the right characters in UTF-8 given the following octal sequences, \303\255 and \346\234\254, but I don't get the correct output.
#include <stdio.h>
#include <stdlib.h>

int encode(char *buf, unsigned char ch){
    if(ch < 0x80) {
        *buf++ = (char)ch;
        return 1;
    }
    if(ch < 0x800) {
        *buf++ = (ch >> 6) | 0xC0;
        *buf++ = (ch & 0x3F) | 0x80;
        return 2;
    }
    if(ch < 0x10000) {
        *buf++ = (ch >> 12) | 0xE0;
        *buf++ = ((ch >> 6) & 0x3F) | 0x80;
        *buf++ = (ch & 0x3F) | 0x80;
        return 3;
    }
    if(ch < 0x110000) {
        *buf++ = (ch >> 18) | 0xF0;
        *buf++ = ((ch >> 12) & 0x3F) | 0x80;
        *buf++ = ((ch >> 6) & 0x3F) | 0x80;
        *buf++ = (ch & 0x3F) | 0x80;
        return 4;
    }
    return 0;
}

void output (char *str) {
    char *buffer = calloc(8, sizeof(char));
    int n = 0;
    while(*str) {
        n = encode(buffer + n, *str++);
    }
    printf("%s\n", buffer);
    free (buffer);
}

int main() {
    char *str1 = "\303\255";
    char *str2 = "\346\234\254";
    output(str1);
    output(str2);
    return 0;
}
Outputs: í & æ¬ instead of í & 本

The problem is that the byte sequences you use are already UTF-8:
/* Both of these are already UTF-8 chars. */
char *str1 = "\303\255";
char *str2 = "\346\234\254";
So your encode function is trying to re-encode text that is already UTF-8, which cannot work.
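To see why, run the first byte of í's UTF-8 form (0xC3) through the two-byte branch of your encoder; it gets re-encoded instead of passed through. A minimal sketch of mine, not from the original answer:

#include <stdio.h>

int main(void) {
    unsigned char ch = 0303;              /* 0xC3, first byte of "í" in UTF-8 */
    unsigned char buf[2];
    buf[0] = (ch >> 6) | 0xC0;            /* 0xC3 again, by coincidence */
    buf[1] = (ch & 0x3F) | 0x80;          /* 0x83 */
    printf("%#x %#x\n", buf[0], buf[1]);  /* C3 83: displays as mojibake */
    return 0;
}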
When I print these sequences in my UTF-8 enabled terminal I see what you are expecting to see:
$ printf "%s\n" $'\303\255'
í
$ printf "%s\n" $'\346\234\254'
本
So maybe you need to rethink what you are trying to accomplish and post a new question if you have new problems there.
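The same check in C is just writing the bytes out untouched; a minimal sketch of mine, assuming a UTF-8 terminal:

#include <stdio.h>

int main(void) {
    /* Already UTF-8: write the bytes untouched. */
    fputs("\303\255\n", stdout);      /* í */
    fputs("\346\234\254\n", stdout);  /* 本 */
    return 0;
}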

It's a pity, but you cannot meaningfully compare a char value (be it signed or unsigned) against 0x100 or more: the result is the same for every ch, so branches like ch < 0x800 can never do their job. You are missing something if you try to convert one-byte (ISO-8859-1) values to UTF-8 that way. The ISO-8859-1 characters have the same code values as their Unicode counterparts, so the conversion is fairly straightforward, as will be shown below.
First of all, since every ISO-8859-1 character keeps its code value in Unicode, the first transformation is the identity: we map each ISO-8859-1 value to the same UTF value. (Note that when I say UTF I mean the Unicode code point of the character, with no encoding applied, as opposed to UTF-8, which is an encoding of code points into eight-bit bytes.)
UTF values in the range 0x80...0xff must be encoded with two bytes: the first byte uses bits 7 and 6 with the pattern 110000xx, where xx are the two most significant bits of the input code, followed by a second byte with pattern 10xxxxxx, where xxxxxx are the six least significant bits (bits 5 to 0) of the input code. UTF values in the range 0x00...0x7f are encoded as a single byte equal to the UTF code itself.
The following function does precisely this:
size_t iso2utf( unsigned char *buf, unsigned char iso )
{
    size_t res = 0;
    if ( iso & 0x80 ) {
        *buf++ = 0xc0 | (iso >> 6);   /* the 110000xx part */
        *buf++ = 0x80 | (iso & 0x3f); /* ... and the 10xxxxxx part. */
        res += 2;
    } else {
        *buf++ = iso; /* a 0xxxxxxx character, untouched. */
        res++;
    }
    *buf = '\0';
    return res;
} /* iso2utf */
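To convert a whole NUL-terminated ISO-8859-1 string, the calls can be chained; this wrapper (the name iso_str2utf8 is mine, not from the original) relies on iso2utf terminating the buffer on every call:

/* out must hold up to 2 bytes per input character, plus the final NUL. */
size_t iso_str2utf8( unsigned char *out, const unsigned char *in )
{
    size_t total = 0;
    while ( *in )
        total += iso2utf( out + total, *in++ );
    return total;
} /* iso_str2utf8 */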
If you want a complete UTF-to-UTF-8 encoder, you can try this (I used a different approach, as there can be as many as seven bytes per UTF code; actually not that many, since currently only codes of up to 24 or 25 bits are used):
#include <string.h>
#include <stdlib.h>

typedef unsigned int UTF;   /* you can use wchar_t if you prefer */
typedef unsigned char BYTE;

/* I will assume that the UTF string is also zero terminated */
size_t utf_utf8 (BYTE *out, UTF *in)
{
    size_t res = 0;
    for (; *in; in++) {
        UTF c = *in; /* copy the UTF value */
        /* we construct the string backwards, so finally
         * we have it properly ordered. */
        size_t n = 0;               /* number of bytes for this character */
        BYTE aux[7],                /* buffer to construct the string */
             *p = aux + sizeof aux; /* point one cell past the end */
        static UTF limits[] = { 0x80, 0x20, 0x10, 0x08, 0x4, 0x2, 0x01 };
        static UTF masks[]  = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe };
        for (; c >= limits[n]; c >>= 6) {
            *--p = 0x80 | (c & 0x3f); n++;
        } /* for */
        *--p = masks[n] | c; n++;
        memcpy(out, p, n); out += n; res += n;
    } /* for */
    *out = '\0'; /* terminate string */
    return res;
} /* utf_utf8 */
Note that the seven-bytes-per-UTF-code maximum is hardwired, which follows from UTF codes being 32-bit integers. I don't expect UTF codes to grow past the 32-bit limit, but in that case the UTF typedef, and the sizes and contents of the aux, limits and masks tables, would have to change accordingly. There's also a maximum of 7 or 8 bytes for a UTF-8 encoding, and the standard does not specify in any form how to proceed should the UTF code space ever run out of codes, so better not to mess with this too much.
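For illustration, a small driver of my own that feeds utf_utf8 the two characters from the question (it assumes the typedefs and function above are in scope):

#include <stdio.h>

int main(void)
{
    UTF in[] = { 0xed, 0x672c, 0 }; /* í and 本, zero terminated */
    BYTE out[16];
    size_t n = utf_utf8(out, in);
    printf("%zu bytes: %s\n", n, out); /* 5 bytes; shows í本 on a UTF-8 terminal */
    return 0;
}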

Useless function parameter: unsigned char ch
/// In the following bad code, `if(ch < 0x10000)` is never true
int encode(char *buf, unsigned char ch){
    if(ch < 0x80) {
        ...
        return 1;
    if(ch < 0x800) {
        ...
        return 2;
    if(ch < 0x10000) {
Note: Code incorrectly does not detect high and low surrogates.
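A sketch of the missing check (my addition, mirroring the test used later in this page's Unicode_CodepointToUTF8 answer):

#include <stdint.h>

/* UTF-16 surrogates U+D800..U+DFFF are not valid code points
 * and must be rejected by a correct UTF-8 encoder. */
int is_valid_codepoint(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}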

Related

Why does multibyte character to char32_t conversion use UTF-8 as the multibyte encoding instead of the locale-specific one?

I have been trying to convert Chinese character input from the Windows command prompt in Big5 to UTF-8, by first converting the received input to char32_t in UTF-32 encoding, then converting that to UTF-8. I've been calling the function mbrtoc32 from <uchar.h> to do this job, however it kept failing with "Encoding error".
These are the conditions I have encountered:
Converting the sequence (Big5) to a wchar_t representation via mbstowcs is successful.
mbrtoc32 takes the multibyte sequence as UTF-8, even though the locale is not. (Set to "", which returns "Chinese (Traditional)_Hong Kong SAR.950" on my machine.)
Below is the code I've been writing to try to debug my problem, with no success. It tries to convert the Chinese character "香" (U+9999) into its multibyte representation, then tries to convert the Big5 encoding of "香" (0xADBB) into wchar_t and char32_t. However, converting from multibyte (Big5) to char32_t returns an encoding error. (By contrast, feeding the UTF-8 sequence of "香" to mbrtoc32 does return 0x9999 successfully.)
#include <uchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <errno.h>   /* for errno */

mbstate_t state;

int main(void){
    setlocale(LC_CTYPE, "");
    printf("Your locale is: %s\n", setlocale(LC_CTYPE, NULL));
    char32_t chi_c = 0x9999;
    printf("Character U+9999 is 香\n");
    char *mbc = (char *)calloc(32, sizeof(char));
    size_t mb_len;
    mb_len = c32rtomb(mbc, chi_c, &state);
    int i;
    printf("The multibyte representation of U+9999 is:\n");
    // 0xE9A699, UTF-8
    for (i = 0; i < mb_len; i++){
        printf("%#2x\t", *(mbc + i));
    }
    char *src_mbs = (char *)calloc(32, sizeof(char));
    // "香" in Big5 encoding
    *(src_mbs + 0) = 0xad;
    *(src_mbs + 1) = 0xbb;
    wchar_t res_wc;
    mbtowc(&res_wc, src_mbs, 32); // Success, res_wc == 0x9999
    char32_t res_c32;
    mb_len = mbrtoc32(&res_c32, src_mbs, (size_t)3, &state);
    // Returns (size_t)-1, encoding error
    if (mb_len == (size_t)-1){
        perror("Encoding error");
        return errno;
    }
    else {
        printf("\nThe 32-bit character representation of U+9999 is:\n%#x", res_c32);
    }
    return 0;
}
I've also read the documentation on cppreference.com, which says:
In any case, the multibyte character encoding used by this function is specified by the currently active C locale.
I expect mbrtoc32 to behave like mbtowc, i.e. to convert the character from the locale-specific encoding to UTF-32 (in this case Big5 to UTF-32).
Is there any solution that uses mbrtoc32 to convert the multibyte character into char32_t without hitting the "Encoding error"?
P.S.: I'm using Mingw-64 on Windows 10, compiled with gcc.
I've found the problem. The Mingw-w64 I'm using expects all multi-byte strings passed to mbrtoc32 and c32rtomb to be in UTF-8 encoding.
Code for mbrtoc32:
size_t mbrtoc32 (char32_t *__restrict__ pc32,
    const char *__restrict__ s,
    size_t n,
    mbstate_t *__restrict__ __UNUSED_PARAM(ps))
{
    if (*s == 0)
    {
        *pc32 = 0;
        return 0;
    }
    /* ASCII character - high bit unset */
    if ((*s & 0x80) == 0)
    {
        *pc32 = *s;
        return 1;
    }
    /* Multibyte chars */
    if ((*s & 0xE0) == 0xC0) /* 110xxxxx needs 2 bytes */
    {
        if (n < 2)
            return (size_t)-2;
        *pc32 = ((s[0] & 31) << 6) | (s[1] & 63);
        return 2;
    }
    else if ((*s & 0xf0) == 0xE0) /* 1110xxxx needs 3 bytes */
    {
        if (n < 3)
            return (size_t)-2;
        *pc32 = ((s[0] & 15) << 12) | ((s[1] & 63) << 6) | (s[2] & 63);
        return 3;
    }
    else if ((*s & 0xF8) == 0xF0) /* 11110xxx needs 4 bytes */
    {
        if (n < 4)
            return (size_t)-2;
        *pc32 = ((s[0] & 7) << 18) | ((s[1] & 63) << 12) | ((s[2] & 63) << 6) | (s[3] & 63);
        return 4;
    }
    errno = EILSEQ;
    return (size_t)-1;
}
and for c32rtomb:
size_t c32rtomb (char *__restrict__ s,
    char32_t c32,
    mbstate_t *__restrict__ __UNUSED_PARAM(ps))
{
    if (c32 <= 0x7F) /* 7 bits needs 1 byte */
    {
        *s = (char)c32 & 0x7F;
        return 1;
    }
    else if (c32 <= 0x7FF) /* 11 bits needs 2 bytes */
    {
        s[1] = 0x80 | (char)(c32 & 0x3F);
        s[0] = 0xC0 | (char)(c32 >> 6);
        return 2;
    }
    else if (c32 <= 0xFFFF) /* 16 bits needs 3 bytes */
    {
        s[2] = 0x80 | (char)(c32 & 0x3F);
        s[1] = 0x80 | (char)((c32 >> 6) & 0x3F);
        s[0] = 0xE0 | (char)(c32 >> 12);
        return 3;
    }
    else if (c32 <= 0x1FFFFF) /* 21 bits needs 4 bytes */
    {
        s[3] = 0x80 | (char)(c32 & 0x3F);
        s[2] = 0x80 | (char)((c32 >> 6) & 0x3F);
        s[1] = 0x80 | (char)((c32 >> 12) & 0x3F);
        s[0] = 0xF0 | (char)(c32 >> 18);
        return 4;
    }
    errno = EILSEQ;
    return (size_t)-1;
}
Both of these functions expect the given multi-byte string to be in UTF-8, without considering the locale settings. On glibc, mbrtoc32 and c32rtomb simply call their wide-character counterparts to convert the characters. Since wide-character conversions do work properly on Mingw-w64, I used mbrtowc and wcrtomb in place of mbrtoc32 and c32rtomb respectively, the way glibc does:
#include <uchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>   /* for mbrtowc and wcrtomb */
#include <errno.h>   /* for errno */

mbstate_t state;

int main(void){
    setlocale(LC_CTYPE, "");
    printf("Your locale is: %s\n", setlocale(LC_CTYPE, NULL));
    char *src_mbs = "\xad\xbb"; // "香" in Big5 encoding
    char32_t src_c32 = 0x9999;  // "香" code point
    unsigned char *r_mbc = (unsigned char *)calloc(32, sizeof(char));
    if (r_mbc == NULL){
        perror("Failed to allocate memory");
        return errno;
    }
    size_t mb_len = wcrtomb((char *)r_mbc, (wchar_t)src_c32, &state); // Produces 0xAD 0xBB, Big5 of "香", OK
    printf("Character U+9999 is %s, ( ", r_mbc);
    for (int i = 0; i < mb_len; i++){
        printf("%#hhx ", *(r_mbc + i));
    }
    printf(")\n");
    // mb_len = c32rtomb(r_mbc, src_c32, &state); // Produces 0xE9A699, UTF-8 representation of "香", expected Big5
    // printf("\nThe multibyte representation of U+9999 is:\n");
    // for (i = 0; i < mb_len; i++){
    //     printf("%#hhX\t", *(r_mbc + i));
    // }
    char32_t r_c32 = 0;
    // mb_len = mbrtoc32(&r_c32, src_mbs, (size_t)3, &state);
    // Returns (size_t)-1, encoding error
    mb_len = mbrtowc((wchar_t *)&r_c32, src_mbs, (size_t)3, &state); // Returns 0x9999, OK
    if (mb_len == (size_t)-1){
        perror("Encoding error");
        return errno;
    }
    else {
        printf("\nThe 32-bit character representation of U+9999 is:\n%#x", r_c32);
    }
    return 0;
}

endian-independent base64_encode/decode function

I was googling around for these two C functions that I happen to need, and the cleanest I came across was http://fm4dd.com/programming/base64/base64_stringencode_c.htm But it looks to me like the following little part of it...
void decodeblock(unsigned char in[], char *clrstr) {
    unsigned char out[4];
    out[0] = in[0] << 2 | in[1] >> 4;
    out[1] = in[1] << 4 | in[2] >> 2;
    out[2] = in[2] << 6 | in[3] >> 0;
    out[3] = '\0';
    strncat(clrstr, out, sizeof(out));
}
...is going to be endian-dependent (ditto the corresponding encodeblock() that you can see at the above URL). But it's otherwise nice and clean, unlike some of the others: one had three of its own header files, another called its own special malloc()-like function, etc. Does anybody know of a nice, small, clean (no headers, no dependencies, etc.) version, like this one, that's more architecture-independent?
Edit: the reason I'm looking for this is that base64_encode() will be done in a PHP script that's part of an HTML page, passing the encoded string to an executed CGI program on a far-away box. That CGI then has to base64_decode() it. So architecture-independence is just added safety, in case the CGI's running on a non-Intel big-endian box (Intel's little-endian).
Edit: as per the comment below, here's the complete code, along with a few changes I made...
/* downloaded from...
   http://fm4dd.com/programming/base64/base64_stringencode_c.htm */
/* ------------------------------------------------------------------------ *
 * file: base64_stringencode.c v1.0                                          *
 * purpose: tests encoding/decoding strings with base64                      *
 * author: 02/23/2009 Frank4DD                                                *
 *                                                                            *
 * source: http://base64.sourceforge.net/b64.c for encoding                   *
 *         http://en.literateprograms.org/Base64_(C) for decoding             *
 * ------------------------------------------------------------------------ */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ---- Base64 Encoding/Decoding Table --- */
char b64[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* decodeblock - decode 4 '6-bit' characters into 3 8-bit binary bytes */
void decodeblock(unsigned char in[], char *clrstr) {
    unsigned char out[4];
    out[0] = in[0] << 2 | in[1] >> 4;
    out[1] = in[1] << 4 | in[2] >> 2;
    out[2] = in[2] << 6 | in[3] >> 0;
    out[3] = '\0';
    strncat(clrstr, out, sizeof(out));
} /* --- end-of-function decodeblock() --- */

char *base64_decode(char *b64src /*, char *clrdst */) {
    static char clrdstbuff[8192];
    char *clrdst = clrdstbuff;
    int c, phase, i;
    unsigned char in[4];
    char *p;
    clrdst[0] = '\0';
    phase = 0; i = 0;
    while(b64src[i]) {
        c = (int) b64src[i];
        if(c == '=') {
            decodeblock(in, clrdst);
            break; }
        p = strchr(b64, c);
        if(p) {
            in[phase] = p - b64;
            phase = (phase + 1) % 4;
            if(phase == 0) {
                decodeblock(in, clrdst);
                in[0]=in[1]=in[2]=in[3]=0; }
        } /* --- end-of-if(p) --- */
        i++;
    } /* --- end-of-while(b64src[i]) --- */
    return ( clrdstbuff );
} /* --- end-of-function base64_decode() --- */

/* encodeblock - encode 3 8-bit binary bytes as 4 '6-bit' characters */
void encodeblock( unsigned char in[], char b64str[], int len ) {
    unsigned char out[5];
    out[0] = b64[ in[0] >> 2 ];
    out[1] = b64[ ((in[0] & 0x03) << 4) | ((in[1] & 0xf0) >> 4) ];
    out[2] = (unsigned char) (len > 1 ? b64[ ((in[1] & 0x0f) << 2) |
             ((in[2] & 0xc0) >> 6) ] : '=');
    out[3] = (unsigned char) (len > 2 ? b64[ in[2] & 0x3f ] : '=');
    out[4] = '\0';
    strncat(b64str, out, sizeof(out));
} /* --- end-of-function encodeblock() --- */

/* encode - base64 encode a stream, adding padding if needed */
char *base64_encode(char *clrstr /*, char *b64dst */) {
    static char b64dstbuff[8192];
    char *b64dst = b64dstbuff;
    unsigned char in[3];
    int i, len = 0;
    int j = 0;
    b64dst[0] = '\0';
    while(clrstr[j]) {
        len = 0;
        for(i = 0; i < 3; i++) {
            in[i] = (unsigned char) clrstr[j];
            if(clrstr[j]) {
                len++; j++; }
            else in[i] = 0;
        } /* --- end-of-for(i) --- */
        if( len ) {
            encodeblock( in, b64dst, len ); }
    } /* --- end-of-while(clrstr[j]) --- */
    return ( b64dstbuff );
} /* --- end-of-function base64_encode() --- */
#ifdef TESTBASE64
int main( int argc, char *argv[] ) {
    char *mysrc = (argc>1? argv[1] : "My bonnie is over the ocean ");
    char *mysrc2 = (argc>2? argv[2] : "My bonnie is over the sea ");
    char myb64[2048]="", myb642[2048]="";
    char mydst[2048]="", mydst2[2048]="";
    char *base64_encode(), *base64_decode();
    int testnum = 1;
    if ( strncmp(mysrc,"test",4) == 0 )
        testnum = atoi(mysrc+4);
    if ( testnum == 1 ) {
        strcpy(myb64,base64_encode(mysrc));
        printf("The string [%s]\n\tencodes into base64 as: [%s]\n",mysrc,myb64);
        strcpy(myb642,base64_encode(mysrc2));
        printf("The string [%s]\n\tencodes into base64 as: [%s]\n",mysrc2,myb642);
        printf("...\n");
        strcpy(mydst,base64_decode(myb64));
        printf("The string [%s]\n\tdecodes from base64 as: [%s]\n",myb64,mydst);
        strcpy(mydst2,base64_decode(myb642));
        printf("The string [%s]\n\tdecodes from base64 as: [%s]\n",myb642,mydst2);
    } /* --- end-of-if(testnum==1) --- */
    if ( testnum == 2 ) {
        strcpy(mydst,base64_decode(mysrc2)); /* input is b64 */
        printf("The string [%s]\n\tdecodes from base64 as: [%s]\n",mysrc2,mydst);
    } /* --- end-of-if(testnum==2) --- */
    if ( testnum == 3 ) {
        int itest, ntests = (argc>2?atoi(argv[2]):999);
        int ichar, nchars = (argc>3?atoi(argv[3]):128);
        unsigned int seed = (argc>4?atoi(argv[4]):987654321);
        char blanks[999] = " ";
        srand(seed);
        for ( itest=1; itest<=ntests; itest++ ) {
            for ( ichar=0; ichar<nchars; ichar++ ) mydst[ichar] = 1+(rand()%255);
            mydst[nchars] = '\000';
            if ( strlen(blanks) > 0 ) strcat(mydst,blanks);
            strcpy(myb64,base64_encode(mydst));
            strcpy(mydst2,base64_decode(myb64));
            if ( strcmp(mydst,mydst2) != 0 )
                printf("Test#%d:\n\t in=%s\n\tout=%s\n",itest,mydst,mydst2);
        } /* --- end-of-for(itest) --- */
    } /* --- end-of-if(testnum==3) --- */
    return 0;
} /* --- end-of-function main() --- */
#endif
No, it is not endian-dependent. Base64 in itself maps 3 bytes to 4 characters (and 4 characters back to 3 bytes when decoding), and doesn't care about the actual representation in memory. However, if you expect to transfer little/big-endian data, you must normalize the endianness before encoding and after decoding.
That fragment just addresses each byte independently. It would be endian-dependent if it loaded 4 bytes into a uint32_t or similar and, using some bit twiddling, produced an output that was copied into the result buffer as-is.
However, that code is dangerously broken with its strncat and wouldn't work with embedded NUL bytes. Instead you should use something like:
void decodeblock(unsigned char in[], unsigned char **clrstr) {
    *((*clrstr) ++) = in[0] << 2 | in[1] >> 4;
    *((*clrstr) ++) = in[1] << 4 | in[2] >> 2;
    *((*clrstr) ++) = in[2] << 6 | in[3] >> 0;
}
which would work with embedded NULs.
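To see why the length must then be tracked by the caller instead of relying on NUL termination, here's a runnable sketch of mine (not from the linked page): decoding the group "AAAA" yields three 0x00 bytes, exactly the case strncat silently drops.

#include <stdio.h>

/* The pointer-based decodeblock from above, repeated to keep this runnable. */
static void decodeblock(unsigned char in[], unsigned char **clrstr) {
    *((*clrstr)++) = in[0] << 2 | in[1] >> 4;
    *((*clrstr)++) = in[1] << 4 | in[2] >> 2;
    *((*clrstr)++) = in[2] << 6 | in[3] >> 0;
}

int main(void) {
    unsigned char in[4] = {0, 0, 0, 0}; /* "AAAA" after table lookup */
    unsigned char out[8];
    unsigned char *cursor = out;
    decodeblock(in, &cursor);
    printf("decoded %zu bytes\n", (size_t)(cursor - out)); /* 3, not 0 */
    return 0;
}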
In terms of endianness, and having code compatible across platforms of differing endianness...
Firstly, there is the endianness of the processing platform's hardware, the endianness of the data being transmitted, and the endianness of the base64 encoding/decoding process.
The endianness of the base64 coding determines whether, to form the first character, we take the lower 6 bits of the first byte or the upper 6 bits of the first byte. Base64 uses the latter, which is the big-endian convention.
You need your encoder and decoder to match regardless of platform, so the code you show, with its fixed bit-shifting, is already going to work on both big- and little-endian platforms. You don't want your little-endian platform to use little-endian bit shifting by placing the lower 6 bits of the first byte into the first encoded character; if it did, it would not be compatible with other platforms. So in this case you don't want platform-dependent code.
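As a concrete check of that bit layout, here is the classic "Man" → "TWFu" example worked with the same shifts (my illustration, not from the original answer):

#include <stdio.h>

static const char b64tab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int main(void) {
    /* The shifts act on byte values, never on a multi-byte integer's
     * memory layout, so the result is the same on any platform. */
    unsigned char in[3] = { 'M', 'a', 'n' };
    putchar(b64tab[in[0] >> 2]);                            /* 'T' */
    putchar(b64tab[((in[0] & 0x03) << 4) | (in[1] >> 4)]);  /* 'W' */
    putchar(b64tab[((in[1] & 0x0f) << 2) | (in[2] >> 6)]);  /* 'F' */
    putchar(b64tab[in[2] & 0x3f]);                          /* 'u' */
    putchar('\n');
    return 0;
}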
However, when it comes to the data itself, you may need to convert the endianness; but do this on the binary data, not as part of the base64 coding or of the encoded text.
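For instance, if the binary payload contains multi-byte integers, you might pin them to network byte order before encoding and restore them after decoding; a sketch assuming the POSIX htonl/ntohl functions:

#include <arpa/inet.h> /* htonl/ntohl; winsock2.h on Windows */
#include <stdint.h>
#include <string.h>

/* Pin a 32-bit value to network byte order before base64-encoding
 * the buffer; reverse with ntohl after decoding. */
void pack_u32(unsigned char buf[4], uint32_t v) {
    uint32_t be = htonl(v);
    memcpy(buf, &be, sizeof be);
}

uint32_t unpack_u32(const unsigned char buf[4]) {
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);
}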

How to print Unicode codepoints as characters in C?

I have an array of uint32_t elements that each store a codepoint for a non-latin Unicode character. How do I print them on the console or store them in a file as UTF-8 encoded characters? I understand that they may fail to render properly on a console, but they should display fine if I open them in a compatible editor.
I have tried using wprintf(L"%lc", UINT32_T_VARIABLE), and fwprintf(FILE_STREAM, L"%lc", UINT32_T_VARIABLE) but to no avail.
You must first select the proper locale with:
#include <locale.h>
setlocale(LC_ALL, "C.UTF-8");
or
setlocale(LC_ALL, "en_US.UTF-8");
And then use printf or fprintf with the %lc format:
printf("%lc", UINT32_T_VARIABLE);
This will work only for Unicode code points small enough to fit in a wchar_t. For a more complete and portable solution, you may need to implement the Unicode-to-UTF-8 conversion yourself, which is not very difficult.
Best to use existing code when available.
Rolling one's own Unicode code-point-to-UTF-8 conversion is simple, yet easy to mess up. This answer took 2 edits to fix (thanks @Jonathan Leffler, @chqrlie), so rigorous testing is recommended for any self-coded solution. What follows is lightly tested code to convert a code point to an array.
Note that the result is not a string.
#include <stdint.h>

// Populate utf8 with 0-4 bytes
// Return length used in utf8[]
// 0 implies bad codepoint
unsigned Unicode_CodepointToUTF8(uint8_t *utf8, uint32_t codepoint) {
    if (codepoint <= 0x7F) {
        utf8[0] = codepoint;
        return 1;
    }
    if (codepoint <= 0x7FF) {
        utf8[0] = 0xC0 | (codepoint >> 6);
        utf8[1] = 0x80 | (codepoint & 0x3F);
        return 2;
    }
    if (codepoint <= 0xFFFF) {
        // detect surrogates
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF) return 0;
        utf8[0] = 0xE0 | (codepoint >> 12);
        utf8[1] = 0x80 | ((codepoint >> 6) & 0x3F);
        utf8[2] = 0x80 | (codepoint & 0x3F);
        return 3;
    }
    if (codepoint <= 0x10FFFF) {
        utf8[0] = 0xF0 | (codepoint >> 18);
        utf8[1] = 0x80 | ((codepoint >> 12) & 0x3F);
        utf8[2] = 0x80 | ((codepoint >> 6) & 0x3F);
        utf8[3] = 0x80 | (codepoint & 0x3F);
        return 4;
    }
    return 0;
}
// Sample usage
uint32_t cp = foo();
uint8_t utf8[4];
unsigned len = Unicode_CodepointToUTF8(utf8, cp);
if (len == 0) Handle_BadCodePoint();
size_t y = fwrite(utf8, 1, len, stream_opened_in_binary_mode);
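If a printable string is needed (recall the note above that the result is not a string), a short sketch of mine:

uint8_t utf8s[4 + 1];                               // longest encoding + NUL
unsigned len2 = Unicode_CodepointToUTF8(utf8s, cp);
if (len2 > 0) {
    utf8s[len2] = '\0';                             // now a valid C string
    printf("%s\n", utf8s);                          // fine on a UTF-8 terminal
}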

Creating file that uses UTF-8 encoding

I am trying to create a file and encode its content in UTF-8 using C. I have tried several things and looked around, but I cannot seem to find a solution to the problem.
This is the code I am currently trying (the u8_wc_toutf8 function is taken from here):
int u8_wc_toutf8(char *dest, u_int32_t ch)
{
    if (ch < 0x80) {
        dest[0] = (char)ch;
        return 1;
    }
    if (ch < 0x800) {
        dest[0] = (ch>>6) | 0xC0;
        dest[1] = (ch & 0x3F) | 0x80;
        return 2;
    }
    if (ch < 0x10000) {
        dest[0] = (ch>>12) | 0xE0;
        dest[1] = ((ch>>6) & 0x3F) | 0x80;
        dest[2] = (ch & 0x3F) | 0x80;
        return 3;
    }
    if (ch < 0x110000) {
        dest[0] = (ch>>18) | 0xF0;
        dest[1] = ((ch>>12) & 0x3F) | 0x80;
        dest[2] = ((ch>>6) & 0x3F) | 0x80;
        dest[3] = (ch & 0x3F) | 0x80;
        return 4;
    }
    return 0;
}

int main ()
{
    printf(setlocale(LC_ALL, "")); //Prints C.UTF-8
    FILE * fout;
    fout = fopen("out.txt","w");
    u_int32_t c = 'Å';
    char convertedChar[6];
    int cNum = u8_wc_toutf8(convertedChar, c);
    printf(convertedChar); //Prints ?
    fprintf(fout, convertedChar);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}
When I run this from the command prompt in Windows it prints ? and when I open the file created I get some weird characters. If I check the encoding in Firefox on the file it says:
"windows-1252"
Are there any better ways to check the encoding of the file?
Any tips to point me in the right direction would be really nice, it feels like this should not be that hard to do.
You should allocate the memory for convertedChar and set c to 197, which is the Unicode code point of the angstrom character (Å). You can then encode this character in UTF-8 or anything else you want:
int main ()
{
    FILE * fout;
    fout = fopen("out.txt","wb");
    u_int32_t c = 197; // Or 0xC5
    char convertedChar[4];
    int cNum = u8_wc_toutf8(convertedChar, c);
    fwrite(convertedChar, sizeof(char), cNum, fout);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}
And if, for example, your locale uses UTF-8 encoding, you can then use this to print the character on your console:
wchar_t wc;
mbtowc(&wc, convertedChar, sizeof(wchar_t));
putwc(wc, stdout);
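As for a better way to check the file's encoding than Firefox: dump the raw bytes. For Å the file must contain exactly the two bytes 0xC3 0x85 if it is UTF-8. A quick sketch of mine:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("out.txt", "rb");
    if (f == NULL) return 1;
    int byte;
    while ((byte = fgetc(f)) != EOF)
        printf("%#04x ", byte);  /* expect: 0xc3 0x85 */
    printf("\n");
    fclose(f);
    return 0;
}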

Mapping multibyte characters to their unicode point representation

How do you map a single UTF-8 character to its unicode point in C?
[For example, È would be mapped to 00c8].
If your platform's wchar_t stores Unicode (if it's a 32-bit type, it probably does) and you have a UTF-8 locale, you can call mbrtowc (from C90 Amendment 1).
mbstate_t state = {0};
wchar_t wch;
char s[] = "\303\210";
size_t n;

setlocale(LC_CTYPE, "en_US.utf8"); /* error checking omitted */
n = mbrtowc(&wch, s, strlen(s), &state);
if (n < (size_t)-2) printf("%lx\n", (unsigned long)wch); /* -1 is error, -2 incomplete */
For more flexibility, you can call the iconv interface.
#include <iconv.h>

char s[] = "\303\210";
/* iconv_open(tocode, fromcode); plain "UCS-4" is big-endian, hence UCS-4BE */
iconv_t cd = iconv_open("UCS-4BE", "UTF-8");
if (cd != (iconv_t)-1) {
    char *inp = s;
    size_t ins = strlen(s);
    unsigned char buf[4];
    char *outp = (char *)buf;
    size_t outs = sizeof buf;
    if (iconv(cd, &inp, &ins, &outp, &outs) != (size_t)-1) {
        unsigned long c = ((unsigned long)buf[0] << 24) | ((unsigned long)buf[1] << 16)
                        | ((unsigned long)buf[2] << 8) | buf[3];
        printf("%lx\n", c);
    }
    iconv_close(cd);
}
Some things to look at:
libiconv
ConvertUTF.h
MultiByteToWideChar (under windows)
A reasonably fast implementation of a UTF-8 to UCS-2 converter. Surrogates and characters outside the BMP are left as an exercise.
The function returns the number of bytes consumed from the input string s. A negative value represents an error.
The resulting Unicode character is stored at the address p points to.
int utf8_to_wchar(wchar_t *p, const char *s)
{
    const unsigned char *us = (const unsigned char *)s;
    p[0] = 0;
    if(!*us)
        return 0;
    else if(us[0] < 0x80) {
        p[0] = us[0];
        return 1;
    }
    else if(((us[0] & 0xE0) == 0xC0) && (us[1] & 0xC0) == 0x80) {
        p[0] = ((us[0] & 0x1F) << 6) | (us[1] & 0x3F);
#ifdef DETECT_OVERLONG
        if(p[0] < 0x80) return -2;
#endif
        return 2;
    }
    else if(((us[0] & 0xF0) == 0xE0) && (us[1] & 0xC0) == 0x80 && (us[2] & 0xC0) == 0x80) {
        p[0] = ((us[0] & 0x0F) << 12) | ((us[1] & 0x3F) << 6) | (us[2] & 0x3F);
#ifdef DETECT_OVERLONG
        if(p[0] < 0x800) return -2;
#endif
        return 3;
    }
    return -1;
}
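A quick driver matching the question's È example (my addition; it assumes the function above is in scope):

#include <stdio.h>

int main(void)
{
    wchar_t wc;
    int n = utf8_to_wchar(&wc, "\303\210");   /* UTF-8 bytes of È */
    if(n > 0)
        printf("%04lx\n", (unsigned long)wc); /* prints 00c8 */
    return 0;
}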
