I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.
I'm having massive problems storing individual characters and running functions on them. My editor and console are both set to UTF-8 and display Arabic text fine if I save it as a char*, but when I try to print wchars I get random punctuation marks instead.
My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], 'h' as syllable[1]segment[1]letter[2], etc. I want to be able to do the same for non-ASCII characters.
I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.
I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...
I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.
My understanding of how unicode works (in case that's where I'm going wrong):
I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب as the 2-byte sequence 0xd8 0xa8, which represents the code point U+0628.
I compile it, breaking 0xd8 0xa8 down into the binary 11011000 10101000.
I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which form of U+0628 to display, as the character has different shapes depending on where it is in the word. As the character is alone here, it will show me the standalone form ب
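(To sanity-check that understanding, here is a tiny program that I believe should dump the raw bytes of ب - the only assumption is a compiler with C11 u8 string literal support:)

#include <stdio.h>

int main(void)
{
    const char *beh = u8"\u0628";                      /* ب as a UTF-8 literal */
    for (const unsigned char *p = (const unsigned char *)beh; *p; p++)
        printf("0x%02x ", *p);                         /* expected: 0xd8 0xa8 */
    printf("\n");
    return 0;
}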
My understanding of the ways I can process unicode in C:
Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)
Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard-code a unicode character, enter it as an array in the format:
const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";
My problems with this:
I need to manipulate individual characters
Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.
Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)
Swap chars for wchars, which hold 2 or 4 bytes depending on the platform. String functions like strlen will not work, as they expect characters to be one byte, but there are wide equivalents like wcslen and wprintf I can use instead.
My problem with this:
I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.
I've tried inputting the unicode code point as well as the actual Arabic character, and I've tried printing them both to the console and to a UTF-8 encoded text file, and I get the same result, even though both the console and the text file display Arabic text if it's entered as a char*. I've included my code at the end.
(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things are really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)
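(For reference, the closest thing I've found to a "standard" wide-output pattern is sketched below: call setlocale() before any wide printing, and use the standard %lc conversion rather than Microsoft's %C. The wchar_t sample program further down this page calls setlocale the same way. Whether this alone is enough on a Windows console, I honestly don't know.)

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");           /* adopt the environment's locale */
    wchar_t beh = L'\u0628';         /* ب by code point, not by raw bytes */
    wprintf(L"wide char: %lc\n", beh);
    return 0;
}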
Option C - Use external libraries
I've read in various comments that external libraries are the way to go so I've tried:
C programming library
http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.
My problem:
While I can store the character as an unsigned long integer, I can't print it out: the printf and wprintf functions don't work on it, and neither does the library provided on the website. (I think maybe the library was designed for Linux? Some of the datatypes are invalid, and amending them didn't work either.)
ICU library
My problem:
I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the CharacterIterator is not available in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.
My code
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>
int main()
{
    wchar_t unicode = L'\xd8ac';
    wchar_t arabic = L'ب';
    wchar_t number = 0x062c;
    FILE* f;
    f = fopen("unitest.txt","w");
    char* string = "ايه الاخبار";

    //printf - works
    printf("printf - literal arabic character is \"م\"\n");
    fprintf(f,"printf - literal arabic character is \"م\"\n");
    printf("printf - char* string is \"%s\"\n",string);
    fprintf(f,"printf - char* string is \"%s\"\n",string);

    //wprintf - english - works
    wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
    fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

    //wprintf - arabic - doesn't work
    wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
    fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

    fclose(f);
    return 0;
}
Output file
printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"
wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""
I'm using Windows 10, Notepad++ and MinGW.
Edit
This got marked as a duplicate of Light C Unicode Library, but I don't think that really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in it, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...
I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint32_t or something instead of wchar - but how do I then print that datatype? Can I do it with wprintf?!
C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.
Is it ... possible to store and process individual UTF-8 characters ...?
The first step is to make certain "ايه الاخبار" is a UTF-8 encoded string. C supports this explicitly (since C11) with u8"ايه الاخبار".
A UTF-8 string is a sequence of char; each 1 to 4 char represents one Unicode character. A type that holds any Unicode code point needs at least 21 bits. Yet OP does not need to convert portions of string[] into numeric code points so much as to segment that string on UTF-8 character boundaries. Those boundaries are readily found by looking for UTF-8 continuation bytes (bytes of the form 10xxxxxx).
The following forms, one character at a time, a single Unicode character encoded as a UTF-8 string with an accompanying terminating null character; then that short string is printed.
#include <stdio.h>

int main(void)
{
    char *string = u8"ايه الاخبار";
    for (char *s = string; *s; ) {
        printf("<");
        char u[5];                         /* up to 4 UTF-8 bytes + null */
        char *p = u;
        *p++ = *s++;                       /* lead byte */
        /* copy up to 3 continuation bytes (10xxxxxx) */
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        *p = 0;
        printf("%s", u);
        printf(">\n");
    }
    return 0;
}
With the output viewed on a UTF-8 aware screen:
<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
An example that iterates with the utf8proc library:
#include <utf8proc.h>
#include <stdio.h>

int main(void) {
    utf8proc_uint8_t const string[] = u8"ايه الاخبار";
    utf8proc_ssize_t size = sizeof string / sizeof *string - 1;
    utf8proc_int32_t data;
    utf8proc_ssize_t n;
    utf8proc_uint8_t const *pstring = string;

    while ((n = utf8proc_iterate(pstring, size, &data)) > 0) {
        printf("<%.*s>\n", (int)n, pstring);
        pstring += n;
        size -= n;
    }
}
This is probably not the best way to use this library - I've opened an issue on GitHub asking for some examples, because without them I find it hard to understand how the library is meant to be used.
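(For anyone trying it: assuming utf8proc is installed system-wide, something like cc example.c -lutf8proc should build it, though the exact flags depend on your setup.)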
You need to very clearly understand the difference between a Unicode code point and UTF-8. UTF-8 is a variable-length byte encoding of Unicode code points. The lower end, values 0-127, is stored as a single byte. That's the main point of UTF-8, and it makes the encoding backwards compatible with ASCII.
When bit 7 is set, for values over 127, a variable-length code of two bytes or more is used. The leading byte then always has the bit pattern 11xxxxxx, and each continuation byte the pattern 10xxxxxx.
Here's code to get the skip (the number of bytes used by the current character), and also to read a code point and to write one.
static const unsigned int offsetsFromUTF8[6] =
{
0x00000000UL, 0x00003080UL, 0x000E2080UL,
0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
static const unsigned char trailingBytesForUTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char)*utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;

    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch >> 6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch >> 12) | 0xE0;
        *dest++ = ((ch >> 6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch >> 18) | 0xF0;
        *dest++ = ((ch >> 12) & 0x3F) | 0x80;
        *dest++ = ((ch >> 6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;

    return dest - out;
}
Using these functions or similar, you can convert between UTF-8 and code points in both directions.
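As a quick illustration (my sketch, not part of the original three functions), a driver loop that walks a UTF-8 string code point by code point and re-encodes each one might look like this:

#include <stdio.h>

int main(void)
{
    const char *s = u8"ابت";            /* any UTF-8 text */
    while (*s) {
        int cp = bbx_utf8_getch(s);     /* decode one code point */
        char buf[5];
        int n = bbx_utf8_putch(buf, cp);
        buf[n] = 0;                     /* re-encoded UTF-8 bytes */
        printf("U+%04X <%s>\n", (unsigned)cp, buf);
        s += bbx_utf8_skip(s);          /* advance by the sequence length */
    }
    return 0;
}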
Windows currently uses UTF-16 for its apis. To a first approximation, UTF-16 is the code points in 16 bit format. So when writing a UTF-8 based program, you need to convert the UTF-8 to UTF-16 (using wide chars) immediately before calling Windows output functions.
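A minimal sketch of that conversion step, assuming the Win32 API (MultiByteToWideChar and WriteConsoleW are the standard calls; the 256-entry buffer is an arbitrary choice for short strings):

#include <windows.h>

static void print_utf8(const char *utf8)
{
    wchar_t wbuf[256];                  /* assumes short strings */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, 256);
    if (n > 0) {
        DWORD written;
        /* n counts the terminating null, so write n - 1 chars */
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                      wbuf, n - 1, &written, NULL);
    }
}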
Support for UTF-8 via printf() is patchy. Passing a UTF-8 encoded string to printf() is unlikely to do what you want.
I have a text file which can contain a mix of Chinese, Japanese, Korean (CJK) and English characters. I have to validate the file for English characters. The file can be allowed to contain CJK characters only when a line begins with the '$' character, which represents a comment in my text file. Searching through the net, I found out that I can use fgetws() and the wchar_t type to read wide chars.
Q1) But I am wondering how CJK characters would be stored in my text file - what byte order, etc.
Q2) How can I loop through CJK characters? Since a character encoded in UTF-8 can take 1 to 4 bytes, I cannot just use i++.
Any help would be appreciated.
Thanks a lot.
You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:
std::shared_ptr<FILE> f(fopen(filename, "r"), fclose);
uint32_t c = 0;
while (utf8_read(f.get(), c))
{
    if (is_english_char(c))
        ...
    else if (is_cjk_char(c))
        ...
    else
        ...
}
Where utf8_read has the signature:
bool utf8_read(FILE *f, uint32_t &c);
Now, utf8_read may read 1-4 bytes depending on the value of the first byte. See http://en.wikipedia.org/wiki/UTF-8, google for an algorithm or use a library function already available to you.
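For illustration only, here is a minimal decoder along those lines - written in plain C with a pointer out-parameter instead of the C++ reference in the signature above, and assuming well-formed input (a production version should validate more carefully, e.g. reject overlong forms):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

bool utf8_read(FILE *f, uint32_t *c)
{
    int b0 = fgetc(f);
    if (b0 == EOF)
        return false;
    /* how many continuation bytes follow the lead byte? */
    int extra = (b0 & 0x80) == 0x00 ? 0
              : (b0 & 0xE0) == 0xC0 ? 1
              : (b0 & 0xF0) == 0xE0 ? 2
              : (b0 & 0xF8) == 0xF0 ? 3 : -1;
    if (extra < 0)
        return false;                     /* invalid lead byte */
    static const uint32_t mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
    *c = (uint32_t)b0 & mask[extra];
    while (extra--) {
        int b = fgetc(f);
        if (b == EOF || (b & 0xC0) != 0x80)
            return false;                 /* truncated or malformed */
        *c = (*c << 6) | (uint32_t)(b & 0x3F);
    }
    return true;
}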
With the UTF-32 codepoint, you can now check ranges. For English, you can check if it is ASCII (c <= 0x7F) or if it is a Latin character (including support for accented characters in words imported from e.g. French). You may also want to exclude non-printable control characters (e.g. 0x01).
For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
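As a rough illustration of the code-block approach, with names matching the loop above (deliberately incomplete - the CJK extension blocks, among others, are omitted; consult Blocks.txt for the full list):

#include <stdbool.h>
#include <stdint.h>

static bool is_english_char(uint32_t c)
{
    return c < 0x80;                        /* plain ASCII */
}

static bool is_cjk_char(uint32_t c)
{
    return (c >= 0x4E00 && c <= 0x9FFF)     /* CJK Unified Ideographs */
        || (c >= 0x3040 && c <= 0x309F)     /* Hiragana */
        || (c >= 0x30A0 && c <= 0x30FF)     /* Katakana */
        || (c >= 0xAC00 && c <= 0xD7AF);    /* Hangul Syllables */
}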
If you are using a library with Unicode support that has writing script detection (e.g. the glib library), you can use the script type to detect the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt:
Name : Code : Language(s)
=========:===========:========================================================
Common : Zyyy : general punctuation / symbol characters
Latin : Latn : Latin languages (English, German, French, Spanish, ...)
Han : Hans/Hant : Chinese characters (Chinese, Japanese)
Hiragana : Hira : Japanese
Katakana : Kana : Japanese
Hangul : Hang : Korean
NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry (Type == 'script').
I am pasting a sample program to illustrate wchar_t handling. Hope it helps someone.
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define BUFLEN 1024

int main() {
    wchar_t *wmessage = L"Lets- beginめん(下) 震災後、保存-食で-脚光-(経済ナビゲーター)-lets- end";
    wchar_t warray[BUFLEN + 1];
    wchar_t a = L'z';
    int i = 0;
    FILE *fp;
    wchar_t *token = L"-";
    wchar_t *state;
    wchar_t *ptr;

    setlocale(LC_ALL, "");

    /* File in current directory containing CJK chars */
    fp = fopen("input", "r");
    if (fp == NULL) {
        printf("%s\n", "Cannot open file!!!");
        return (-1);
    }
    fgetws(warray, BUFLEN, fp);
    wprintf(L"\n*********************START reading from file*******************************\n");
    wprintf(L"%ls\n", warray);
    wprintf(L"\n*********************END reading from file*******************************\n");
    fclose(fp);

    wprintf(L"printing character %lc = <0x%x>\n", a, a);

    wprintf(L"\n*********************START Checking string for Japanese*******************************\n");
    for (i = 0; wmessage[i] != L'\0'; i++) {
        if (wmessage[i] > 0x7F) {
            wprintf(L"\n This is non-ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        } else {
            wprintf(L"\n This is ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        }
    }
    wprintf(L"\n*********************END Checking string for Japanese*******************************\n");

    wprintf(L"\n*********************START Tokenizing******************************\n");
    state = wcstok(warray, token, &ptr);
    while (state != NULL) {
        wprintf(L"\n %ls", state);
        state = wcstok(NULL, token, &ptr);
    }
    wprintf(L"\n*********************END Tokenizing******************************\n");
    return 0;
}
You need to understand UTF-8 and use some UTF-8 handling library (or code your own). FYI, GLib (from GTK) has UTF-8 handling functions which can deal with variable-length UTF-8 characters and strings. There are other UTF-8 libraries, e.g. iconv (inside GNU libc), ICU, and many others.
UTF-8 fully defines the byte order and content of multi-byte UTF-8 sequences, e.g. for Chinese characters.
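As a sketch, iterating a UTF-8 string with GLib might look like the following, assuming GLib is installed (build flags typically come from pkg-config --cflags --libs glib-2.0):

#include <glib.h>
#include <stdio.h>

int main(void)
{
    const gchar *s = "中文 test";        /* mixed UTF-8 input */
    for (const gchar *p = s; *p; p = g_utf8_next_char(p)) {
        gunichar c = g_utf8_get_char(p);
        printf("U+%04X\n", (unsigned)c); /* one code point per character */
    }
    return 0;
}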
I am having problems with converting UTF-8 to Unicode.
Below is the code:
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int charset_convert(char *string, char *to_string, char *charset_from, char *charset_to)
{
    char *from_buf, *to_buf, *pointer;
    size_t inbytesleft, outbytesleft, ret;
    iconv_t cd;

    if (!charset_from || !charset_to || !string) /* sanity check */
        return -1;

    if (strlen(string) < 1)
        return 0; /* we are done, nothing to convert */

    cd = iconv_open(charset_to, charset_from);
    /* Did I succeed in getting a conversion descriptor? */
    if (cd == (iconv_t)(-1)) {
        /* I guess not */
        printf("Failed to convert string from %s to %s ",
               charset_from, charset_to);
        return -1;
    }

    from_buf = string;
    inbytesleft = strlen(string);

    /* allocate max sized buffer,
       assuming target encoding may be 4 byte unicode */
    outbytesleft = inbytesleft * 4;
    pointer = to_buf = (char *)malloc(outbytesleft);
    memset(to_buf, 0, outbytesleft);

    ret = iconv(cd, &from_buf, &inbytesleft, &pointer, &outbytesleft);
    memcpy(to_string, to_buf, pointer - to_buf);

    iconv_close(cd);
    free(to_buf);
    return (int)(pointer - to_buf);
}
main():
int main()
{
    char UTF[] = {'A', 'B', '\0'};  /* null-terminated so strlen() is safe */
    char Unicode[1024] = {0};
    char *ptr;

    charset_convert(UTF, Unicode, "UTF-8", "UNICODE");

    ptr = Unicode;
    while (*ptr != '\0')
    {
        printf("Unicode %x \n", *ptr);
        ptr++;
    }
    return 0;
}
It should give A and B, but I am getting:
ffffffff
fffffffe
41
Thanks,
Sandeep
It looks like you are getting UTF-16 out in a little endian format:
ff fe 41 00 ...
Which is U+FEFF (ZWNBSP aka byte order mark), U+0041 (latin capital letter A), ...
You then stop printing because your while loop has terminated on the first null byte. The following bytes should be: 42 00.
You should either return a length from your function or make sure that the output is terminated with a null character (U+0000) and loop until you find this.
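For instance, assuming the output buffer holds UTF-16LE terminated by a U+0000 (two zero bytes), the printing loop could look like this sketch:

/* walk 16-bit units until the U+0000 terminator */
for (unsigned char *p = (unsigned char *)Unicode; p[0] || p[1]; p += 2) {
    unsigned unit = p[0] | (p[1] << 8);   /* assemble a little-endian unit */
    printf("Unicode %04x\n", unit);
}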
UTF-8 is Unicode.
You do not need to convert unless you need some other type of Unicode encoding, like UTF-16 or UTF-32.
UTF is not Unicode. UTF is an encoding of the integers in the Unicode standard. The question, as asked, makes no sense. If you mean you want to convert from (any) UTF to the Unicode code point (i.e. the integer that stands for an assigned code point, roughly a character), then you need to do a bit of reading, but it involves bit-shifting the values of the 1, 2, 3 or 4 bytes in a UTF-8 byte sequence (see Wikipedia; Markus Kuhn's text is also excellent). For example, the two-byte sequence 0xD8 0xA8 decodes as ((0xD8 & 0x1F) << 6) | (0xA8 & 0x3F) = 0x628, i.e. U+0628.
Unless I am missing something (nobody has pointed it out yet), "UNICODE" isn't a valid encoding name in libiconv, as it is the name of a family of encodings.
http://www.gnu.org/software/libiconv/
(edit) Actually, iconv -l shows UNICODE as a listed entry but gives no details. In the source code it's listed in the notes as an alias for UNICODE-LITTLE, but the subnotes mention:
* UNICODE (big endian), UNICODEFEFF (little endian)
We DON'T implement these because they are stupid and not standardized.
In the aliases header files UNICODELITTLE (no hyphen) resolves as follows:
lib/aliases.gperf:UNICODELITTLE, ei_ucs2le
i.e. UCS2-LE (UTF-16 Little Endian), which should match Windows internal "Unicode" encoding.
http://en.wikipedia.org/wiki/UTF-16/UCS-2
However, you are clearly recommended to explicitly specify UCS-2LE or UCS-2BE unless the first bytes are a Byte Order Mark (BOM), value 0xfeff, indicating the byte order scheme.
=> You are seeing the BOM as the first bytes of the output because that is what the "UNICODE" encoding name means, it means UCS2 with a header indicating the byte order scheme.
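A minimal sketch of the fix, assuming your iconv accepts the UCS-2LE name - requesting an explicit byte order means no BOM is prepended:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    char in[] = "AB", out[16] = {0};
    char *pin = in, *pout = out;
    size_t inleft = 2, outleft = sizeof out;

    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");   /* explicit byte order */
    if (cd == (iconv_t)-1)
        return 1;
    iconv(cd, &pin, &inleft, &pout, &outleft);
    iconv_close(cd);

    for (char *p = out; p < pout; p++)
        printf("%02x ", (unsigned char)*p);        /* expect: 41 00 42 00 */
    printf("\n");
    return 0;
}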