First, this C project has some coding conditions: I can't declare a variable and assign a value to it on the same line of code, and we are only allowed to use while loops. Also, I'm using Ubuntu for reference.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first character of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
#include <stdio.h>

int main (int argc, char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf ("%i ", argv[1][i]);
            i++;
        }
    }
}
I did some research and found that I should try unsigned char argv instead of char, like this:
#include <stdio.h>

int main (int argc, unsigned char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf("%i ", argv[1][i]);
            i++;
        }
    }
}
In this case, I run the program with a 'Ç' and the output is 195 135 (still wrong).
How can I make this program print the right decimal value of a char from the extended ASCII table? In this case, a 'Ç' should be 128.
Thank you!!
Your platform is using UTF-8 encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
Those bytes in decimal are 195 and 135, which is what you see in the output.
Remember that UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 through 127).
That character is code point 128 in extended ASCII, but UTF-8 diverges from extended ASCII in that range.
You may find there are tools on your platform to convert that to extended ASCII, but I suspect you don't want to do that; you should instead work with the encoding your platform supports (which I am sure is UTF-8).
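For completeness, if you really did need the extended-ASCII (code page 437) value 128, iconv can do the conversion. Here is a minimal sketch, assuming your iconv installation knows the "CP437" encoding (glibc's does; some systems also need -liconv at link time):

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    char in[] = "\xC3\x87";              /* UTF-8 bytes for 'Ç' */
    char out[8];
    char *inp = in;
    char *outp = out;
    size_t inleft = sizeof in - 1;
    size_t outleft = sizeof out;
    iconv_t cd = iconv_open("CP437", "UTF-8");

    if (cd != (iconv_t)-1
        && iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
        printf("%d\n", (unsigned char)out[0]);   /* prints 128 */
    if (cd != (iconv_t)-1)
        iconv_close(cd);
    return 0;
}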
That said, 'Ç' is Unicode code point 199, so unless you have a specific application for extended ASCII, you'll probably just make things worse by converting to it, not least because extended ASCII is a much smaller set of characters than Unicode.
Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main (int argc, char **argv)
{
    setlocale(LC_CTYPE, "");

    if (argc == 2)
    {
        char *p = argv[1];
        int n;
        wchar_t wc;

        while((n = mbtowc(&wc, p, strlen(p))) > 0)
        {
            printf ("%lc: %d (%d)\n", wc, wc, n);
            p += n;
        }
    }
}
You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it consumed, 0 if it encountered the end of the string, or -1 on an invalid sequence.
When I run this program on the string abÇd, it prints
a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports potentially multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment when selecting a locale for the program's functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫😊":
a: 97 (1)
b: 98 (1)
Σ: 931 (2)
∫: 8747 (3)
😊: 128522 (4)
Emoji like 😊 typically take four bytes to encode in UTF-8.
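If you want to see those four bytes for yourself, here is a quick sketch (assuming a C11 compiler for the u8 literal) that dumps each byte of the encoded emoji in hex:

#include <stdio.h>

int main(void)
{
    const char *s = u8"😊";              /* U+1F60A */
    while (*s)
        printf("%02X ", (unsigned char)*s++);
    putchar('\n');                       /* prints F0 9F 98 8A */
}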
I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.
I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.
My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.
I've spent basically the whole day researching Unicode and trying out different methods, and I can't get any of them to let me store an Arabic character as a character.
I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...
I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but Unicode is completely instrumental to my work, so I want to work out how to do it from the beginning.
My understanding of how Unicode works (in case that's where I'm going wrong):
I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2-byte sequence 0xd8 0xa8, which indicates the code point U+0628.
I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.
I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب.
My understanding of the ways I can process unicode in C:
Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)
Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard-code a Unicode character, enter it as an array in the format:
const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";
My problems with this:
I need to manipulate individual characters
Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.
Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)
Swap chars for wchars, which hold 2 to 4 bytes depending on the compiler. String functions like strlen will not work, as they expect characters to be one byte, but there are wide functions like wprintf I can use instead.
My problem with this:
I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.
I've tried inputting the Unicode code point as well as the actual Arabic character, and I've tried printing them both to the console and to a UTF-8 encoded text file, and I get the same result, even though both the console and the text file display Arabic text if entered as a char*. I've included my code at the end.
(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things are really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)
Option C - Use external libraries
I've read in various comments that external libraries are the way to go so I've tried:
C programming library
http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.
My problem:
While I can set the character to be an unsigned long integer, I can't print it out, because the printf and wprintf functions don't work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either).
ICU library
My problem:
I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.
My code
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>

int main ()
{
    wchar_t unicode = L'\xd8ac';
    wchar_t arabic = L'ب';
    wchar_t number = 0x062c;
    FILE* f;
    f = fopen("unitest.txt","w");
    char* string = "ايه الاخبار";

    //printf - works
    printf("printf - literal arabic character is \"م\"\n");
    fprintf(f,"printf - literal arabic character is \"م\"\n");
    printf("printf - char* string is \"%s\"\n",string);
    fprintf(f,"printf - char* string is \"%s\"\n",string);

    //wprintf - english - works
    wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
    fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

    //wprintf - arabic - doesnt work
    wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
    fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

    fclose(f);
    return 0;
}
Output file
printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"
wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""
I'm using Windows 10, Notepad++ and MinGW.
Edit
This got marked as a duplicate of Light C Unicode Library, but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...
I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint32_t or something instead of wchar_t, but how do I then print those datatypes? Can I do it with wprintf?!
C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.
Is it ... possible to store and process individual UTF-8 characters ...?
The first step is to make certain "ايه الاخبار" is a UTF-8 encoded string. C supports this explicitly with u8"ايه الاخبار".
A UTF-8 string is a sequence of char. Each Unicode character is encoded in 1 to 4 chars, and a Unicode code point needs up to 21 bits. Yet the OP does not need to convert a portion of string[] into a Unicode code point so much as to segment the string on UTF-8 character boundaries. Those boundaries are readily found by looking for UTF-8 continuation bytes.
The following code copies one Unicode character at a time into a short UTF-8 string with a terminating null character, then prints that short string.
#include <stdio.h>

int main(void)
{
    char *string = u8"ايه الاخبار";
    for (char *s = string; *s; ) {
        printf("<");
        char u[5];                              /* at most 4 UTF-8 bytes + null */
        char *p = u;
        *p++ = *s++;                            /* copy the lead byte */
        if ((*s & 0xC0) == 0x80) *p++ = *s++;   /* copy up to 3 continuation bytes */
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        *p = 0;
        printf("%s", u);
        printf(">\n");
    }
}
With the output viewed on a UTF-8 aware screen:
<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
An example that iterates with the utf8proc library:
#include <utf8proc.h>
#include <stdio.h>

int main(void) {
    utf8proc_uint8_t const string[] = u8"ايه الاخبار";
    utf8proc_ssize_t size = sizeof string / sizeof *string - 1;
    utf8proc_int32_t data;
    utf8proc_ssize_t n;
    utf8proc_uint8_t const *pstring = string;

    while ((n = utf8proc_iterate(pstring, size, &data)) > 0) {
        printf("<%.*s>\n", (int)n, pstring);
        pstring += n;
        size -= n;
    }
}
This is probably not the best way to use this library, but I opened an issue on GitHub asking for examples, because I was unable to work out how the library works from its documentation.
You need to very clearly understand the difference between a Unicode code point and UTF-8. UTF-8 is a variable-length byte encoding of Unicode code points. The lower end, values 0-127, is stored as a single byte. That's the main point of UTF-8, and it makes UTF-8 backwards compatible with ASCII.
When bit 7 is set, for values over 127, a variable-length code of two bytes or more is used. The leading byte always has the bit pattern 11xxxxxx, and each continuation byte has the pattern 10xxxxxx.
Here's code to get the skip (the number of bytes a character occupies), to read a code point, and to write one.
static const unsigned int offsetsFromUTF8[6] =
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;

    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch>>6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch>>12) | 0xE0;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch>>18) | 0xF0;
        *dest++ = ((ch>>12) & 0x3F) | 0x80;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;

    return dest - out;
}
Using these functions or similar ones, you can convert between code points and UTF-8 and back.
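For example, a short walk over a UTF-8 string with the two reading functions might look like this (a sketch, assuming the definitions above are in scope and a C11 compiler for the u8 literal):

#include <stdio.h>

int main(void)
{
    const char *s = u8"abÇ";
    while (*s)
    {
        printf("U+%04X\n", (unsigned)bbx_utf8_getch(s));   /* U+0061 U+0062 U+00C7 */
        s += bbx_utf8_skip(s);
    }
    return 0;
}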
Windows currently uses UTF-16 for its APIs. To a first approximation, UTF-16 is the code points in 16-bit format. So when writing a UTF-8 based program, you need to convert the UTF-8 to UTF-16 (using wide chars) immediately before calling Windows output functions.
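A minimal sketch of that conversion step, using the Win32 MultiByteToWideChar function with CP_UTF8 (the fixed buffer size of 16 is just an assumption for this example):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *utf8 = "\xC3\x87";   /* UTF-8 bytes for 'Ç' */
    wchar_t wide[16];
    /* -1: treat the input as null-terminated; the result count includes the null */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 16);

    if (n > 0)
        printf("converted to %d UTF-16 code unit(s); first is U+%04X\n",
               n - 1, (unsigned)wide[0]);
    return 0;
}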
Support for UTF-8 via printf() is patchy. Passing a UTF-8 encoded string to printf() is unlikely to do what you want.
I'm writing a little wrapper for an application that uses files as arguments.
The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I have. Now I have a problem: I need the program's arguments in an array of wchar_t's and in a wchar_t string.
Is it possible? I'm defining the main function as
int main(int argc, char *argv[])
Should I use wchar_t's for argv?
Thank you very much. I can't seem to find useful info on how to use Unicode properly in C.
In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.
Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
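For example, here is a minimal sketch of that approach using the standard mbstowcs function, assuming the locale's encoding is UTF-8:

#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    setlocale(LC_CTYPE, "");                 /* pick up the (likely UTF-8) locale */
    if (argc > 1) {
        /* first call computes the length, second call converts */
        size_t n = mbstowcs(NULL, argv[1], 0);
        if (n != (size_t)-1) {
            wchar_t *warg = malloc((n + 1) * sizeof *warg);
            if (warg != NULL) {
                mbstowcs(warg, argv[1], n + 1);
                wprintf(L"%ls has %zu characters\n", warg, n);
                free(warg);
            }
        }
    }
    return 0;
}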
On a Mac (10.5.8, Leopard), I got:
Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:
That's all UTF-8 encoded. (odx is a hex dump program).
See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment
Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style wchar_t[] array, even if the app is not compiled for Unicode.
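A minimal sketch of that approach (Windows only; CommandLineToArgvW lives in shellapi.h and requires linking against Shell32):

#include <windows.h>
#include <shellapi.h>
#include <wchar.h>

int main(void)
{
    int wargc;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);

    if (wargv != NULL) {
        for (int i = 0; i < wargc; i++)
            wprintf(L"arg %d: %ls\n", i, wargv[i]);
        LocalFree(wargv);   /* the array is allocated with LocalAlloc */
    }
    return 0;
}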
On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.
Assuming that your Linux environment uses UTF-8 encoding then the following code will prepare your program for easy Unicode treatment in C++:
int main(int argc, char * argv[]) {
    std::setlocale(LC_CTYPE, "");
    // ...
}
Next, the wchar_t type is 32-bit on Linux, which means it can hold individual Unicode code points, so you can safely use the wstring type for classical string processing in C++ (character by character). With the setlocale call above, inserting into wcout will automatically translate your output into UTF-8, and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that the argv[i] strings are still UTF-8 encoded.
You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted, it will return the properly converted characters up to the place where the UTF-8 rules were broken. You could improve it if you need more error reporting, but for argv data one can safely assume the input is correct UTF-8:
#include <string>

using std::wstring;

#define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))

wstring Convert(const char * s) {
    typedef unsigned char byte;
    struct Level {
        byte Head, Data, Null;
        Level(byte h, byte d) {
            Head = h;       // the head shifted to the right
            Data = d;       // number of data bits
            Null = h << d;  // encoded byte with zero data bits
        }
        bool encoded(byte b) { return b>>Data == Head; }
    }; // struct Level
    Level lev[] = {
        Level(2, 6),
        Level(6, 5),
        Level(14, 4),
        Level(30, 3),
        Level(62, 2),
        Level(126, 1)
    };
    wchar_t wc = 0;
    const char * p = s;
    wstring result;
    while (*p != 0) {
        byte b = *p++;
        if (b>>7 == 0) { // deal with ASCII
            wc = b;
            result.push_back(wc);
            continue;
        } // ASCII
        bool found = false;
        for (int i = 1; i < ARR_LEN(lev); ++i) {
            if (lev[i].encoded(b)) {
                wc = b ^ lev[i].Null; // remove the head
                wc <<= lev[0].Data * i;
                for (int j = i; j > 0; --j) { // trailing bytes
                    if (*p == 0) return result; // unexpected
                    b = *p++;
                    if (!lev[0].encoded(b)) // encoding corrupted
                        return result;
                    wchar_t tmp = b ^ lev[0].Null;
                    wc |= tmp << lev[0].Data*(j-1);
                } // trailing bytes
                result.push_back(wc);
                found = true;
                break;
            } // lev[i]
        } // for lev
        if (!found) return result; // encoding incorrect
    } // while
    return result;
} // wstring Convert
On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR *argv[] if _UNICODE is defined, and char *argv[] if not.
If you want to have your main method work cross platform, you can define your own macros to the same effect.
TCHAR.h contains a number of convenience macros for conversion between wchar and char.
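For example, a minimal sketch of a TCHAR-based entry point, assuming the Microsoft toolchain (where %s in _tprintf follows the TCHAR width):

#include <tchar.h>
#include <stdio.h>

/* _tmain expands to wmain when _UNICODE is defined, plain main otherwise */
int _tmain(int argc, TCHAR *argv[])
{
    int i;
    for (i = 0; i < argc; i++)
        _tprintf(_T("arg %d: %s\n"), i, argv[i]);
    return 0;
}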