Unable to print the character 'à' with the printf function in C

I would like to understand why I can print the character 'à' with the functions fopen and fgetc when I read a .txt file but I can't assign it to a char variable and print it with the printf function.
When I read the file texte.txt, the output is:
(Here is a letter that we often use in French: à)
The letter 'à' is correctly read by the fgetc function and assigned to the char variable c.
See the code below:
int main() {
    FILE *fp;
    fp = fopen("texte.txt", "r");
    if (fp == NULL) {
        printf("erreur fopen");
        return 1;
    }
    char c = fgetc(fp);
    while (c != EOF) {
        printf("%c", c);
        c = fgetc(fp);
    }
    printf("\n");
    return 0;
}
But now if I try to assign the 'à' character to a char variable, I get an error!
See the code below:
int main() {
    char myChar = 'à';
    printf("myChar is: %c\n", myChar);
    return 0;
}
ERROR:
./main.c:26:15: error: character too large for enclosing character literal type
char myChar = 'à';
My knowledge of C is very limited, and I can't find an answer anywhere.

To print à you can use a wide character (or a wide string):
#include <wchar.h>  // wchar_t
#include <stdio.h>
#include <locale.h> // setlocale, LC_ALL

int main() {
    setlocale(LC_ALL, "");
    wchar_t a = L'à';
    printf("%lc\n", a);
}
In short: characters have an encoding. The program's "locale" chooses which encoding the standard library functions use. A wide character represents a locale-agnostic character, one that is "able" to be converted to/from any locale. setlocale sets your program's locale to the locale of your terminal; this is needed so that printf knows how to convert the wide character à to your terminal's encoding. An L in front of a character or string literal makes it wide. On Linux, wide characters are UTF-32.
Handling encodings can be hard. Some starting points: https://en.wikipedia.org/wiki/Character_encoding , https://en.wikipedia.org/wiki/ASCII , https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html , https://en.cppreference.com/w/cpp/locale , https://en.wikipedia.org/wiki/Unicode .
You can also put a multibyte string straight in your source code and output it. This works only if your compiler encodes the string in the same encoding your terminal uses. If you change your terminal's encoding, or tell your compiler to use a different one, it may break. On Linux, UTF-8 is everywhere: compilers generate UTF-8 strings and terminals understand UTF-8.
const char *str = "à";
printf("%s\n", str);

Related

Why can't I print the decimal value of an extended ASCII char like 'Ç'? in C

First, in this C project we have some constraints on how we write code: I can't declare a variable and assign a value to it on the same line, and we are only allowed to use while loops. Also, I'm using Ubuntu for reference.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first char of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
int main (int argc, char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf ("%i ", argv[1][i]);
            i++;
        }
    }
}
I did some research and found that I should try unsigned char instead of char for argv, like this:
int main (int argc, unsigned char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf("%i ", argv[1][i]);
            i++;
        }
    }
}
In this case, I run the program with a 'Ç' and the output is 195 135 (still wrong).
How can I make this program print the right decimal value of a char from the extended ASCII table, where a 'Ç' should be 128?
Thank you!!
Your platform is using UTF-8 Encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
In turn those bytes in decimal are 195 and 135 which you see in output.
Remember, UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 through 127).
That character is code point 128 in extended ASCII, but UTF-8 diverges from extended ASCII in that range.
You may find tools on your platform to convert that to extended ASCII, but I suspect you don't want to do that; you should work with the encoding your platform supports (which I am sure is UTF-8).
It's Unicode code point 199, so unless you have a specific application for extended ASCII, converting to it will probably just make things worse, not least because it's a much smaller set of characters than Unicode.
Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main (int argc, char **argv)
{
    setlocale(LC_CTYPE, "");
    if (argc == 2)
    {
        char *p = argv[1];
        int n;
        wchar_t wc;
        while ((n = mbtowc(&wc, p, strlen(p))) > 0)
        {
            printf ("%lc: %d (%d)\n", wc, wc, n);
            p += n;
        }
    }
}
You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument — here, into the variable wc. It returns the number of bytes it consumed, or 0 if it encountered the end of the string.
When I run this program on the string abÇd, it prints
a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment when selecting the locale for the program's functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫😊":
a: 97 (1)
b: 98 (1)
Σ: 931 (2)
∫: 8747 (3)
😊: 128522 (4)
Emoji like 😊 typically take four bytes to encode in UTF-8.

I would like to print superscript and subscript with printf, like x¹?

I want to print out a polynomial expression in C, but I don't know how to print x to the power of a number with printf.
It's far from trivial, unfortunately. You cannot achieve what you want with printf; you need wprintf. Furthermore, it's not trivial to translate between normal and superscript digits. You would want a function like this:
wchar_t digit_to_superscript(int d) {
    wchar_t table[] = { // Unicode values
        0x2070,
        0x00B9, // Note that 1, 2 and 3 do not follow the pattern.
        0x00B2, // That's because those three were common in various
        0x00B3, // extended ASCII tables; the rest did not exist
        0x2074, // before Unicode.
        0x2075,
        0x2076,
        0x2077,
        0x2078,
        0x2079,
    };
    return table[d];
}
This function could of course be changed to handle other characters too, as long as they are supported. And you could also write more complete functions operating on complete strings.
But as I said, it's not trivial, and it cannot be done with simple format strings to printf, and not even to wprintf.
Here is a somewhat working example. It's usable, but very short because I have omitted all error checking and such: just the shortest code able to print a negative float number as an exponent.
#include <wchar.h>
#include <locale.h>

wchar_t char_to_superscript(wchar_t c) {
    wchar_t digit_table[] = {
        0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074,
        0x2075, 0x2076, 0x2077, 0x2078, 0x2079,
    };
    if (c >= '0' && c <= '9') return digit_table[c - '0'];
    switch (c) {
        case '.': return 0x22C5; // dot operator
        case '-': return 0x207B; // superscript minus
    }
    return c; // no superscript counterpart: return the character unchanged
}
void number_to_superscript(wchar_t *dest, wchar_t *src) {
    while (*src) {
        *dest = char_to_superscript(*src);
        src++;
        dest++;
    }
    *dest = 0; // dest already points one past the last character written
}
And a main function to demonstrate:
int main(void) {
    setlocale(LC_CTYPE, "");
    double x = -3.5;
    wchar_t wstr[100], a[100];
    swprintf(a, 100, L"%f", x);
    wprintf(L"Number as a string: %ls\n", a);
    number_to_superscript(wstr, a);
    wprintf(L"Number as exponent: x%ls\n", wstr);
}
Output:
Number as a string: -3.500000
Number as exponent: x⁻³⋅⁵⁰⁰⁰⁰⁰
In order to make a complete translator, you would need something like this:
size_t superscript_index(wchar_t c) {
    // Code
}

wchar_t to_superscript(wchar_t c) {
    static wchar_t huge_table[] = {
        // Long list of values
    };
    return huge_table[superscript_index(c)];
}
Remember that this cannot be done for all characters. Only those whose counterpart exists as a superscript version.
Unfortunately, it is not possible to output formatted text with printf.
(Of course one could output HTML format, but this then would need to be fed into an interpreter first for correct display)
So you cannot print text in superscript format in the general case.
What you have found is the superscript 1 as a special character. However, this is only possible for 1, 2 and 3 (and only with the right code page, not in plain ASCII).
The common way to print "superscripts" is to use the x^2, x^3 syntax. This is commonly understood.
An alternative is provided by klutt's answer. If you switch to Unicode by using wprintf instead of printf, you can use all the superscript characters from 0 to 9. Even though I am not sure what multi-digit exponents look like in a fixed-width terminal, it works in principle.
If you want to print superscript 1, you need to use Unicode. You can combine Unicode superscripts to write a multi-digit number.
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "");
    wchar_t one = 0x00B9;
    wchar_t two = 0x00B2;
    wprintf(L"x%lc\n", one);
    wprintf(L"x%lc%lc\n", one, two);
}
Output:
$ clang ~/lab/unicode.c
$ ./a.out
x¹
x¹²
Ref: https://www.compart.com/en/unicode/U+00B9

Displaying a wide character in hexadecimal shows an unexpected result

I am trying to display a wide character in hexadecimal, but it gives me an unexpected result: the output is always something like a 2-digit hex value. Here is my code:
#include "stdlib.h"
#include "stdio.h"
#include "wchar.h"
#include "locale.h"

int main() {
    setlocale(LC_ALL, "");
    wchar_t ch;
    wscanf(L"%lc", &ch);
    wprintf(L"%x \n", ch);
    return 0;
}
input : Ω
result: 0xea
expected result : 0xcea9
I changed setlocale several times but the results always be the same.
Notice: when the input character fits in a single byte, it works as expected.
Note that you should use <...> for including standard headers. The line wprintf(L"%x \n", ch) is invalid; it's most probably undefined behavior, because ch is (possibly) not an unsigned int, and you can't apply %x to it.
You are expecting wide characters to be stored in UTF-8. Well, that wouldn't make much sense; they are not. Your program reads a sequence of bytes in a multibyte encoding, and that sequence of bytes is then converted (depending on the locale) to the wide character encoding. The wide character encoding (usually) stays the same, and should be UTF-32 on Linux. The locale affects how multibyte characters are converted to wide characters and back, not the representation of wide characters.
The following program:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_ALL, "");
    wchar_t ch;
    int cnt = wscanf(L"%lc", &ch);
    if (cnt != 1) { /* handle error */ abort(); }
    wprintf(L"%x\n", (unsigned int)ch);
    return 0;
}
On Linux, when given Greek Capital Letter Omega Ω (U+03A9) as input, the program outputs 3a9. What actually happens is that the terminal reads the UTF-8 encoded character, so it reads two bytes 0xCE 0xA9, then converts them to UTF-32 and stores the result in the wide character. You may convert the wide character from the wide character encoding (UTF-32) to the multibyte character encoding (UTF-8 should be the default, but it depends on the locale) and print the bytes that represent the character in the multibyte encoding:
char tmp[MB_CUR_MAX];
int len = wctomb(tmp, ch); // prefer wcrtomb
if (len < 0) { /* handle error */ abort(); }
for (int i = 0; i < len; ++i) {
    wprintf(L"%hhx", (unsigned char)tmp[i]);
}
wprintf(L"\n");
That will output cea9 on my platform.

Why Unicode characters are not displayed properly in terminal with GCC?

I've written a small C program:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h> // wprintf

int main() {
    wprintf(L"%s\n", setlocale(LC_ALL, "C.UTF-8"));
    wchar_t chr = L'┐';
    wprintf(L"%c\n", chr);
}
Why doesn't this print the character ┐ ?
Instead it prints gibberish.
I've checked:
tried compiling without setlocale, same result
the terminal itself can print the character, I can copy-paste it to terminal from text-editor, it's gnome-terminal on Ubuntu
GCC version is 4.8.2
wprintf is a version of printf which takes a wide string as its format string, but otherwise behaves just the same: %c is still treated as char, not wchar_t. So instead you need to use %lc to format a wide character. And since your strings are ASCII you may as well use printf. For example:
int main() {
    printf("%s\n", setlocale(LC_ALL, "C.UTF-8"));
    wchar_t chr = L'┐';
    printf("%lc\n", chr);
}

Reading CJK characters from an input file in C

I have a text file which can contain a mix of Chinese, Japanese, Korean (CJK) and English characters. I have to validate the file for English characters. The file can be allowed to contain CJK characters only when a line begins with the '$' character, which represents a comment in my text file. Searching through the net, I found out that I can use fgetws() and the wchar_t type to read wide chars.
Q1) But I am wondering how CJK characters would be stored in my text file - what byte order etc.
Q2) How can I loop through CJK characters? Since UTF-8 characters can take 1 to 4 bytes, I cannot simply use i++.
Any help would be appreciated.
Thanks a lot.
You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:
std::shared_ptr<FILE> f(fopen(filename, "r"), fclose);
uint32_t c = 0;
while (utf8_read(f.get(), c))
{
    if (is_english_char(c))
        ...
    else if (is_cjk_char(c))
        ...
    else
        ...
}
Where utf8_read has the signature:
bool utf8_read(FILE *f, uint32_t &c);
Now, utf8_read may read 1-4 bytes depending on the value of the first byte. See http://en.wikipedia.org/wiki/UTF-8, google for an algorithm or use a library function already available to you.
With the UTF-32 codepoint, you can now check ranges. For English, you can check if it is ASCII (c < 0x7F) or if it is a Latin character (Including support for accented characters for imported words from e.g. French). You may also want to exclude non-printable control characters (e.g. 0x01).
For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
If you are using a library with Unicode support that has writing script detection (e.g. the glib library), you can use the script type to detect the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt:
Name : Code : Language(s)
=========:===========:========================================================
Common : Zyyy : general punctuation / symbol characters
Latin : Latn : Latin languages (English, German, French, Spanish, ...)
Han : Hans/Hant : Chinese characters (Chinese, Japanese)
Hiragana : Hira : Japanese
Katakana : Kana : Japanese
Hangul : Hang : Korean
NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry (Type == 'script').
I am pasting a sample program to illustrate wchar_t handling. Hope it helps someone.
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define BUFLEN 1024

int main() {
    wchar_t *wmessage = L"Lets- beginめん(下) 震災後、保存-食で-脚光-(経済ナビゲーター)-lets- end";
    wchar_t warray[BUFLEN + 1];
    wchar_t a = L'z';
    int i = 0;
    FILE *fp;
    wchar_t *token = L"-";
    wchar_t *state;
    wchar_t *ptr;

    setlocale(LC_ALL, "");

    /* File in current directory containing CJK chars */
    fp = fopen("input", "r");
    if (fp == NULL) {
        printf("%s\n", "Cannot open file!!!");
        return (-1);
    }
    fgetws(warray, BUFLEN, fp);
    wprintf(L"\n*********************START reading from file*******************************\n");
    wprintf(L"%ls\n", warray);
    wprintf(L"\n*********************END reading from file*******************************\n");
    fclose(fp);

    wprintf(L"printing character %lc = <0x%x>\n", a, a);

    wprintf(L"\n*********************START Checking string for Japanese*******************************\n");
    for (i = 0; wmessage[i] != '\0'; i++) {
        if (wmessage[i] > 0x7F) {
            wprintf(L"\n This is non-ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        } else {
            wprintf(L"\n This is ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        }
    }
    wprintf(L"\n*********************END Checking string for Japanese*******************************\n");

    wprintf(L"\n*********************START Tokenizing******************************\n");
    state = wcstok(warray, token, &ptr);
    while (state != NULL) {
        wprintf(L"\n %ls", state);
        state = wcstok(NULL, token, &ptr);
    }
    wprintf(L"\n*********************END Tokenizing******************************\n");
    return 0;
}
You need to understand UTF-8 and use some UTF-8 handling library (or write your own). FYI, GLib (from GTK) has UTF-8 handling functions which can deal with variable-length UTF-8 characters and strings. There are other UTF-8 libraries, e.g. iconv (inside GNU libc), ICU, and many others.
UTF-8 does define the byte order and content of multi-byte UTF-8 characters, e.g. Chinese ones.
