Displaying a wide character in hexadecimal shows an unexpected result - C

I am trying to display a wide character in hexadecimal, but I get unexpected results: the output is always only a 2-digit hex value. Here is my code:
#include "stdlib.h"
#include "stdio.h"
#include"wchar.h"
#include "locale.h"
int main(){
setlocale(LC_ALL,"");
wchar_t ch;
wscanf (L"%lc",&ch);
wprintf(L"%x \n",ch);
return 0;
}
input: Ω
result: 0xea
expected result: 0xcea9
I changed the setlocale argument several times, but the result was always the same.
Notice:
When the input character fits in a single byte, it works as expected.

Note that you should use <...> for including standard headers. The line wprintf(L"%x", ch) is invalid because it is most probably undefined behavior: ch is (most likely) not an unsigned int, so you can't apply %x to it.
You are expecting that wide characters will be stored in UTF-8. They are not, and that wouldn't make much sense. Your program reads a sequence of bytes in the multibyte encoding, and that sequence of bytes is then converted (depending on the locale) to the wide character encoding. The wide character encoding (usually) stays the same and should be UTF-32 on Linux. The locale affects the way multibyte characters are converted to wide characters and back, not the representation of wide characters.
The following program:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(){
setlocale(LC_ALL,"");
wchar_t ch;
int cnt = wscanf(L"%lc",&ch);
if (cnt != 1) { /* handle error */ abort(); }
wprintf(L"%x\n", (unsigned int)ch);
return 0;
}
On Linux, when given the Greek Capital Letter Omega Ω (U+03A9) as input, the program outputs 3a9. What actually happens is that the terminal sends the UTF-8 encoded character, so the program reads two bytes, 0xCE 0xA9, then converts them to UTF-32 and stores the result in the wide character. You may convert the wide character from the wide character encoding (UTF-32) back to the multibyte character encoding (UTF-8 should be the default, but it depends on the locale) and print the bytes that represent the character in the multibyte character encoding:
char tmp[MB_CUR_MAX];
int len = wctomb(tmp, ch); // prefer wcrtomb
if (len < 0) { /* handle error */ abort(); }
for (int i = 0; i < len; ++i) {
wprintf(L"%hhx", (unsigned char)tmp[i]);
}
wprintf(L"\n");
That will output cea9 on my platform.
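For reference, here is a minimal self-contained sketch using wcrtomb, the restartable function suggested in the comment above; under a UTF-8 locale it should likewise print cea9 for Ω (the zero-initialized mbstate_t is an assumption of this sketch):
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void) {
    setlocale(LC_ALL, "");
    wchar_t ch;
    if (wscanf(L"%lc", &ch) != 1) { /* handle error */ abort(); }
    mbstate_t state = {0};                     /* conversion state for wcrtomb */
    char tmp[MB_CUR_MAX];
    size_t len = wcrtomb(tmp, ch, &state);     /* restartable counterpart of wctomb */
    if (len == (size_t)-1) { /* not representable in the current locale */ abort(); }
    for (size_t i = 0; i < len; ++i) {
        wprintf(L"%hhx", (unsigned char)tmp[i]);
    }
    wprintf(L"\n");
    return 0;
}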

Related

Unable to print the character 'à' with the printf function in C

I would like to understand why I can print the character 'à' with fopen and fgetc when I read a .txt file, but I can't assign it to a char variable and print it with printf.
When I read the file texte.txt, the output is:
(Here is a letter that we often use in French: à)
The letter 'à' is correctly read by the fgetc function and assigned to the char variable c.
See the code below:
int main() {
    FILE *fp;
    fp = fopen("texte.txt", "r");
    if (fp == NULL) {
        printf("erreur fopen");
        return 1;
    }
    char c = fgetc(fp);
    while (c != EOF) {
        printf("%c", c);
        c = fgetc(fp);
    }
    printf("\n");
    return 0;
}
But now if I try to assign the 'à' character to a char variable, I get an error!
See the code below:
int main() {
char myChar = 'à';
printf("myChar is: %c\n", myChar);
return 0;
}
ERROR:
./main.c:26:15: error: character too large for enclosing character literal type
char myChar = 'à';
My knowledge in C is very insufficient, and I can't find an answer anywhere
To print à you can use wide character (or wide string):
#include <wchar.h> // wchar_t
#include <stdio.h>
#include <locale.h> // setlocale LC_ALL
int main() {
setlocale(LC_ALL, "");
wchar_t a = L'à';
printf("%lc\n", a);
}
In short: characters have an encoding. The program's "locale" chooses which encoding is used by the standard library functions. A wide character represents a locale-agnostic character, one that can be converted to/from any locale. setlocale sets your program's locale to the locale of your terminal. This is needed so that printf knows how to convert the wide character à to the encoding of your terminal. The L in front of a character or string literal makes it wide. On Linux, wide characters are in UTF-32.
Handling encodings might be hard. I can point to: https://en.wikipedia.org/wiki/Character_encoding , https://en.wikipedia.org/wiki/ASCII , https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html , https://en.cppreference.com/w/cpp/locale , https://en.wikipedia.org/wiki/Unicode .
You can also put a multibyte string straight in your source code and output it. This will work only if your compiler encodes the multibyte string in the same encoding your terminal uses. If you change your terminal encoding, or tell your compiler to use a different encoding, it may fail. On Linux, UTF-8 is used everywhere: compilers generate UTF-8 strings and terminals understand UTF-8.
const char *str = "à";
printf("%s\n", str);

Why can't I print the decimal value of an extended ASCII char like 'Ç' in C?

First, in this C project we have some constraints on how we write code: I can't declare a variable and assign a value to it on the same line, and we are only allowed to use while loops. Also, I'm using Ubuntu for reference.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first char of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
int main (int argc, char **argv)
{
    int i;
    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf ("%i ", argv[1][i]);
            i++;
        }
    }
}
I did some research and found that I should try unsigned char **argv instead of char **argv, like this:
int main (int argc, unsigned char **argv)
{
    int i;
    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf("%i ", argv[1][i]);
            i++;
        }
    }
}
In this case, I run the program with 'Ç' and the output is 195 135 (still wrong).
How can I make this program print the right decimal value of a char from the extended ASCII table? In this case, "Ç" should be 128.
Thank you!!
Your platform is using UTF-8 Encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
In turn those bytes in decimal are 195 and 135 which you see in output.
Remember that UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 through 127).
That character is code point 128 in extended ASCII, but UTF-8 diverges from extended ASCII in that range.
You may find there are tools on your platform to convert to extended ASCII, but I suspect you don't want to do that and should work with the encoding supported by your platform (which I am sure is UTF-8).
It's Unicode code point 199, so unless you have a specific application for extended ASCII you'll probably just make things worse by converting to it, not least because extended ASCII is a much smaller set of characters than Unicode.
Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
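To see those bytes from a program, it is enough to walk the string as unsigned char; here is a minimal sketch, assuming the source file and the terminal both use UTF-8 as described above:
#include <stdio.h>
int main(void)
{
    const unsigned char *p = (const unsigned char *)"Ç";  /* UTF-8: 0xC3 0x87 */
    while (*p != '\0')
    {
        printf("%d ", *p);  /* prints 195 135 on a UTF-8 system */
        p++;
    }
    printf("\n");
    return 0;
}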
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>
int main (int argc, char **argv)
{
    setlocale(LC_CTYPE, "");
    if (argc == 2)
    {
        char *p = argv[1];
        int n;
        wchar_t wc;
        while ((n = mbtowc(&wc, p, strlen(p))) > 0)
        {
            printf ("%lc: %d (%d)\n", wc, wc, n);
            p += n;
        }
    }
}
You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it consumed for that character, or 0 if it encountered the end of the string (and -1 if the bytes do not form a valid multibyte character).
When I run this program on the string abÇd, it prints
a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports potentially multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, literally governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment to select a locale for the program's functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫😊":
a: 97 (1)
b: 98 (1)
Σ: 931 (2)
∫: 8747 (3)
😊: 128522 (4)
Emoji like 😊 typically take four bytes to encode in UTF-8.

What is an encoding error for sprintf that should return -1?

I understand snprintf will return a negative value when "an encoding error occurs"
But what is a simple example of such an "encoding error" that will produce that result?
I'm working with gcc 10.2.0 C compiler, and I've tried malformed format specifiers, unreasonably large numbers for field length, and even null format strings.
Malformed format specifiers just get printed literally
Unreasonably large numbers as length specifiers produce fatal errors
Null format strings also produce fatal errors
This relates to repeatedly doing something like:
length += snprintf(...
to build up a formatted string.
That might be safe if it is certain not to return a negative value.
Advancing the buffer pointer by a negative length could cause it to go out of bounds. But I'm looking for a case where that would actually happen. If there is such a case then the added complexity of this may be warranted:
length += result = snprintf(...
So far I couldn't find a scenario where it would be worth adding complexity to check for a value that may never actually be produced. Maybe you can give a simple example of one.
What is an encoding error for sprintf that should return -1?
On my machine, "%ls" did not like the 0xFFFF - certainly an encoding error.
char buf[42];
wchar_t s[] = { 0xFFFF,49,50,51,0 };
int i = snprintf(buf, sizeof buf, "<%ls>", s);
printf("%d\n", i);
Output
-1
The code below returned -1, though not so much due to an encoding error as to a pathological format string.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
    size_t n = 0xFFFFFFFFLLu + 1;
    char *fmt = malloc(n);
    if (fmt == NULL) {
        puts("OOM");
        return -42;
    }
    memset(fmt, 'x', n);
    fmt[n - 1] = '\0';
    char buf[42];
    int i = snprintf(buf, sizeof buf, fmt);
    printf("%d %x\n", i, (unsigned) i);
    free(fmt);
    return 7;
}
Output
-1 ffffffff
I did get a surprising -1 when passing a too big a size, even though the snprintf() only needed 6 bytes.
char buf[42];
int i = snprintf(buf, 4299195472, "Hello");
printf("%d\n", i);
Output
-1
I was able to come up with a short example returning -1 on a *fprintf() to stdout due to orientation conflict.
#include <wchar.h>
#include <stdio.h>
int main() {
int w = wprintf(L"Hello wide world\n");
wprintf(L"%d\n", w);
int s = printf("Hello world\n");
wprintf(L"%d\n", s);
}
Output
Hello wide world
17
-1
Normally you only expect an error from printf and family when an output error occurs. From the Linux man page:
If an output error is encountered, a negative value is returned.
So if you are outputting to a FILE and an output error of some kind (EPIPE, EIO) occurs, you'll get a negative return value. For s[n]printf, since there's no output, there would never be a negative return value.
The standard talks about the possibility of an "encoding error", but only defines what that means with respect to wide character streams, with a note that byte streams might need to convert to wide streams in some cases.
An encoding error occurs if the character sequence presented to the underlying mbrtowc function does not form a valid (generalized) multibyte character, or if the code value passed to the underlying wcrtomb does not correspond to a valid (generalized) multibyte character. The wide character input/output functions and the byte input/output functions store the value of the macro EILSEQ in errno if and only if an encoding error occurs.
That would seem to imply that you can get an encoding error if you use the %ls or %lc formats to convert a wide string or characters to bytes. Not sure if there are any other cases where it could occur.
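Tying this back to the question's length += snprintf(...) pattern: if %ls or %lc conversions (or an oversized size argument) can appear, checking for a negative return before advancing is cheap. Here is a minimal sketch; the append helper name is made up for illustration, not a standard function:
#include <stdio.h>
#include <stdarg.h>

/* Hypothetical helper: append formatted text to buf[cap], tracking the used
   length in *len and refusing to advance on a negative return. */
static int append(char *buf, size_t cap, size_t *len, const char *fmt, ...)
{
    if (*len >= cap)
        return -1;                    /* buffer already full */
    va_list ap;
    va_start(ap, fmt);
    int result = vsnprintf(buf + *len, cap - *len, fmt, ap);
    va_end(ap);
    if (result < 0)
        return -1;                    /* encoding error (e.g. EILSEQ from %ls) or other failure */
    *len += (size_t)result;           /* a value >= the space left means the output was truncated */
    return 0;
}

int main(void)
{
    char buf[64];
    size_t len = 0;
    if (append(buf, sizeof buf, &len, "x=%d, ", 42) != 0 ||
        append(buf, sizeof buf, &len, "s=%s", "hello") != 0)
    {
        fputs("formatting error\n", stderr);
        return 1;
    }
    printf("%s (length %zu)\n", buf, len);   /* x=42, s=hello (length 13) */
    return 0;
}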

printing the char value of each wide character's bytes

when running the following:
char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}
I get:
acute_accent[0]:
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r
which makes me think that the multibyte character é is 2 bytes wide.
However, when running this (after ignoring the compiler warning about a multi-character character constant):
printf("size: %lu",sizeof('é'));
I get size: 4.
What's the reason for the different sizes?
EDIT: This question differs from this one because it is more about multibyte characters encoding, the different UTFs and their sizes, than the mere understanding of a size of a char.
The reason you're seeing a discrepancy is that in your first example, the character é was encoded by the compiler as the two-byte UTF-8 sequence 0xC3 0xA9.
See here:
http://www.fileformat.info/info/unicode/char/e9/index.htm
And as described by dbush, the character constant 'é' has type int; therefore it is represented as four bytes.
Part of your confusion stems from relying on an implementation-defined feature: storing Unicode text without specifying its encoding.
To avoid that, you should always clearly identify the encoding of your string literals.
For example:
char acute_accent[7] = u8"éclair";
This is still bad form, because unless you count it out yourself you can't know the exact length of the string. And indeed, my compiler (g++) is yelling at me because, while the string is 7 bytes, it's 8 bytes total with the null character at the end. So you have actually overrun the buffer.
It's much safer to use this instead:
const char* acute_accent = u8"éclair";
Notice how your string is actually 8-bytes:
#include <stdio.h>
#include <string.h> // strlen
int main() {
const char* a = u8"éclair";
printf("String length : %lu\n", strlen(a));
// Add +1 for the null byte
printf("String size : %lu\n", strlen(a) + 1);
return 0;
}
The output is:
String length : 7
String size : 8
Also note that the size of a character constant is different between C and C++!!
#include <stdio.h>
int main() {
printf("%lu\n", sizeof('a'));
printf("%lu\n", sizeof('é'));
return 0;
}
In C the output is:
4
4
While in C++ the output is:
1
4
From the C99 standard, section 6.4.4.4:
2 An integer character constant is a sequence of one or more multibyte
characters enclosed in single-quotes, as in 'x'.
...
10 An integer character constant has type int.
sizeof(int) on your machine is probably 4, which is why you're getting that result.
So 'é', 'c', and 'l' are all integer character constants, and all have type int, whose size is 4. The fact that some are multibyte and some are not doesn't matter in this regard.
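A quick way to see what value such a multibyte character constant actually carries (the value is implementation-defined; with GCC on a UTF-8 system it is typically the two UTF-8 bytes packed into one int):
#include <stdio.h>
int main(void) {
    printf("%x\n", (unsigned int)'é');  /* implementation-defined; typically c3a9 with GCC */
    printf("%x\n", (unsigned int)'a');  /* 61 */
    return 0;
}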

C: get Unicode code point for character

How can I get the Unicode code point for a character? Here is what I have tried, but it is not printing the same character. Am I properly understanding how Unicode works?
How can I get the value of a Unicode character?
#include <stdio.h>
int main()
{
    char *a = "ā";
    int n;
    while(a[n] != '\0')
    {
        printf("%x", a[n]);
        n+=1;
    }
    printf("\n \uC481");
    return 0;
}
In the first place, there are a few corrections needed in your code.
#include <stdio.h>
int main()
{
    char *a = "ā";
    int n = 0; // Initialize n with zero.
    while(a[n] != '\0')
    {
        printf("%x", (unsigned char)a[n]); // cast avoids sign extension for bytes above 0x7F
        n+=1;
    }
    // \uC481 in a string literal names a code point, not these bytes; print the hex value instead
    printf("\n %X\n", 0xC481);
    return 0;
}
Here, you are printing the hex value of each byte. For characters beyond 0xFF, this is not the Unicode value.
unsigned short is the most common type used to store Unicode values, although it cannot store all the code points. If you need to store every Unicode code point as is, use a 32-bit integer type.
The Unicode value of a character is its numeric value when the character is represented in UTF-32. Otherwise, you have to compute it from the byte sequence if the encoding is UTF-8 or UTF-16.
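A minimal sketch of that computation using mbtowc, assuming a UTF-8 locale taken from the environment (as in the answers above):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");
    const char *a = "ā";                      /* U+0101; UTF-8 bytes 0xC4 0x81 */
    wchar_t wc;
    int n = mbtowc(&wc, a, strlen(a));        /* decode one multibyte character */
    if (n > 0)
        printf("U+%04X (%d bytes)\n", (unsigned int)wc, n);
    return 0;
}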
