unsigned char in C not working as expected

Since unsigned char represents 0 - 255 and the extended ascii code for 'à' is 133, I expected the following C code to print 133
unsigned char uc;
uc='à';
printf("%hhu \n",uc);
Instead, both clang and gcc produce the following error
error: character too large for enclosing character literal type
uc='à';
^
What went wrong?
By the way, I copied à from a French-language website and pasted the result into the assignment statement. I suspect that the way I created à may not be valid.

Since unsigned char represents 0 - 255
This is true in most implementations, but the C standard does not require that a char be limited to 8 bits; it can be larger and support a larger range.
and the extended ascii code for 'à' is 133,
There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' most likely has the code point 224 (0xE0) and is stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, all ASCII characters are stored exactly as in ASCII and need 1 byte; all other characters are a combination of 2-4 bytes, and bit 7 is set in every one of them. I suggest you learn how UTF-8 works; UTF-8 is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.
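As a quick check, here is a minimal sketch (assuming both the source file and the terminal use UTF-8) that prints the bytes the compiler actually stores for the character when it appears in a string literal:
#include <stdio.h>
int main(void)
{
const unsigned char s[] = "à"; // UTF-8 source: two bytes plus the terminating '\0'
for (size_t i = 0; s[i] != '\0'; i++)
printf("0x%02X ", s[i]); // expected output: 0xC3 0xA0
printf("\n");
return 0;
}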
I expected the following C code to print 133
In UTF-8, the code point for à (0xE0) is encoded as the two bytes 0xC3 0xA0. You can't store 0xC3 0xA0 in an 8-bit char, so clang reports an error.
You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough, but GCC would then store the value 0xC3A0 and not 0xE0, because that is the byte sequence inside the ''. However, C supports wide characters: the type wchar_t, which may support more characters and is most likely 32 or 16 bits wide on your system. To write a wide character literal, use the prefix L. With a wide character literal, the compiler stores the correct value 0xE0.
Change the code to:
#include <wchar.h>
....
wchar_t wc;
wc=L'à';
printf("%u \n",(unsigned)wc);

Related

ASCII, ISO 8859-1, Unicode in C how does it work?

Well, I'm really in doubt about how C works with encodings. First I have a C file, saved with ISO 8859-1 encoding, with the content of test1.c below. When running the program, the character ÿ is not displayed correctly on the Linux console. I know that by default the console uses UTF-8, but if UTF-8 uses the same 256 characters as ISO 8859-1, why doesn't the program correctly display the 'ÿ' character? Another question: why does test2 correctly display the 'ÿ' character, where the test2.c file is UTF-8 and file.txt is also UTF-8? In other words, shouldn't the compiler complain about the literal being multi-character?
test1.c
// ISO 8859-1
#include <stdio.h>
int main(void)
{
unsigned char c = 'ÿ';
putchar(c);
return 0;
}
$ gcc -o test1 test1.c
$ ./test1
$ ▒
test2.c
// ASCII
#include <stdio.h>
int main(void)
{
FILE *fp = fopen("file.txt", "r+");
int c;
while((c = fgetc(fp)) != EOF)
putchar(c);
return 0;
}
file.txt: UTF-8
abcdefÿghi
$ gcc -o test2 test2.c
$ ./test2
$ abcdefÿghi
Well, that's it. If you can help me by giving some details about this, I would be very grateful. :)
Character encodings can be confusing for many reasons. Here are some explanations:
In the ISO 8859-1 encoding, the character y with a diaeresis ÿ (originally a ligature of i and j) is encoded as a byte value of 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.
When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.
Adding to confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non-portable (the value can be 0xC3BF or 0xBFC3 depending on the system); using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).
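A small sketch of both points, assuming a UTF-8 encoded source file and a compiler such as gcc that accepts the multibyte constant with a warning:
#include <stdio.h>
int main(void)
{
printf("sizeof \"ÿ\" = %zu\n", sizeof "ÿ"); // 3: two UTF-8 bytes plus the terminating '\0'
printf("'ÿ' = 0x%x\n", (unsigned)'ÿ'); // implementation-defined; typically 0xc3bf with gcc
return 0;
}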
Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.
On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX or the special negative value of EOF, so the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF or 255, which compares different from 'ÿ' if char is signed, and also different from 'ÿ' if the source code is in UTF-8.
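A minimal sketch of that pitfall, assuming the input bytes are ISO 8859-1 (so ÿ arrives as the single byte 0xFF): compare the value returned by getchar() against the byte value, not against a character constant that may be negative.
#include <stdio.h>
int main(void)
{
int c;
while ((c = getchar()) != EOF) {
// c is in the range 0..UCHAR_MAX here
if (c == 0xFF) // true for an ISO 8859-1 ÿ; c == 'ÿ' would fail if char is signed
puts("found y with diaeresis");
}
return 0;
}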
As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents and configure the compiler to make char unsigned by default (-funsigned-char).
If you deal with foreign languages, using UTF-8 is highly recommended for all textual contents, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding, it is quite simple and elegant, and use libraries to handle textual transformations such as uppercasing.
The issue here is that unsigned char represents an unsigned integer of size 8 bits (from 0 to 255). C implementations commonly use ASCII values to represent characters. An ASCII character is simply an integer from 0 to 127. For example, A is 65.
When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character; it is an extended character (152 in code page 437, 255 in ISO 8859-1). Its value could fit inside an unsigned char, but with a UTF-8 source file the literal 'ÿ' consists of more than one byte, which is why the compiler rejects it.
So that's why the first example didn't work.
Now for the second one. A non-ASCII character cannot fit into a single char. The way you can handle characters outside the limited ASCII set is by using several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. If you are using the UTF-8 representation, this means that in your file you have two 8-bit numbers, 0xC3 and 0xBF.
When you read your file in the while loop of test2.c, at some point c will take the value 0xC3 and then 0xBF on the next iteration. These two values will be given to putchar, and when displayed, the two values together will be interpreted as ÿ.
When putchar finally writes the characters, they eventually are read by your terminal application. If it supports the UTF-8 encoding, it can understand the meaning of 0xC3 followed by 0xBF and display a ÿ.
So the reason why, in the first example, you didn't see ÿ is that the value of c in your code is the single byte 0xFF (the ISO 8859-1 encoding of ÿ), which by itself is not a valid UTF-8 sequence, so the terminal cannot display it.
A more concrete example:
#include <stdio.h>
int main(void)
{
char y[3] = { (char)0xC3, (char)0xBF, '\0' }; // the two UTF-8 bytes of ÿ plus the terminator
printf("%s\n", y);
}
This will display ÿ but as you can see, it takes 2 chars to do that.
if utf-8 uses the same 256 characters as ISO 8859-1. No, there is a confusion here. In ISO-8859-1 (aka Latin-1) the 256 characters do indeed have the code point value of the corresponding Unicode character. But UTF-8 has a special encoding for all characters above 0x7f, and all characters with a code point between 0x80 and 0xff are represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xe9 in ISO-8859-1, but as the 2 bytes 0xc3 0xa9 in UTF-8.
More references on the wikipedia page.
It's hard to reproduce on macOS with clang:
$ gcc -o test1 test1.c
test1.c:6:23: warning: illegal character encoding in character literal [-Winvalid-source-encoding]
unsigned char c = '<FF>';
^
1 warning generated.
$ ./test1
?
$ gcc -finput-charset=iso-8859-1 -o test1 test1.c
clang: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'
clang on macOS has UTF-8 as the default.
Encoded in UTF-8:
$ gcc -o test1 test1.c
test1.c:6:23: error: character too large for enclosing character literal type
unsigned char c = 'ÿ';
^
1 error generated.
Working through all the warnings and errors, we arrive at a solution with a proper string literal, i.e. an array of bytes:
// UTF-8
#include <stdio.h>
// needed for strlen
#include <string.h>
int main(void)
{
char c[] = "ÿ";
size_t len = strlen(c);
printf("len: %zu c[0]: %u \n", len, (unsigned char)c[0]);
putchar(c[0]);
return 0;
}
$ ./test1
len: 2 c[0]: 195
?
Decimal 195 is hexadecimal C3, which is exactly the first byte of the UTF-8 byte sequence of the character ÿ:
$ uni identify ÿ
cpoint dec utf-8 html name
'ÿ' U+00FF 255 c3 bf ÿ LATIN SMALL LETTER Y WITH DIAERESIS (Lowercase_Letter)
^^ <-- HERE
Now we know that we must output both bytes:
char c[] = "ÿ";
size_t len = strlen(c);
for (size_t i = 0; i < len; i++) {
putchar(c[i]);
}
printf("\n");
$ ./test1
ÿ
Program test2.c just reads bytes and outputs them. If the input is UTF-8 then the output is also UTF-8. This just keeps the encoding.
To convert Latin-1 to UTF-8 we need to pack the bits in a special way. For two bytes of UTF-8 we need a leading byte 110x xxxx (the number of leading 1 bits gives the length of the sequence in bytes) and a continuation byte 10xx xxxx.
We can code now:
#include <stdio.h>
#include <stdint.h>
int main(void)
{
uint8_t latin1 = 255; // code point of 'ÿ' U+00FF 255
uint8_t byte1 = 0b11000000 | ((latin1 & 0b11000000) >> 6); // leading byte 110xxxxx carrying the top 2 bits
uint8_t byte2 = 0b10000000 | (latin1 & 0b00111111); // continuation byte 10xxxxxx carrying the low 6 bits
putchar(byte1);
putchar(byte2);
printf("\n");
return 0;
}
$ ./test1
ÿ
This works only for ISO-8859-1 ("true" Latin-1). Many files called "Latin-1" are encoded in Windows/Microsoft CP1252.
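A slightly more general sketch along the same lines (an extension of the idea above, not part of the original answer): bytes below 0x80 are already valid UTF-8 and can be passed through unchanged; only the bytes 0x80 to 0xFF need the 2-byte packing. The helper name latin1_to_utf8 is made up for illustration.
#include <stdio.h>
#include <stdint.h>
// hypothetical helper: write one ISO 8859-1 byte to stdout as UTF-8
static void latin1_to_utf8(uint8_t b)
{
if (b < 0x80) {
putchar(b); // ASCII range: identical in UTF-8
} else {
putchar(0xC0 | (b >> 6)); // leading byte 110xxxxx carrying the top 2 bits
putchar(0x80 | (b & 0x3F)); // continuation byte 10xxxxxx carrying the low 6 bits
}
}
int main(void)
{
uint8_t latin1[] = { 'a', 0xFF, 'b' }; // "aÿb" in ISO 8859-1
for (size_t i = 0; i < sizeof latin1; i++)
latin1_to_utf8(latin1[i]);
printf("\n");
return 0;
}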

Internals binary saving of C chars

Hey, I stumbled upon something pretty weird while programming. I tried to transform a UTF-8 char into a hexadecimal byte representation like 0x89 or 0xff.
char test[3] = "ü";
for (int x = 0; x < 3; x++){
printf("%x\n",test[x]);
}
And I get the following output:
ffffffc3
ffffffbc
0
I know that C uses one byte of data for every char, and therefore if I want to store a weird char like "ü" it counts as 2 chars.
Transforming ASCII chars is no problem, but once I get to non-ASCII chars (from German to Chinese), instead of getting outputs like 0xc3 and 0xbc, C adds 0xFFFFFF00 to them.
I know that I can just do something like & 0xFF and fix that weird representation, but I can't wrap my head around why that keeps happening in the first place.
C allows type char to behave either as a signed type or as an unsigned type, as the C implementation chooses. You are observing the effect of it being a signed type, which is pretty common. When the char value of test[x] is passed to printf, it is promoted to type int, in value-preserving manner. When the value is negative, that involves sign-extension, whose effect is exactly what you describe. To avoid that, add an explicit cast to unsigned char:
printf("%x\n", (unsigned char) test[x]);
Note also that C itself does not require any particular characters outside the 7-bit ASCII range to be supported in source code, and it does not specify the execution-time encoding with which ordinary string contents are encoded. It is not safe to assume UTF-8 will be the execution character set, nor to assume that all compilers will accept UTF-8 source code, or will default to assuming that encoding even if they do support it.
The encoding of source code is a matter you need to sort out with your implementation, but if you are using at least C11 then you can ensure execution-time UTF-8 encoding for specific string literals by using UTF-8 literals, which are prefixed with u8:
char test[3] = u8"ü";
Be aware also that UTF-8 code sequences can be up to four bytes long, and most of the characters in the basic multilingual plane require 3. The safest way to declare your array, then, would be to let the compiler figure out the needed size:
// better
char test[] = u8"ü";
... and then to use sizeof to determine the size chosen:
for (size_t x = 0; x < sizeof(test); x++) {
// ...
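Putting those pieces together, a minimal self-contained sketch (assuming a C11 compiler and the u8 prefix described above):
#include <stdio.h>
int main(void)
{
char test[] = u8"ü"; // the compiler sizes the array; UTF-8 encoding is guaranteed by C11
for (size_t x = 0; x < sizeof(test) - 1; x++) // - 1 skips the terminating '\0'
printf("%x\n", (unsigned char)test[x]); // the cast prevents sign extension: prints c3 then bc
return 0;
}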

String to ASCII code conversion and tie them

The code is C, compiled with gcc.
How can I append a string and a char, as in the following example?
unsigned char MyString [] = {"LOREM IPSUM" + 0x28 + "DOLOR"};
unsigned char MyString [] = {"LOREM IPSUM\050DOLOR"};
The \050 is an octal escape sequence, with 050 == 0x28. The language standard also provides hex escape sequences, but "LOREM IPSUM\x28DOLOR" would be interpreted as containing a three-digit hex escape (\x28D), which overflows the usual 8-bit char; the result is at best implementation-defined, and compilers typically reject it as out of range. Octal escapes always end after three digits, which makes them safer to use.
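If you prefer the hex escape anyway, one standard-C workaround (a sketch, not from the original answer) is to split the literal so that adjacent string literal concatenation terminates the escape before the D; escape sequences are processed before adjacent literals are merged:
unsigned char MyString [] = {"LOREM IPSUM\x28" "DOLOR"};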
And while we're at it, there is no guarantee whatsoever that your escapes would be considered ASCII. There are machines using EBCDIC natively, you know, and compilers defaulting to UTF-8 -- which would get you into trouble as soon as you go beyond 0x7f. ;-)

Special char Literals

I want to assign a char with a char literal, but it's a special character, say 255 or 13. I know that I can assign my char with a literal int that will be cast to a char: char a = 13; I also know that Microsoft will let me use the hex code as a char literal: char a = '\xd'
I want to know if there's a way to do this that gcc supports also.
Writing something like
char ch = 13;
is mostly portable, to platforms on which the value 13 means the same thing as on your platform (which is all systems that use the ASCII character set, which indeed is most systems today).
There may be platforms on which 13 can mean something else. However, using '\r' instead should always be portable, no matter the character encoding system.
Using other values, which do not have character literal equivalents, is not portable. And using values above 127 is even less portable, since then you're outside the ASCII table and into the extended ASCII range, in which the characters can depend on the locale settings of the system. For example, western European and eastern European language settings will most likely have different characters in the 128 to 255 range.
If you want to use a byte which can contain just some binary data and not letters, instead of using char you might be wanting to use e.g. uint8_t, to tell other readers of your code that you're not using the variable for letters but for binary data.
The hexadecimal escape sequence is not specific to Microsoft. It's part of C/C++: http://en.cppreference.com/w/cpp/language/escape
Meaning that to assign a hexadecimal number to a char, this is cross-platform code:
char a = '\xD';
The question already demonstrates assigning a decimal number to a char:
char a = 13;
And octal numbers can be assigned as well, using just the escape sequence:
char a = '\015';
Incidentally, '\0' is commonly used in C/C++ to represent the null character (independent of platform). '\0' is not a special character of its own that gets escaped; it is actually the octal escape sequence with the value zero.

What does \x6d\xe3\x85 mean?

I don't know what that is; I found it in the OpenSSL source code.
Are those some sort of byte sequence? Basically I just need to convert my char * to that kind of style as a parameter.
It's a byte sequence in hexadecimal. \x6d\xe3\x85 is hex character 6d followed by hex e3, followed by hex 85. The syntax is \xnn where nn is your hex sequence.
If what you read was
char foo[] = "\x6d\xe3\x85";
then that is the same as
char foo[] = { 0x6d, 0xE3, 0x85, 0x00 };
Further, I can tell you that 0x6D is the ASCII code point for 'm', 0xE3 is the ISO 8859-1 code point for 'ã', and 0x85 is the Windows-1252 code point for '…'.
But without knowing more about the context, I can't tell you how to "convert [your] char * to that kind of style as a parameter", except to say that you might not need to do any conversion at all! The \x notation allows you to write string constants containing arbitrary byte sequences into your source code. If you already have an arbitrary byte sequence in a buffer in your program, I can't imagine your needing to back-convert it to \x notation before feeding it to OpenSSL.
Try the following code snippet to understand more about hex byte sequences:
#include <stdio.h>
int main(void)
{
char word[]="\x48\x65\x6c\x6c\x6f";
printf("%s\n", word);
return 0;
}
/*
Output:
$
$ ./a.out
Hello
$
*/
The \x sequence is used to escape byte values in hexadecimal notation. So the sequence you've cited escapes the bytes 6D, E3 and 85, which translate into 109, 227 and 133. While 6D can also be represented as the character m in ASCII, you cannot represent the latter two in ASCII as it only covers the range 0..127. So for values beyond 127 you need a special way to write them, and \x is such a way.
The other way is to escape as octal numbers using \<number>, for example 109 would be \155.
If you need explicit byte values it's better to use these escape sequences since (AFAIK) the C standard doesn't guarantee that your string will be encoded using ASCII. So when you compile for example on an EBCDIC system your m would be represented as byte value 148 instead of 109.
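And if you ever need to go the other way, i.e. render the bytes of an existing buffer in that \x notation (for pasting into source code, say), here is a minimal sketch using the bytes from the question:
#include <stdio.h>
int main(void)
{
const unsigned char buf[] = { 0x6D, 0xE3, 0x85 };
printf("\"");
for (size_t i = 0; i < sizeof buf; i++)
printf("\\x%02x", buf[i]); // print each byte as a \xNN escape
printf("\"\n"); // output: "\x6d\xe3\x85"
return 0;
}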
