ASCII, ISO 8859-1, Unicode in C: how does it work?

Well, I'm really confused about how C works with encodings. I have a C source file, test1.c, saved with ISO 8859-1 encoding, with the content below. When running the program, the character ÿ is not displayed correctly on the Linux console. I know the console uses UTF-8 by default, but if Unicode uses the same first 256 characters as ISO 8859-1, why doesn't the program display the 'ÿ' character correctly? Another question: why does test2 correctly display the 'ÿ' character, when the test2.c file is UTF-8 and file.txt is also UTF-8? In other words, shouldn't the compiler complain about the character being wider than one byte?
test1.c
// ISO 8859-1
#include <stdio.h>
int main(void)
{
    unsigned char c = 'ÿ';
    putchar(c);
    return 0;
}
$ gcc -o test1 test1.c
$ ./test1
$ ▒
test2.c
// ASCII
#include <stdio.h>
int main(void)
{
    FILE *fp = fopen("file.txt", "r+");
    int c;
    while ((c = fgetc(fp)) != EOF)
        putchar(c);
    return 0;
}
file.txt: UTF-8
abcdefÿghi
$ gcc -o test2 test2.c
$ ./test2
$ abcdefÿghi
Well, that's it. If you can help me by giving some details about this, I would be very grateful. :)

Character encodings can be confusing for many reasons. Here are some explanations:
In the ISO 8859-1 encoding, the character y with a diaeresis ÿ (originally a ligature of i and j) is encoded as a byte value of 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.
When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.
Adding to confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non portable (the value can be 0xC3BF or 0xBFC3 depending on the system), using them is strongly discouraged and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).
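For illustration, a minimal sketch of what this looks like when the source file is saved as UTF-8 (the printed value is what gcc typically produces; it is implementation-defined):
#include <stdio.h>

int main(void)
{
    int mc = 'ÿ';                  // a multi-character constant when the source is UTF-8; gcc warns about it
    printf("%X\n", (unsigned)mc);  // typically prints C3BF with gcc, but this is not portable
    return 0;
}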
Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.
On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX or the special negative value EOF, so the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF (255), which compares unequal to 'ÿ' if char is signed, and also unequal to 'ÿ' if the source code is in UTF-8.
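A minimal sketch of this pitfall, assuming an ISO 8859-1 encoded input file named latin1.txt (a hypothetical name):
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("latin1.txt", "r");
    if (fp == NULL)
        return 1;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == 0xFF)    // reliable: fgetc returns the byte as an unsigned char value
            puts("found y with diaeresis");
        if (c == '\xff')  // unreliable: equals -1 (and EOF) when char is signed
            puts("never reached when char is signed");
    }
    fclose(fp);
    return 0;
}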
As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents and configure the compiler to make char unsigned by default (-funsigned-char).
If you deal with foreign languages, using UTF-8 is highly recommended for all textual contents, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding, it is quite simple and elegant, and use libraries to handle textual transformations such as uppercasing.
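As a small hedged sketch of locale-aware case mapping using only the standard wide-character API (this assumes a UTF-8 locale is available and the source file is saved as UTF-8; real projects often use a dedicated Unicode library instead):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");       // pick up the environment's (UTF-8) locale
    wint_t up = towupper(L'ÿ');  // U+00FF maps to U+0178 in a Unicode-aware locale
    wprintf(L"%lc\n", up);       // prints Ÿ on a UTF-8 terminal
    return 0;
}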

The issue here is that unsigned char represents an unsigned integer of size 8 bits (from 0 to 255). On most platforms, C uses ASCII values to represent characters. An ASCII character is simply an integer from 0 to 127. For example, A is 65.
When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character; it is an "extended" character whose value depends on the encoding (255 in ISO 8859-1, 152 in code page 437). Technically it fits inside an unsigned char, but the single byte you get is only meaningful in the encoding of the source file, not in the UTF-8 encoding your terminal expects.
So that's why the first example didn't work.
Now for the second one. A non-ASCII character cannot fit into a single char in UTF-8. The way you handle characters outside the limited ASCII set is by using several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. If you are using the UTF-8 representation, this means that in your file you have two 8-bit values, 0xC3 and 0xBF.
When you read your file in the while loop of test2.c, at some point c will take the value 0xC3, and then 0xBF on the next iteration. These two values are passed to putchar one after the other.
When putchar finally writes the bytes, they eventually are read by your terminal application. If it supports UTF-8 encoding, it understands the sequence 0xC3 followed by 0xBF and displays a ÿ.
So the reason why, in the first example, you didn't see ÿ is that the value of c is a single byte (0xFF with an ISO 8859-1 source file), and a lone byte in the 0x80-0xFF range is not a valid UTF-8 sequence, so the terminal cannot display it as a character.
A more concrete example:
#include <stdio.h>

int main(void)
{
    char y[3] = { 0xC3, 0xBF, '\0' };
    printf("%s\n", y);
}
This will display ÿ but as you can see, it takes 2 chars to do that.

"if utf-8 uses the same 256 characters as ISO 8859-1": no, there is a confusion here. In ISO 8859-1 (aka Latin-1) the 256 characters do have the code point value of the corresponding Unicode character. But UTF-8 has a special encoding for all characters above 0x7F, and all characters with a code point between 0x80 and 0xFF are represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xE9 in ISO 8859-1, but as the 2 bytes 0xC3 0xA9 in UTF-8.
More references on the wikipedia page.
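A small sketch illustrating the difference, using the byte values from the answer above:
#include <stdio.h>

int main(void)
{
    unsigned char latin1[] = { 0xE9 };        // 'é' in ISO 8859-1: one byte
    unsigned char utf8[]   = { 0xC3, 0xA9 };  // 'é' in UTF-8: two bytes
    printf("ISO 8859-1 uses %zu byte, UTF-8 uses %zu bytes\n",
           sizeof latin1, sizeof utf8);
    return 0;
}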

It's hard to reproduce on MacOS with clang:
$ gcc -o test1 test1.c
test1.c:6:23: warning: illegal character encoding in character literal [-Winvalid-source-encoding]
unsigned char c = '<FF>';
^
1 warning generated.
$ ./test1
?
$ gcc -finput-charset=iso-8859-1 -o test1 test1.c
clang: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'
clang on MacOS has UTF-8 as default.
Encoded in UTF-8:
$ gcc -o test1 test1.c
test1.c:6:23: error: character too large for enclosing character literal type
unsigned char c = 'ÿ';
^
1 error generated.
Working through all the warnings and errors, we arrive at a solution using a string literal, which is an array of bytes:
// UTF-8
#include <stdio.h>
// needed for strlen
#include <string.h>

int main(void)
{
    char c[] = "ÿ";
    int len = strlen(c);
    printf("len: %d c[0]: %u \n", len, (unsigned char)c[0]);
    putchar(c[0]);
    return 0;
}
$ ./test1
len: 2 c[0]: 195
?
Decimal 195 is hexadecimal C3, which is exactly the first byte of the UTF-8 byte sequence of the character ÿ:
$ uni identify ÿ
cpoint dec utf-8 html name
'ÿ' U+00FF 255 c3 bf ÿ LATIN SMALL LETTER Y WITH DIAERESIS (Lowercase_Letter)
^^ <-- HERE
Now we know that we must output 2 bytes:
char c[] = "ÿ";
int len = strlen(c);
for (int i = 0; i < len; i++) {
    putchar(c[i]);
}
printf("\n");
$ ./test1
ÿ
Program test2.c just reads bytes and outputs them. If the input is UTF-8 then the output is also UTF-8. This just keeps the encoding.
To convert Latin-1 to UTF-8 we need to pack the bits in a special way. For a two-byte UTF-8 sequence we need a leading byte of the form 110x xxxx (the number of leading 1 bits gives the length of the sequence in bytes) and a continuation byte of the form 10xx xxxx.
Now we can write:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t latin1 = 255;  // code point of 'ÿ' U+00FF 255
    uint8_t byte1 = 0b11000000 | ((latin1 & 0b11000000) >> 6);
    uint8_t byte2 = 0b10000000 | (latin1 & 0b00111111);
    putchar(byte1);
    putchar(byte2);
    printf("\n");
    return 0;
}
$ ./test1
ÿ
This works only for ISO-8859-1 ("true" Latin-1). Many files called "Latin-1" are encoded in Windows/Microsoft CP1252.
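A hedged generalization of the snippet above: a small helper (the name put_latin1_as_utf8 is illustrative) that converts any single ISO 8859-1 byte to UTF-8 and writes it to stdout, with ASCII bytes passing through unchanged:
#include <stdio.h>
#include <stdint.h>

static void put_latin1_as_utf8(uint8_t b)
{
    if (b < 0x80) {               // ASCII range: identical in UTF-8
        putchar(b);
    } else {                      // 0x80..0xFF: two-byte UTF-8 sequence
        putchar(0xC0 | (b >> 6));
        putchar(0x80 | (b & 0x3F));
    }
}

int main(void)
{
    const uint8_t latin1[] = { 'a', 0xE9, 0xFF, '\n' };  // "aéÿ\n" in ISO 8859-1
    for (size_t i = 0; i < sizeof latin1; i++)
        put_latin1_as_utf8(latin1[i]);
    return 0;
}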

Related

What actually is the type of C `char **argv` on Windows

From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits as defined in <limits.h>.
If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.
Test code
Given the following C code:
#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hexa value, and its size
        printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
               length,
               argv[1][length],
               argv[1][length],
               sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %u\n", length);
    printf("Actual char size: %u\n", CHAR_BIT);
    return 0;
}
I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à.
Those chars are supposedly UTF-8, so written as multiple bytes each. I would expect them to get processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0) and when printing, I'd get one line per byte, and so 3 lines instead of 2 (which would be the actual latin character count).
Output
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
Question
What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why lines 5 and 6 have a hexadecimal value written on 4 bytes.
Is that what actually happens?
Is it standard behaviour?
Why are chars 5 and 6 not what was given as input?
CHAR_BIT == 8, sizeof(achar) == 1, and yet somechar == 0xFFFFFFE7. This seems counter-intuitive. What's happening?
Environment
Windows 10
Terminal: Alacritty and Windows default cmd (tried in both just in case)
GCC under Mingw-w64
No, it's not received as an array of int.
But it's not far from the truth: printf is indeed receiving the char as an int.
When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. Both %u and %X expect an unsigned int, so the int with a value of -25 is treated as an unsigned int, printing 0xFFFFFFE7.
A simple fix:
printf("%X\n", (unsigned char)c); // 74 65 73 74 5F E7 E0
But why did you get E7 and E0 in the first place?
Each Windows system call that deals with text has two versions:
An "ANSI" (A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
And a Wide (W) version that deals with text encoded using UTF-16le.
The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.
GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.
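A minimal sketch of that approach (untested here; CommandLineToArgvW is declared in shellapi.h and with MinGW you may need to link against Shell32):
#include <windows.h>
#include <shellapi.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    int argc;
    LPWSTR *argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
        return 1;
    for (int i = 0; i < argc; i++)
        wprintf(L"arg %d: %ls\n", i, argv[i]);  // arguments as UTF-16, no ACP conversion loss
    LocalFree(argv);
    return 0;
}
Note that getting wide strings to display correctly on the console can require extra setup (console code page or stream mode), which is a separate concern from obtaining the arguments.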
Finally, why did E7 and E0 show as τ and α?
The terminal's encoding is different than the ACP! On your machine, it appears to be 437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.
Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)
You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.
[1] It's up to the compiler whether char is a signed or unsigned type.
[2] This can be changed on a per-program basis since Windows 10, Version 1903 (May 2019 Update).
From your code and the output on your system, it appears that:
type char has indeed 8 bits. Its size is 1 by definition. char **argv is a pointer to an array of pointers to C strings, null terminated arrays of char (8-bit bytes).
the char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
the 2 non-ASCII characters çà are encoded as the single bytes E7 and E0, which correspond to Microsoft's proprietary code page Windows-1252, not UTF-8 as you assumed.
The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.
Microsoft made historic decisions regarding the encoding of non ASCII characters that seem obsolete by today's standards, it is a shame they seem stuck with cumbersome choices other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.
UTF-8 was invented in September of 1992 by Unix pioneers Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.

unsigned char in C not working as expected

Since unsigned char represents 0 to 255 and the extended ASCII code for 'à' is 133, I expected the following C code to print 133:
unsigned char uc;
uc='à';
printf("%hhu \n",uc);
Instead, both clang and gcc produce the following error
error: character too large for enclosing character literal type
uc='à';
^
What went wrong?
By the way, I copied à from a French-language website and pasted it into the assignment statement. I suspect that the way I created à may not be valid.
Since unsigned char represents 0 - 255
This is true in most implementations, but the C standard does not require that a char be limited to 8 bits; it can be larger and support a larger range.
and the extended ascii code for 'à' is 133,
There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' probably uses the code point 224 (0xE0), which is most likely stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, all ASCII characters are stored as they are in ASCII and need 1 byte; all other characters are a combination of 2 to 4 bytes, and bit 7 is set in every one of those bytes. I suggest you learn how UTF-8 works; it is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.
I expected the following C code to print 133
In UTF-8, the code point for à is stored as the bytes 0xC3 0xA0, which decode to the value 0xE0. You can't store 0xC3 0xA0 in an 8-bit char, so clang reports an error.
You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough, but GCC would store the value 0xC3A0 and not 0xE0, because that is the byte sequence inside the ''. However, C supports wide characters: the type wchar_t, which is most likely 32 or 16 bits on your system, can represent more characters. To write a wide character literal, use the prefix L. With a wide character literal, the compiler stores the correct value of 0xE0.
Change the code to:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t wc;
    wc = L'à';
    printf("%u \n", (unsigned)wc);  // prints 224 (0xE0)
    return 0;
}

itoa providing 7-bit output to character input

I am trying to convert a character to its binary representation using the built-in library function itoa in C (gcc 5.1), following the example from Conversion of Char to Binary in C, but I'm getting a 7-bit output for a character input to the itoa function. Since a character in C is essentially an 8-bit integer, I expected an 8-bit output. Can anyone explain why this is?
code performing binary conversion:
for (temp = pt; *temp; temp++)
{
    itoa(*temp, opt, 2);  // convert to binary
    printf("%s \n", opt);
    strcat(s1, opt);      // add to encrypted text
}
PS:- This is my first question in stackoverflow.com, so sorry for any mistakes in advance.
You could use printf( "%02X\n", *opt ); to print an 8-bit value as 2 hexadecimal symbols.
It would print the first char of opt. Then, you must increment the pointer to the next char with opt++;.
The X means you want it printed as uppercase hexadecimal characters (use x for lowercase) and the 02 makes sure 2 symbols are printed, padded with a leading zero, even if *opt is less than 0x10.
In other words, the value 0xF will be printed as 0F (you could use printf( "0x%02X\n", *opt ); if you want the 0x prefix).
If you absolutely want a binary value you have to make a function that will print the right 0 and 1. There are many of them on the internet. If you want to make yours, reading about bitwise operations could help you (you have to know about bitwise operations if you want to work with binaries anyways).
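For example, a minimal sketch of such a function (the name print_byte_binary is illustrative):
#include <stdio.h>

// Print the 8 bits of a byte, most significant bit first.
void print_byte_binary(unsigned char byte)
{
    for (int bit = 7; bit >= 0; bit--)
        putchar(((byte >> bit) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_byte_binary('A');  // prints 01000001
    return 0;
}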
Now that you can print it as desired (hex as above or with your own binary function), you can redirect the output of printf with the sprintf function.
Its prototype is int sprintf( char* str, const char* format, ... ). str is the destination.
In your case, you will just need to replace the strcat line with something like sprintf( s1, "%02X\n", *opt); or sprintf( s1, "%s\n", your_binary_conversion_function(opt) );.
Note that using the former, you would have to increment s1 by 2 for each char in opt because one 8-bit value is 2 hexadecimal symbols.
You might also have to manage s1's memory by yourself, if it was not the case before.
Sources :
MK27's second answer
sprintf prototype
The function itoa takes an int argument for the value to be converted. If you pass it a value of char type it will be promoted to int. So how would the function know how many leading zeros you were expecting? And then if you had asked for radix 10 how many leading zeros would you expect? Actually, it suppresses leading zeros.
See an ASCII table, and note the hex column: for ASCII characters the msb is 0. Printable ASCII characters range from 0x20 through 0x7E (0x7F is the non-printing DEL). Unicode shares the code points 0x00 through 0x7F with ASCII.
Hex 20 through 7F are binary 00100000 through 01111111.
Not all binary values are printable characters and in some encodings are not legal values.
ASCII, hexadecimal, octal and binary are just ways of representing binary values. Printable characters are another way, but not all binary values can be displayed; this is the main reason why data that needs to be displayed or treated as character text is generally converted to hex-ASCII or Base64.

String to ASCII code conversion and tie them

The code is C, compiled with gcc.
How can I append a string and a character value, as in the following example?
unsigned char MyString [] = {"LOREM IPSUM" + 0x28 + "DOLOR"};
unsigned char MyString [] = {"LOREM IPSUM\050DOLOR"};
The \050 is an octal escape sequence, with 050 == 0x28. The language standard also provides hex escape sequences, but "LOREM IPSUM\x28DOLOR" would be interpreted as a three-digit hex (\x28D), the meaning of which (since it would be overflowing the usual 8-bit char) would be implementation-defined. Octal escapes always end after three digits, which makes them safer to use.
And while we're at it, there is no guarantee whatsoever that your escapes would be considered ASCII. There are machines using EBCDIC natively, you know, and compilers defaulting to UTF-8 -- which would get you into trouble as soon as you go beyond 0x7f. ;-)
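For example, a small sketch verifying the octal escape (this assumes an ASCII execution character set, where 0x28 is the '(' character):
#include <stdio.h>

int main(void)
{
    unsigned char MyString[] = "LOREM IPSUM\050DOLOR";
    printf("%s\n", (char *)MyString);           // LOREM IPSUM(DOLOR
    printf("byte 11: 0x%02X\n", MyString[11]);  // 0x28
    return 0;
}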

What does \x6d\xe3\x85 mean?

I don't know what that is; I found it in the OpenSSL source code.
Is it some sort of byte sequence? Basically I just need to convert my char * to that kind of format to pass as a parameter.
It's a byte sequence in hexadecimal. \x6d\xe3\x85 is hex character 6d followed by hex e3, followed by hex 85. The syntax is \xnn where nn is your hex sequence.
If what you read was
char foo[] = "\x6d\xe3\x85";
then that is the same as
char foo[] = { 0x6d, 0xE3, 0x85, 0x00 };
Further, I can tell you that 0x6D is the ASCII code point for 'm', 0xE3 is the ISO 8859-1 code point for 'ã', and 0x85 is the Windows-1252 code point for '…'.
But without knowing more about the context, I can't tell you how to "convert [your] char * to that kind of style as a parameter", except to say that you might not need to do any conversion at all! The \x notation allows you to write string constants containing arbitrary byte sequences into your source code. If you already have an arbitrary byte sequence in a buffer in your program, I can't imagine your needing to back-convert it to \x notation before feeding it to OpenSSL.
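If you do end up needing the reverse direction (rendering a byte buffer as \x escapes, for example to paste into source code), here is a minimal sketch; the function name is illustrative:
#include <stdio.h>
#include <stddef.h>

static void print_as_hex_escapes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        printf("\\x%02x", buf[i]);
    putchar('\n');
}

int main(void)
{
    const unsigned char bytes[] = { 0x6d, 0xe3, 0x85 };
    print_as_hex_escapes(bytes, sizeof bytes);  // prints \x6d\xe3\x85
    return 0;
}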
Try the following code snippet to understand more about hex byte sequences:
#include <stdio.h>

int main(void)
{
    char word[] = "\x48\x65\x6c\x6c\x6f";
    printf("%s\n", word);
    return 0;
}
/*
Output:
$
$ ./a.out
Hello
$
*/
The \x sequence is used to escape byte values in hexadecimal notation. So the sequence you've cited escapes the bytes 6D, E3 and 85, which translate into 109, 227 and 133. While 6D can also be represented as the character m in ASCII, you cannot represent the latter two in ASCII, as it only covers the range 0..127. So for values beyond 127 you need a special way to write them, and \x is such a way.
The other way is to escape as octal numbers using \<number>, for example 109 would be \155.
If you need explicit byte values it's better to use these escape sequences since (AFAIK) the C standard doesn't guarantee that your string will be encoded using ASCII. So when you compile for example on an EBCDIC system your m would be represented as byte value 148 instead of 109.
