What does \x6d\xe3\x85 mean? - c

I don't know what this is; I found it in the OpenSSL source code.
Is it some sort of byte sequence? Basically, I just need to convert my char * to that kind of style to pass it as a parameter.

It's a byte sequence written in hexadecimal: \x6d\xe3\x85 is the hex byte 6d, followed by hex e3, followed by hex 85. The syntax is \xnn, where nn are hex digits.

If what you read was
char foo[] = "\x6d\xe3\x85";
then that is the same as
char foo[] = { 0x6d, 0xE3, 0x85, 0x00 };
Further, I can tell you that 0x6D is the ASCII code point for 'm', 0xE3 is the ISO 8859-1 code point for 'ã', and 0x85 is the Windows-1252 code point for '…'.
But without knowing more about the context, I can't tell you how to "convert [your] char * to that kind of style as a parameter", except to say that you might not need to do any conversion at all! The \x notation allows you to write string constants containing arbitrary byte sequences into your source code. If you already have an arbitrary byte sequence in a buffer in your program, I can't imagine your needing to back-convert it to \x notation before feeding it to OpenSSL.
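To illustrate that point, here is a minimal sketch (an illustration only, not anything from the OpenSSL code): a literal written with \x escapes and a buffer filled at run time hold exactly the same bytes, so either can be passed wherever a char * is expected.
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same three bytes, written once as a \x literal and once
       built at run time in an ordinary buffer. */
    const char literal[] = "\x6d\xe3\x85";

    char buffer[4];
    buffer[0] = 0x6d;
    buffer[1] = (char)0xe3;
    buffer[2] = (char)0x85;
    buffer[3] = '\0';

    /* They compare equal: any function taking a char * sees identical data. */
    printf("%s\n", memcmp(literal, buffer, 4) == 0 ? "same bytes" : "different");
    return 0;
}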

Try the following code snippet to get a better feel for hex byte sequences:
#include <stdio.h>

int main(void)
{
    char word[] = "\x48\x65\x6c\x6c\x6f";
    printf("%s\n", word);
    return 0;
}
/*
Output:
$
$ ./a.out
Hello
$
*/

The \x sequence is used to escape byte values in hexadecimal notation. So the sequence you've cited escapes the bytes 6D, E3 and 85, which translate into 109, 227 and 133. While 6D can also be represented as the character m in ASCII, you cannot represent the latter two in ASCII as it only covers the range 0..127. So for values beyond 127 you need a special way to write them, and \x is such a way.
The other way is to escape as octal numbers using \<number>, for example 109 would be \155.
If you need explicit byte values it's better to use these escape sequences, since (AFAIK) the C standard doesn't guarantee that your string will be encoded using ASCII. So when you compile, for example, on an EBCDIC system, your m would be represented as byte value 148 instead of 109.
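A small sketch of that point: the hex and octal escapes always produce the same byte, while the plain letter only matches on an ASCII-based implementation.
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Three spellings of the byte value 109 (hex 0x6D, octal 155). */
    const char hex[]    = "\x6d";
    const char octal[]  = "\155";
    const char letter[] = "m";   /* only equal on an ASCII-based system */

    printf("hex == octal: %s\n", strcmp(hex, octal) == 0 ? "yes" : "no");
    printf("hex == \"m\" : %s\n", strcmp(hex, letter) == 0 ? "yes (ASCII)" : "no (e.g. EBCDIC)");
    return 0;
}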

Related

ASCII, ISO 8859-1, Unicode in C how does it work?

Well, I'm really in doubt about how C works with encodings. I have a C file, test1.c, saved with ISO 8859-1 encoding. When running the program, the character ÿ is not displayed correctly on the Linux console. I know that by default the console uses UTF-8, but if UTF-8 shares its first 256 characters with ISO 8859-1, why doesn't the program correctly display the 'ÿ' character? Another question: why does test2 correctly display the 'ÿ' character, where the test2.c file is UTF-8 and file.txt is also UTF-8? In other words, shouldn't the compiler have complained about the constant being multi-character?
test1.c
// ISO 8859-1
#include <stdio.h>

int main(void)
{
    unsigned char c = 'ÿ';
    putchar(c);
    return 0;
}
$ gcc -o test1 test1.c
$ ./test1
$ ▒
test2.c
// ASCII
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.txt", "r+");
    int c;

    while ((c = fgetc(fp)) != EOF)
        putchar(c);
    return 0;
}
file.txt: UTF-8
abcdefÿghi
$ gcc -o test2 test2.c
$ ./test2
$ abcdefÿghi
Well, that's it. If you can help me by giving some details about this, I would be very grateful. :)
Character encodings can be confusing for many reasons. Here are some explanations:
In the ISO 8859-1 encoding, the character y with a diaeresis ÿ (originally a ligature of i and j) is encoded as a byte value of 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.
When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.
Adding to the confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non-portable (the value can be 0xC3BF or 0xBFC3 depending on the system); using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).
Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.
On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX, or the special negative value of EOF, so the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF or 255, which compares unequal to 'ÿ' if char is signed, and also unequal to 'ÿ' if the source code is in UTF-8.
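To make the pitfall concrete, here is a minimal sketch (an illustration only) of why the result of getchar() must be kept in an int and compared against byte values rather than against non-ASCII character constants:
#include <stdio.h>

int main(void)
{
    int c;  /* must be int, not char, so that EOF (-1) stays distinct from 0xFF */

    while ((c = getchar()) != EOF) {
        if (c == 0xFF)                  /* the byte value, 255 */
            puts("found an ISO 8859-1 'ÿ' byte");
        /* Comparing c against 'ÿ' directly would be wrong: with signed char
           and Latin-1 source that constant is -1, and with UTF-8 source it is
           a multibyte constant; neither ever equals 255. */
    }
    return 0;
}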
As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents and configure the compiler to make char unsigned by default (-funsigned-char).
If you deal with foreign languages, using UTF-8 is highly recommended for all textual contents, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding, it is quite simple and elegant, and use libraries to handle textual transformations such as uppercasing.
The issue here is that unsigned char represents an unsigned integer of size 8 bits (from 0 to 255). Most C implementations use ASCII values to represent characters; an ASCII character is simply an integer from 0 to 127. For example, A is 65.
When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character (ASCII only covers 0 to 127). Its value would fit in an unsigned char in a single-byte encoding such as ISO 8859-1, but your source file is UTF-8, where ÿ is encoded as two bytes, so 'ÿ' becomes a multibyte character constant that does not fit in a single char.
So that's why the first example didn't work.
Now for the second one. A non-ASCII character cannot fit into a single char here. The way you can handle characters outside the limited ASCII set is by using several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. With the UTF-8 representation, this means that your file contains the two 8-bit values 0xC3 and 0xBF.
When you read your file in the while loop of test2.c, at some point c takes the value 0xC3, and then 0xBF on the next iteration. These two values are passed to putchar, and when displayed, the two values together are interpreted as ÿ.
When putchar finally writes the characters, they eventually are read by your terminal application. If it supports UTF-8 encoding, it can understand the meaning of 0xC3 followed by 0xBF and display a ÿ.
So the reason why, in the first example, you didn't see ÿ is that the value of c in your code is only one byte of that two-byte sequence (most likely 0xBF, the low byte of the multibyte character constant), which on its own is not a valid UTF-8 sequence and therefore doesn't represent any displayable character.
A more concrete example:
#include <stdio.h>

int main(void)
{
    char y[3] = { 0xC3, 0xBF, '\0' };
    printf("%s\n", y);
}
This will display ÿ but as you can see, it takes 2 chars to do that.
"if utf-8 uses the same 256 characters as ISO 8859-1": no, there is a confusion here. In ISO-8859-1 (aka Latin-1) the 256 characters do indeed have the code point values of the corresponding Unicode characters. But UTF-8 has a special encoding for all characters above 0x7F, and every character with a code point between 0x80 and 0xFF is represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xE9 in ISO-8859-1, but as the 2 bytes 0xC3 0xA9 in UTF-8.
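To make the difference tangible, here is a minimal sketch (just an illustration) that spells out the bytes of é in both encodings:
#include <stdio.h>

int main(void)
{
    /* The same character 'é' (U+00E9), spelled out byte by byte. */
    const char latin1[] = "\xe9";      /* one byte in ISO 8859-1 */
    const char utf8[]   = "\xc3\xa9";  /* two bytes in UTF-8     */

    printf("Latin-1 length: %zu byte(s)\n", sizeof latin1 - 1);
    printf("UTF-8   length: %zu byte(s)\n", sizeof utf8 - 1);
    printf("%s\n", utf8);  /* displays 'é' only on a UTF-8 terminal */
    return 0;
}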
More references on the wikipedia page.
It's hard to reproduce on MacOS with clang:
$ gcc -o test1 test1.c
test1.c:6:23: warning: illegal character encoding in character literal [-Winvalid-source-encoding]
unsigned char c = '<FF>';
^
1 warning generated.
$ ./test1
?
$ gcc -finput-charset=iso-8859-1 -o test1 test1.c
clang: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'
clang on MacOS has UTF-8 as default.
Encoded in UTF-8:
$ gcc -o test1 test1.c
test1.c:6:23: error: character too large for enclosing character literal type
unsigned char c = 'ÿ';
^
1 error generated.
Working through the warnings and errors, we arrive at a solution using a proper string literal, which gives us an array of bytes:
// UTF-8
#include <stdio.h>
// needed for strlen
#include <string.h>

int main(void)
{
    char c[] = "ÿ";
    int len = (int)strlen(c);

    printf("len: %d c[0]: %u \n", len, (unsigned char)c[0]);
    putchar(c[0]);
    return 0;
}
$ ./test1
len: 2 c[0]: 195
?
Decimal 195 is hexadecimal C3, which is exactly the first byte of the UTF-8 byte sequence of the character ÿ:
$ uni identify ÿ
cpoint dec utf-8 html name
'ÿ' U+00FF 255 c3 bf ÿ LATIN SMALL LETTER Y WITH DIAERESIS (Lowercase_Letter)
^^ <-- HERE
Now we know that we must output 2 bytes, so we code:
char c[] = "ÿ";
int len = (int)strlen(c);

for (int i = 0; i < len; i++) {
    putchar(c[i]);
}
printf("\n");
$ ./test1
ÿ
Program test2.c just reads bytes and outputs them. If the input is UTF-8 then the output is also UTF-8. This just keeps the encoding.
To convert Latin-1 to UTF-8 we need to pack the bits in a special way. For a two-byte UTF-8 sequence we need a leading byte 110x xxxx (the number of leading 1 bits gives the length of the sequence in bytes) and a continuation byte 10xx xxxx.
We can code now:
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint8_t latin1 = 255; // code point of 'ÿ' U+00FF 255

    uint8_t byte1 = 0b11000000 | ((latin1 & 0b11000000) >> 6);
    uint8_t byte2 = 0b10000000 | (latin1 & 0b00111111);

    putchar(byte1);
    putchar(byte2);
    printf("\n");
    return 0;
}
$ ./test1
ÿ
This works only for ISO-8859-1 ("true" Latin-1). Many files called "Latin-1" are encoded in Windows/Microsoft CP1252.
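For illustration, the conversion above could be wrapped in a small helper that also passes ASCII bytes through unchanged (a sketch, assuming the input really is ISO-8859-1; bytes 0x80-0x9F would need special treatment if the input might actually be CP1252):
#include <stdio.h>
#include <stdint.h>

/* Convert one ISO-8859-1 byte to UTF-8; returns the number of bytes written.
   Note: bytes 0x80-0x9F are control codes in ISO-8859-1 but printable
   characters in CP1252, so this mapping is only right for true Latin-1. */
static int latin1_to_utf8(uint8_t latin1, uint8_t out[2])
{
    if (latin1 < 0x80) {              /* ASCII passes through unchanged */
        out[0] = latin1;
        return 1;
    }
    out[0] = 0xC0 | (latin1 >> 6);    /* leading byte: 110xxxxx */
    out[1] = 0x80 | (latin1 & 0x3F);  /* continuation byte: 10xxxxxx */
    return 2;
}

int main(void)
{
    uint8_t buf[2];
    int n = latin1_to_utf8(0xFF, buf);   /* 0xFF is 'ÿ' in ISO-8859-1 */
    fwrite(buf, 1, (size_t)n, stdout);
    putchar('\n');
    return 0;
}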

unsigned char in C not working as expected

Since unsigned char represents 0 - 255 and the extended ascii code for 'à' is 133, I expected the following C code to print 133
unsigned char uc;
uc='à';
printf("%hhu \n",uc);
Instead, both clang and gcc produce the following error
error: character too large for enclosing character literal type
uc='à';
^
What went wrong?
By the way, I copied à from a French-language website and pasted the result into the assignment statement. What I suspect is that the way I created à may not be valid.
Since unsigned char represents 0 - 255
This is true in most implementations, but the C standard does not require that a char be limited to 8 bits; it can be larger and support a larger range.
and the extended ascii code for 'à' is 133,
There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' probably uses the code point 224 (0xE0), which is most likely stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, all ASCII characters are stored as in ASCII and need 1 byte; all other characters are a combination of 2-4 bytes, and bit 7 is set in every one of them. I suggest you learn how UTF-8 works; it is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.
I expected the following C code to print 133
In UTF-8, the code point 0xE0 for à is stored as the two bytes 0xC3 0xA0. You can't store 0xC3 0xA0 in an 8-bit char, so clang reports an error.
You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough. GCC would store the value 0xC3A0 and not 0xE0, because that is the byte sequence inside the ''. However, C supports wide characters: the type wchar_t, which may support more characters, is most likely 32 or 16 bits on your system. To write a wide character literal, you use the prefix L. With a wide character literal, the compiler stores the correct value of 0xE0.
Change the code to:
#include <wchar.h>
....
wchar_t wc;
wc=L'à';
printf("%u \n",(unsigned)wc);

itoa providing 7-bit output to character input

I am trying to convert a character to binary using the built-in library function itoa in C (gcc v5.1), following the example from Conversion of Char to Binary in C, but I'm getting a 7-bit output for a character input to the itoa function. Since a character in C is essentially an 8-bit integer, I expected an 8-bit output. Can anyone explain why this is so?
code performing binary conversion:
for (temp = pt; *temp; temp++)
{
    itoa(*temp, opt, 2);   // convert to binary
    printf("%s \n", opt);
    strcat(s1, opt);       // add to encrypted text
}
PS:- This is my first question in stackoverflow.com, so sorry for any mistakes in advance.
You could use printf( "%02X\n", *opt ); to print an 8-bit value as 2 hexadecimal digits.
It would print the first char of opt. Then, you must increment the pointer to the next char with opt++;.
The X means you want it printed as uppercase hexadecimal (use x for lowercase), and the 02 makes sure 2 digits are printed, zero-padded, even if the value is less than 0x10.
In other words, the value 0xF will be printed as 0F (use something like printf( "%#04X\n", *opt ); if you also want the 0X prefix).
If you absolutely want a binary value you have to make a function that will print the right 0 and 1. There are many of them on the internet. If you want to make yours, reading about bitwise operations could help you (you have to know about bitwise operations if you want to work with binaries anyways).
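As a rough sketch, such a function could look like this (print_byte_binary is a made-up name, not a library routine); it walks the bits from most significant to least significant, so leading zeros are kept:
#include <stdio.h>

/* Print one byte as 8 binary digits, including leading zeros. */
static void print_byte_binary(unsigned char byte)
{
    for (int bit = 7; bit >= 0; bit--)
        putchar((byte >> bit) & 1 ? '1' : '0');
}

int main(void)
{
    const char *text = "Hi";
    for (const char *p = text; *p; p++) {
        print_byte_binary((unsigned char)*p);
        putchar('\n');
    }
    return 0;
}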
Now that you can print it as desired (hex as above or with your own binary function), you can redirect the output of printf with the sprintf function.
Its prototype is int sprintf( char* str, const char* format, ... ). str is the destination.
In your case, you will just need to replace the strcat line with something like sprintf( s1, "%02X\n", *opt); or sprintf( s1, "%s\n", your_binary_conversion_function(opt) );.
Note that using the former, you would have to increment s1 by 2 for each char in opt, because one 8-bit value is 2 hexadecimal digits.
You might also have to manage s1's memory by yourself, if it was not the case before.
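A minimal sketch of that idea (the buffer size and the sample input are assumptions; the names pt and s1 are taken from the question), appending at an offset so that each sprintf call does not overwrite the previous output:
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *pt = "AB";          /* stands in for the question's input     */
    char s1[64] = "";               /* must be large enough: 2 chars per byte */
    size_t used = 0;

    for (const char *temp = pt; *temp; temp++) {
        /* Append two hex digits per input byte; writing at s1 + used
           avoids overwriting what was already converted. */
        used += (size_t)sprintf(s1 + used, "%02X", (unsigned char)*temp);
    }
    printf("%s\n", s1);             /* prints 4142 */
    return 0;
}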
Sources :
MK27's second answer
sprintf prototype
The function itoa takes an int argument for the value to be converted. If you pass it a value of char type it will be promoted to int. So how would the function know how many leading zeros you were expecting? And then if you had asked for radix 10 how many leading zeros would you expect? Actually, it suppresses leading zeros.
See an ASCII table; note the hex column and that for ASCII characters the msb is 0. Printable ASCII characters range from 0x20 through 0x7E. Unicode shares the characters 0x00 through 0x7F.
Hex 20 thru 7f are binary 00100000 thru 01111111.
Not all binary values are printable characters and in some encodings are not legal values.
ASCII, hexadecimal, octal and binary are just ways of representing binary values. Printable characters are another way, but not all binary values map to printable characters; this is the main reason data that needs to be displayed or treated as character text is generally converted to hex-ASCII or Base64.

String to ASCII code conversion and tie them

The code is C, compiled with gcc.
How do I append a string and a char, as in the following example?
unsigned char MyString [] = {"LOREM IPSUM" + 0x28 + "DOLOR"};
unsigned char MyString [] = {"LOREM IPSUM\050DOLOR"};
The \050 is an octal escape sequence, with octal 050 == 0x28. The language standard also provides hex escape sequences, but in "LOREM IPSUM\x28DOLOR" the escape would be interpreted as the three-digit hex sequence \x28D, which does not fit in the usual 8-bit char. Octal escapes always end after three digits, which makes them safer to use.
And while we're at it, there is no guarantee whatsoever that your escapes would be considered ASCII. There are machines using EBCDIC natively, you know, and compilers defaulting to UTF-8 -- which would get you into trouble as soon as you go beyond 0x7f. ;-)
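If you prefer the hex escape anyway, one common workaround (sketched below) is to split the string so that adjacent-literal concatenation terminates the escape before the D:
#include <stdio.h>

int main(void)
{
    /* Adjacent string literals are concatenated by the compiler, so the
       hex escape is clearly terminated and cannot swallow the 'D'. */
    unsigned char MyString[] = "LOREM IPSUM" "\x28" "DOLOR";

    printf("%s\n", (char *)MyString);   /* 0x28 is '(' in ASCII */
    return 0;
}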

Strange C behaviour [duplicate]

What is happening here?
#include <stdio.h>
int main (void)
{
int x = 'HELL';
printf("%d\n", x);
return 0;
}
Prints 1212501068
I expected a compiling error.
Explanations are welcome =)
1212501068 in hex is 0x48454c4c.
0x48 is the ASCII code for H.
0x45 is the ASCII code for E.
0x4c is the ASCII code for L.
0x4c is the ASCII code for L.
Note that this behaviour is implementation-defined and therefore not portable. A good compiler would issue a warning:
$ gcc test.c
test.c: In function 'main':
test.c:4:11: warning: multi-character character constant [-Wmultichar]
In C, single quotes are used to denote characters, which are represented in memory by numbers. When you place multiple characters in single quotes, the compiler combines them into a single value however it wants, as long as it documents the process.
Looking at your number, 1212501068 is 0x48454C4C. If you decompose this number into bytes, you get 48 or 'H', 45 or 'E', and twice 4C or 'L'.
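To see that decomposition directly, here is a small illustration (it assumes the gcc value shown above and a 32-bit int) that peels the bytes off with shifts:
#include <stdio.h>

int main(void)
{
    int x = 'HELL';   /* implementation-defined value; 0x48454C4C with gcc */

    /* Peel off the bytes from most to least significant (assumes 32-bit int). */
    for (int shift = 24; shift >= 0; shift -= 8) {
        unsigned char byte = (unsigned char)((unsigned int)x >> shift);
        printf("%02X '%c'\n", byte, byte);
    }
    return 0;
}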
Others have explained what happened. As for the explanation, I quote from C99 draft standard (N1256):
6.4.4.4 Character constants
[...]
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g.,'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
The emphasis on the relevant sentence is mine.
The output of 1212501068 as hex is: 0x48 0x45 0x4C 0x4C
Look it up in an ASCII table, and you'll see those are the code for HELL.
BTW: the value of single quotes around a multi-char constant is not fixed by the standard.
The exact interpretation of single-quotes around multiple characters is Implementation-Defined. But it is very common that it either comes out as a Big-Endian or Little-Endian integer. (Technically, the implementation could interpret it any way it chooses, including a random value).
In other words, depending on the platform, I would not be surprised to see it come out as 0x4C 0x4C 0x45 0x48, or 1280066888.
And over on this question, and also on this site you can see practical uses of this behavior.
Line:
int x = 'HELL';
save to memory hex values of 'HELL' and it is 0x48454c4c == 1212501068.
The value is just 'HELL' interpreted as an int (usually 4 bytes).
If you try this:
#include <stdio.h>

int main(void)
{
    union {
        int x;
        char c[4];
    } u;
    int i;

    u.x = 'HELL';
    printf("%d\n", u.x);
    for (i = 0; i < 4; i++) {
        printf("'%c' %x\n", u.c[i], u.c[i]);
    }
    return 0;
}
You'll get:
1212501068
'L' 4c
'L' 4c
'E' 45
'H' 48
