Are int and char represented using the same bits internally by gcc? - c

I was playing around with Unicode characters (without using wchar_t support), just for fun, using only the regular char data type. I noticed that when printing them in hex they showed up as a full 4 bytes instead of just one byte.
For example, consider this C file:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *s = (char *) malloc(100);
    fgets(s, 100, stdin);
    while (s && *s != '\0') {
        printf("%x\n", *s);
        s++;
    }
    return 0;
}
After compiling with gcc and entering the cent symbol (hex: c2 a2) as input, I get the following output:
$ ./a.out
¢
ffffffc2
ffffffa2
a
So instead of just printing c2 and a2, I got the whole 4 bytes, as if it were an int type.
Does this mean char is not really 1 byte in length, and ASCII just made it look like 1 byte?

Maybe the reason why the upper three bytes become 0xFFFFFF needs a bit more explanation?
The upper three bytes of the value printed for *s have the value 0xFF due to sign extension.
The char value passed to printf is promoted to an int before the call to printf; this is C's default argument promotion behaviour.
In the absence of signed or unsigned, the compiler can default to treating char as signed char or unsigned char. It is consistently one or the other unless explicitly changed with a command line option or pragmas. In this case we can see that it is signed char.
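If you want to check which choice your compiler made, a minimal sketch (not part of the original answer) using CHAR_MIN from <limits.h> works portably:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is 0 when plain char is unsigned, negative when it is signed. */
    if (CHAR_MIN < 0)
        printf("plain char is signed here (CHAR_MIN = %d)\n", CHAR_MIN);
    else
        printf("plain char is unsigned here (CHAR_MAX = %d)\n", CHAR_MAX);
    return 0;
}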
In the absence of more information (prototypes or casts), C passes:
int, so char, signed char, unsigned char, short and unsigned short are all converted to int. It never passes a char, unsigned char or signed char as a single byte; it always passes an int.
unsigned int is the same size as int, so the value is passed without change.
The compiler needs to decide how to convert the smaller value to an int:
signed values: the upper bytes of the int are sign-extended from the smaller value, which effectively copies the top (sign) bit upwards to fill the int. If the top bit of the smaller signed value is 0, the upper bytes are filled with 0s; if it is 1, they are filled with 1s. Hence printf("%x\n", *s) prints ffffffc2.
unsigned values: these are not sign-extended; the upper bytes of the int are zero-padded.
This is also why C can call a function without a prototype (though the compiler will usually warn about that).
So you can write, and expect this to run (though I would hope your compiler issues warnings):
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    signed char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";

    printf("schar[0]=%x schar[1]=%x uchar[0]=%x uchar[1]=%x\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
That prints:
schar[0]=70 schar[1]=ffffff80 uchar[0]=70 uchar[1]=80
The char value is interpreted by my compiler (gcc on a Mac) as signed char, so the compiler generates code to sign-extend the char to an int before the printf call.
Where the signed char value has its top (sign) bit set (\x80), the conversion to int sign-extends the char value. The sign extension fills the upper bytes (in this case 3 more bytes, to make a 4-byte int) with 1s, which printf prints as ffffff80.
Where the signed char value has its top (sign) bit clear (\x70), the conversion to int still sign-extends the char value, but the sign bit is 0, so the upper bytes are filled with 0s, which printf prints as 70.
The example also shows the unsigned char case. Those values are not sign-extended, because they are unsigned; they are extended to int with zero padding. It might look like printf is printing only one byte, but it is printing the entire int; the values just happen to be 0x00000070 and 0x00000080 because the unsigned char values were converted to int without sign extension.
You can force printf to print only the low byte of the int by using suitable formatting (%hhx), so this correctly prints only the value of the original char:
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";

    printf("schar[0]=%hhx schar[1]=%hhx uchar[0]=%hhx uchar[1]=%hhx\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
This prints:
schar[0]=70 schar[1]=80 uchar[0]=70 uchar[1]=80
because printf interprets %hhx as "treat the int argument as an unsigned char". This does not change the fact that the char was sign-extended to an int before printf was called; it is only a way of telling printf how to interpret the contents of that int.
In a way, for signed char *schar, the meaning of %hhx looks slightly misleading, but the %x format interprets its argument as unsigned anyway, and (with my printf) there is no format to print hex for signed values (IMHO it would be confusing).
Sadly, ISO/ANSI/... don't publish our programming language standards for free, so I can't point at the specification, but searching the web might turn up working drafts. I would recommend "C: A Reference Manual" by Samuel P. Harbison and Guy L. Steele as a cheaper alternative to the ISO document.
HTH

No. printf is a variadic function, and arguments to a variadic function that are smaller than int are promoted to int. In this case the char value was negative, so it gets sign-extended.

%x tells printf to print its argument as an unsigned int. The char is first promoted to an int (sign-extending it if it is negative), and printf then prints the resulting value in hex.
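For the code in the question, a minimal sketch of a fix is to cast each char to unsigned char before it is promoted, so the upper bytes become zero instead of copies of the sign bit (the string literal below simply hard-codes the cent sign's UTF-8 bytes for illustration):

#include <stdio.h>

int main(void)
{
    const char *s = "\xC2\xA2";              /* UTF-8 bytes of the cent sign */

    while (*s != '\0') {
        /* The cast strips the sign-extended upper bytes; the (unavoidable) */
        /* promotion to int then just zero-pads the value.                  */
        printf("%x\n", (unsigned char)*s);   /* prints c2, then a2          */
        s++;
    }
    return 0;
}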

Related

Convert raw ASCII data to Hex string

I have the following code to convert raw ASCII data to a hex string. The full C code can be found here:
void str2hex(char* inputStr, char* outputStr)
{
    int i = 0;
    int counter = 0;

    while (inputStr[counter] != '\0')
    {
        sprintf((char*)(outputStr + i), "%02X", inputStr[counter]);
        i += 2;
        counter += 1;
    }
    outputStr[i++] = '\0';
}
It works fine for most values, but when I try the following input from the terminal, piping the raw bytes in on stdin:
echo 11223344556677881122334455667788 | xxd -r -p | ./CProgram --stdin
instead of getting back
11223344556677881122334455667788
it returns the following output:
11223344556677FF11223344556677FF
As can be seen, instead of 88 it returns FF.
How can I adjust this code to get 88 instead of FF?
There are multiple issues all coalescing into your problem.
The first issue is that it is implementation-defined whether char is a signed or an unsigned integer type. Your compiler seems to have a signed char type.
The second issue is that on most systems today, signed integers are represented using two's complement, where the most significant bit indicates the sign.
The third issue is that variadic functions like printf perform default argument promotion on their arguments. That means types smaller than int are promoted to int, and the promotion preserves the value of the converted integer, which means negative values are sign-extended. Sign extension means the most significant bit is copied all the way to the "top" when extending the value, so the signed byte 0xff is extended to 0xffffffff when promoted to an int.
Now when your code tries to convert the byte 0x88, it is treated as the negative number -120, not 136 as you might expect.
There are two possible solutions to this:
Explicitly use unsigned char for the input string:
void str2hex(const unsigned char* inputStr, char* outputStr);
Use the hh prefix in the printf format:
sprintf((char*)(outputStr+i),"%02hhX", inputStr[counter]);
This tells sprintf that the argument is a single byte, and will mask out the upper bits of the (promoted) integer.
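Putting it together, here is a sketch of the corrected function using the first approach (same logic as the original, just with unsigned char; the main() driver and its test bytes are made up for illustration):

#include <stdio.h>

/* Treating each byte as unsigned keeps 0x88 from being sign-extended
 * before sprintf formats it. */
void str2hex(const unsigned char *inputStr, char *outputStr)
{
    int i = 0;
    int counter = 0;

    while (inputStr[counter] != '\0') {
        sprintf(outputStr + i, "%02X", inputStr[counter]);
        i += 2;
        counter += 1;
    }
    outputStr[i] = '\0';
}

int main(void)
{
    char hex[2 * 16 + 1];
    str2hex((const unsigned char *)"\x11\x22\x88", hex);
    printf("%s\n", hex);   /* prints 112288 */
    return 0;
}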

How many 'char' types are there in C?

I have been reading "The C Programming Language" by K&R, and I've come across this statement:
"plain chars are signed or unsigned"
So my question is: what is a plain char, and how is it any different from signed char and unsigned char?
In the code below, how is myPlainChar - 'A' different from mySignChar - 'A' and myUnsignChar - 'A'?
Can someone please explain the statement "Printable chars are always positive".
Note: Please write examples and explain. Thank you.
int main(void)
{
    char myPlainChar = 'A';
    signed char mySignChar = 'A';
    unsigned char myUnsignChar = 'A';
}
There are signed char and unsigned char. Whether char is signed or unsigned by default depends on compiler and its settings. Usually it is signed.
There is only one char type, just like there is only one int type.
But like with int you can add a modifier to tell the compiler if it's an unsigned or a signed char (or int):
signed char x1; // x1 can hold values from -128 to +127 (typically)
unsigned char x2; // x2 can hold values from 0 to +255 (typically)
signed int y1; // y1 can hold values from -2147483648 to +2147483647 (typically)
unsigned int y2; // y2 can hold values from 0 to +4294967295 (typically)
The big difference between plain, unmodified char and int is that int without a modifier will always be signed, but it is implementation-defined (i.e. it is up to the compiler) whether char without a modifier is signed or unsigned:
char x3; // Could be signed, could be unsigned
int y3; // Will always be signed
Plain char is the type spelled char without signed or unsigned prefix.
Plain char, signed char and unsigned char are three distinct integral types (yes, character values are (small) integers), even though plain char is represented identically to one of the other two. Which one is implementation-defined. This is different from, say, int: plain int is always the same type as signed int.
There's a subtle point here: if plain char is for example signed, then it is a signed type, and we say "plain char is signed on this system", but it's still not the same type as signed char.
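A quick C11 sketch (my own illustration, not from the question) that makes the distinctness visible: _Generic dispatches on the static type of its argument, and all three character types can appear as separate associations, which would be a constraint violation if any two of them were the same type.

#include <stdio.h>

/* Each association below names a distinct type; this compiles precisely
 * because char, signed char and unsigned char are three different types. */
#define TYPE_NAME(x) _Generic((x),           \
    char:          "char",                   \
    signed char:   "signed char",            \
    unsigned char: "unsigned char",          \
    default:       "something else")

int main(void)
{
    char c = 'A';
    signed char sc = 'A';
    unsigned char uc = 'A';

    printf("%s / %s / %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
    return 0;
}

This prints "char / signed char / unsigned char", regardless of whether plain char happens to be signed or unsigned on the system.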
The difference between these two lines
signed char mySignChar = 'A';
unsigned char myUnsignChar = 'A';
is exactly the same as the difference between these two lines:
signed int mySignInt = 42;
unsigned int myUnsignInt = 42;
The statement "Printable chars are always positive" means exactly what it says. On some systems some plain char values are negative; on all systems some signed char values are negative; and on all systems there is a character of each kind that is exactly zero. But none of those are printable. Unfortunately the statement is not necessarily correct: it holds for all characters in the basic execution character set, but not for the extended execution character set.
How many char types are there in C?
There is one char type, but there are 3 small character types: char, signed char and unsigned char. They are collectively called the character types in C.
char has the same range/size/ranking/encoding as signed char or unsigned char, yet is a distinct type.
what is a plain char and how is it any different from signed char and unsigned char?
They are 3 different types in C. A plain char will have the same range/size/rank/encoding as either signed char or unsigned char. In all cases its size is 1.
2. How is myPlainChar - 'A' different from mySignChar - 'A' and myUnsignChar - 'A'?
myPlainChar - 'A' will behave like one of the other two.
Typically mySignChar has a value in the range [-128...127] and myUnsignChar a value in the range [0...255], so a subtraction of 'A' (typically the value 65) produces a different range of possible answers.
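A hypothetical illustration of that range difference, assuming an 8-bit, two's complement machine where plain char is signed, using a byte whose top bit is set:

#include <stdio.h>

int main(void)
{
    signed char   sc = '\xC2';   /* -62 under the stated assumptions */
    unsigned char uc = '\xC2';   /* 194                              */

    /* Both operands are promoted to int before the subtraction. */
    printf("%d\n", sc - 'A');    /* -62 - 65  -> -127 */
    printf("%d\n", uc - 'A');    /* 194 - 65  ->  129 */
    return 0;
}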
Can someone please explain me the statement "Printable char's are always positive".
Portable C source code characters (the basic execution character set) are positive, so printing a source code file only prints characters with non-negative values.
When printing data with printf("%c", some_character_type) or putc(some_character_type), the value, whether positive or negative, is converted to an unsigned char before printing. Thus it is a character associated with a non-negative value that is printed.
C has isprint(int c), which "tests for any printing character including space". That function is only valid for values in the unsigned char range and the negative EOF; isprint(EOF) reports 0. So only non-negative values pass the isprint(int c) test.
C really has no way to print negative values as characters without undergoing a conversion to unsigned char.
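A small sketch of the usual idiom, converting to unsigned char before classifying or printing a byte (the byte value 0xA2 is just an arbitrary example):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = '\xA2';                      /* negative if plain char is signed */

    /* isprint() expects a value representable as unsigned char (or EOF),    */
    /* so cast first to avoid undefined behaviour on negative chars.         */
    if (isprint((unsigned char)c))
        printf("%c\n", c);                /* %c converts its int argument to unsigned char anyway */

    /* For a hex dump, the same cast stops sign extension from showing up.   */
    printf("%x\n", (unsigned char)c);     /* prints a2, not ffffffa2          */
    return 0;
}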
I think it means char without 'unsigned' in front of it, i.e.:
unsigned char a;
as opposed to
char a; // signed char on this compiler
So basically a variable is signed (for integers and, on most compilers, char) unless you use the keyword 'unsigned'.
That should answer the second question as well.
The third question: characters in the ASCII set are defined as non-negative values, i.e. the number -60 doesn't represent a character, but 65 does, namely 'A'.

Collateral effect using the sprintf function

How come when I use the sprintf function the value of the variable A somehow changes?
#include <stdio.h>

int main(void) {
    short int A = 8000;
    char byte_1[2] /* 0001 1111 0100 0000 */, total[4];

    sprintf(byte_1, "%i", A);
    printf("%s\n", byte_1); // displayed on the screen: 8000
    printf("%i\n", A);      // displayed on the screen: 12336
}
byte_1 is too short to receive the decimal representation of A: it only has space for 1 digit and the null terminator, and sprintf does not have this information, so it writes beyond the end of the byte_1 array, causing undefined behavior.
Make byte_1 larger; 12 bytes is a good start.
sprintf is inherently unsafe. Use snprintf, which protects against buffer overruns:
snprintf(byte_1, sizeof byte_1, "%i", A);
Here is a potential explanation for this unexpected output: imagine byte_1 is located in memory just before A. sprintf converts the value of A to five characters '8', '0', '0', '0' and '\0', which overflow the end of byte_1 and overwrite the value of variable A itself. When you later print the value of A with printf, A no longer holds 8000, but 12336 (0x3030, the bytes of two '0' characters)... just one of an infinite range of possible effects of undefined behavior.
Try this corrected version:
#include <stdio.h>

int main(void) {
    short int A = 8000;
    char byte_1[12], total[4];

    snprintf(byte_1, sizeof byte_1, "%i", A);
    printf("%s\n", byte_1);
    printf("%i\n", A);
    return 0;
}
The text representation of the value stored in A is "8000" - that's four characters plus the string terminator, so byte_1 needs to be at least 5 characters wide. If you want byte_1 to be able to store the representation of any unsigned int, you should make it more like 12 characters wide:
char byte_1[12];
Two characters is not enough to store the string "8000", so when sprintf writes to byte_1, the extra characters most likely overwrite A.
Also note that the correct conversion specifier for an unsigned int is %u, not %i. This will matter when trying to format very large unsigned values where the most significant bit is set. %i will attempt to format that as a negative signed value.
Edit
As chrqlie pointed out, the OP had declared A as short int - for some reason, another answer had changed that to unsigned int and that stuck in my head. Strictly speaking, the correct conversion specifier for a short int is %hd if you want signed decimal output.
For the record, here's a list of some common conversion specifiers and their associated types:
Specifier   Argument type    Output
---------   -------------    ------
i, d        int              Signed decimal integer
u           unsigned int     Unsigned decimal integer
x, X        unsigned int     Unsigned hexadecimal integer
o           unsigned int     Unsigned octal integer
f           float, double    Signed decimal float
s           char *           Text string
c           char             Single character
p           void *           Pointer value, implementation-defined
For short and long types, there are some length modifiers:
Specifier   Argument type    Output
---------   -------------    ------
hd          short            Signed decimal integer
hhd         char             Signed decimal integer
ld          long             Signed decimal integer
lld         long long        Signed decimal integer
Those same modifiers can be applied to u, x, X, o, etc.
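For instance, a small sketch (arbitrary values, not from the question) exercising a few of these specifiers and modifiers:

#include <stdio.h>

int main(void)
{
    short        s  = 8000;
    long long    ll = 123456789012345LL;
    unsigned int u  = 4000000000u;        /* too big for a 32-bit signed int */

    printf("%hd\n",  s);    /* 8000            */
    printf("%lld\n", ll);   /* 123456789012345 */
    printf("%u\n",   u);    /* 4000000000      */
    printf("%x\n",   u);    /* ee6b2800        */
    return 0;
}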
byte_1 is too small for the four digits of A's value. It only has enough room for a single digit and the null (\0) terminator. If you make byte_1 an array of 5 bytes, one for each digit plus the null byte, it will be able to fit the string "8000".
#include <stdio.h>

int main(void) {
    unsigned int A = 8000;
    char byte_1[5], total[4];

    sprintf(byte_1, "%i", A);
    printf("%s\n", byte_1);
    printf("%i\n", A);
    return 0;
}
Basically, messing around with memory and trying to put values into variables that are too small for them is undefined behavior. It may compile, but it is objectively dangerous in C, and no program should access memory like this.
sprintf(byte_1, "%i", A);
The format specifier needs to agree with the variable's type.
I suggest the following change:
sprintf(byte_1, "%c", A);
printf("%s\n", byte_1);
EDIT: An additional change, after performing the one above, is to also make A the same type as the elements of byte_1. This will force you to change the value in your example to fit the range of char. Notice that relying on a function to protect you from overflow is just a bad solution; instead, it is your responsibility as the designer of this code to choose the proper tools for the job. When working with char variables, you need to use char-sized containers; the same goes for integers, floats, strings, etc. If you have a kilogram of sugar, you want a 1 kg container to hold it; you wouldn't use a 250 g cup because, as you have seen, it overflows. Happy coding in C!

Since characters from -128 to -1 are the same as characters from +128 to +255, what is the point of using unsigned char?

#include <stdio.h>
#include <conio.h>

int main()
{
    char a = -128;
    while (a <= -1)
    {
        printf("%c\n", a);
        a++;
    }
    getch();
    return 0;
}
The output of the above code is the same as the output of the code below:
#include <stdio.h>
#include <conio.h>

int main()
{
    unsigned char a = +128;
    while (a <= +254)
    {
        printf("%c\n", a);
        a++;
    }
    getch();
    return 0;
}
Then why do we use unsigned char and signed char?
K & R, chapter and verse, p. 43 and 44:
There is one subtle point about the conversion of characters to integers. The language does not specify whether variables of type char are signed or unsigned quantities. When a char is converted to an int, can it ever produce a negative integer? The answer varies from machine to machine, reflecting differences in architecture. On some machines, a char whose leftmost bit is 1 will be converted to a negative integer ("sign extension"). On others, a char is promoted to an int by adding zeros at the left end, and thus is always positive. [...] Arbitrary bit patterns stored in character variables may appear to be negative on some machines, yet positive on others. For portability, specify signed or unsigned if non-character data is to be stored in char variables.
With printing characters - no difference:
The function printf() with "%c" takes the int argument, converts it to unsigned char and then prints it.
char a;
printf("%c\n",a); // a is converted to int, then passed to printf()
unsigned char ua;
printf("%c\n",ua); // ua is converted to int, then passed to printf()
With printing values (numbers) - there is a difference when the system's char is signed:
char a = -1;
printf("%d\n",a); // --> -1
unsigned char ua = -1;
printf("%d\n",ua); // --> 255 (Assume 8-bit unsigned char)
Note: Rare machines will have int the same size as char and other concerns apply.
So if code uses a as a number rather than a character, the printing differences are significant.
The bit representation of a number is what the computer stores, but it doesn't mean anything without someone (or something) imposing a pattern onto it.
The difference between the unsigned char and signed char patterns is how we interpret the bits. In one case we decide that zero is the smallest number and we can add bits until we get to 0xFF, binary 11111111. In the other case we decide that 0x80 is the smallest number and we can add bits until we get to 0x7F.
The reason we have this funny way of representing signed numbers (the latter pattern) is that it places zero, 0x00, roughly in the middle of the sequence, and because 0xFF (which is -1, right before zero) plus 0x01 (which is 1, right after zero) add together and carry until all the bits carry off the high end, leaving 0x00 (-1 + 1 = 0). Likewise -5 + 5 = 0 by the same mechanism.
For fun, there are a lot of bit patterns that mean different things. For example 0x2a might be what we call a "number", or it might be a '*' character. It depends on the context we choose to impose on the bit pattern.
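For instance, a trivial sketch printing the same byte under different interpretations:

#include <stdio.h>

int main(void)
{
    unsigned char b = 0x2a;

    printf("%d\n", b);   /* 42  : the bits read as a number         */
    printf("%c\n", b);   /* '*' : the same bits read as a character */
    printf("%x\n", b);   /* 2a  : and as hex                        */
    return 0;
}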
Because unsigned char is used for one-byte integers in C89.
Note that there are three distinct char-related types in C89: char, signed char, unsigned char.
For character data, char is used.
unsigned char and signed char are used for one-byte integers, like short is used for two-byte integers. You should not really use signed char or unsigned char for characters, nor rely on the ordering of those values.
Different types are created to tell the compiler how to "understand" the bit representation of one or more bytes. For example, say I have a byte which contains 0xFF. If it's interpreted as a signed char, it's -1; if it's interpreted as a unsigned char, it's 255.
In your case, a, no matter whether signed or unsigned, is promoted to int and passed to printf(), which later implicitly converts it to unsigned char before printing it out as a character.
But let's consider another case:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a = -1;
    unsigned char b;
    memmove(&b, &a, 1);
    printf("%d %u", a, b);
}
It's practically acceptable to simply write printf("%d %u", a, a); - memmove() is used here just to avoid undefined behaviour.
Its output on my machine is:
-1 4294967295
Also, think about this ridiculous question:
Suppose sizeof (int) == 4. Since arrays of characters (unsigned char[]){UCHAR_MIN, UCHAR_MIN, UCHAR_MIN, UCHAR_MIN} to (unsigned char[]){UCHAR_MAX, UCHAR_MAX, UCHAR_MAX, UCHAR_MAX} are the same as unsigned ints from UINT_MIN to UINT_MAX, then what is the point of using unsigned int?

printf adds extra `FFFFFF` to hex print from a char array [duplicate]

This question already has answers here: Why does printf not print out just one byte when printing hex? (5 answers)
Consider the simplified code below. I want to extract some binary data/stream from a file and print it to the standard output in hexadecimal format.
I got 3 extra bytes, 0xFFFFFF. What's wrong? Where did the extra bytes come from?
output
in:
2000FFFFFFAF00690033005A00
out:
2000FFFFFFAF00690033005A00
program.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i;
    char raw[10] = {0x20,0x00,0xAF,0x00,0x69,0x00,0x33,0x00,0x5A,0x00};
    FILE *outfile;
    char *buf;

    printf("in:\n\t");
    for (i = 0; i < 10; i++)
        printf("%02X", raw[i]);

    outfile = fopen("raw_data.bin", "w+b");
    fwrite(raw, 1, 10, outfile);

    buf = (char *) malloc(32 * sizeof(char));
    fseek(outfile, 0, SEEK_SET);
    fread(buf, 1, 10, outfile);

    printf("\nout:\n\t");
    for (i = 0; i < 10; i++)
        printf("%02X", buf[i]);
    printf("\n");

    fclose(outfile);
    return 0;
}
Sign extension. Your compiler implements char as a signed char. When you pass the chars to printf they are all sign-extended during their promotion to int. When the top bit is 0 this doesn't matter, because the value gets extended with 0s.
0xAF in binary is 10101111. Since the top bit is 1, when it is passed to printf it is extended with all 1s in the conversion to int, making it 11111111111111111111111110101111, which is 0xFFFFFFAF, the hex value you see.
Solution: use unsigned char (instead of char) to prevent the sign extension from occurring in the call:
const unsigned char raw[] = {0x20,0x00,0xAF,0x00,0x69,0x00,0x33,0x00,0x5A,0x00};
All of the values in your original example are being sign-extended; it's just that 0xAF is the only one with a 1 in its top bit.
Another, simpler example of the same behavior:
signed char c = 0xAF; // probably gives an overflow warning
int i = c; // extra 24 bits are all 1
assert( i == 0xFFFFFFAF );
That's because 0xAF, when converted from a signed character to a signed integer, is negative (it is sign-extended), and the %02X format is for unsigned arguments, so it prints the converted value as FFFFFFAF.
The extra characters appear because printf %X never silently truncates digits off a value. Non-negative values get sign-extended as well, but that just adds zero bits, and the value still fits in 2 hex digits, so %02X can make do with a two-digit output.
Note that there are 2 C dialects: one where plain char is signed, and one where it is unsigned. In yours it is signed. You may change it with a compiler option; e.g. gcc and clang support -funsigned-char and -fsigned-char.
printf() is a variadic function, and its additional arguments (those corresponding to the ... part of its prototype) are subject to the default argument promotions, so char is promoted to int.
As your char is signed¹ and uses a two's complement representation, the most significant bit is set for the 0xAF element. During promotion the sign bit is propagated, resulting in 0xFFFFFFAF of type int, as presumably sizeof(int) == 4 in your implementation.
By the way, you are invoking undefined behaviour, since the %X format specifier should be used for an object of type unsigned int, or at least for an int whose MSB is unset (this is a common, widely accepted practice).
As suggested, you may consider using the unambiguous unsigned char type.
¹ An implementation may choose between signed and unsigned representations of char. It is rather common for char to be signed, but you cannot take that for granted for every compiler on the planet. Some of them allow choosing between the two modes, as mentioned in Jens's answer.
