This question already has answers here:
Why does printf not print out just one byte when printing hex?
(5 answers)
Closed 6 years ago.
Consider the simplified code below. I want to extract some binary data/stream from a file and print it to standard output in hexadecimal format.
I get 3 extra bytes, 0xFFFFFF. What's wrong? Where did the extra bytes come from?
output
in:
2000FFFFFFAF00690033005A00
out:
2000FFFFFFAF00690033005A00
program.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i;
    char raw[10] = {0x20,0x00,0xAF,0x00,0x69,0x00,0x33,0x00,0x5A,0x00};
    FILE *outfile;
    char *buf;

    printf("in:\n\t");
    for( i=0; i<10; i++ )
        printf("%02X", raw[i]);

    outfile = fopen("raw_data.bin", "w+b");
    fwrite(raw, 1, 10, outfile);

    buf = (char *) malloc (32 * sizeof(char));
    fseek(outfile, 0, SEEK_SET);
    fread(buf, 1, 10, outfile);

    printf("\nout:\n\t");
    for( i=0; i<10; i++ )
        printf("%02X", buf[i]);
    printf("\n");

    fclose(outfile);
    return 0;
}
Sign extension. Your compiler implements plain char as signed. When you pass the chars to printf they are sign-extended during their promotion to int. When the most significant bit is 0 this doesn't matter, because it gets extended with 0s.
0xAF in binary is 10101111. Since the most significant bit is 1, passing it to printf extends it with 1s in the conversion to int, making it 11111111111111111111111110101111, which is the 0xFFFFFFAF you see.
Solution: use unsigned char (instead of char) to prevent the sign extension from occurring in the call:
const unsigned char raw[] = {0x20,0x00,0xAF,0x00,0x69,0x00,0x33,0x00,0x5A,0x00};
All of these values in your original example are being sign extended, it's just that 0xAF is the only one with a 1 in the first bit.
Another, simpler example of the same behavior:
signed char c = 0xAF; // probably gives an overflow warning
int i = c; // extra 24 bits are all 1
assert( i == 0xFFFFFFAF );
That's because 0xAF, when converted from a signed char to a signed int, is negative (it is sign-extended), and the %02X format is for unsigned arguments, so it prints the converted value as FFFFFFAF.
The extra characters appear because printf's %x will never silently truncate digits off a value. Non-negative values get sign-extended as well, but that just adds zero bits; the value still fits in 2 hex digits, so %02X can make do with a two-digit output.
Note that there are 2 C dialects: one where plain char is signed, and one where it is unsigned. In yours it is signed. You may change it using an option, e.g. gcc and clang support -funsigned-char and -fsigned-char.
printf() is a variadic function and its additional arguments (corresponding to the ... part of its prototype) are subject to the default argument promotions, thus char is promoted to int.
As your char is signed1 with two's complement representation, the most significant bit is set for the 0xAF element. During promotion the sign bit is propagated, resulting in 0xFFFFFFAF of type int, as presumably sizeof(int) == 4 in your implementation.
By the way, you are invoking undefined behaviour, since the %X format specifier should be used with an object of type unsigned int, or at least an int whose MSB is unset (a common, widely accepted practice).
As suggested, you may consider using the unambiguous unsigned char type.
1) The implementation may choose between signed and unsigned representation of char. It's rather common for char to be signed, but you cannot take it for granted for every other compiler on the planet. Some of them let you choose between the two modes, as mentioned in Jens's answer.
Related
I have the following code to convert raw ASCII data to a hex string. The full C code can be found here
void str2hex(char* inputStr, char* outputStr)
{
    int i = 0;
    int counter = 0;

    while (inputStr[counter] != '\0')
    {
        sprintf((char*)(outputStr + i), "%02X", inputStr[counter]);
        i += 2;
        counter += 1;
    }
    outputStr[i++] = '\0';
}
It works fine for most values. But when I feed it the following input on stdin:
echo 11223344556677881122334455667788 | xxd -r -p | ./CProgram --stdin
It returns the following output
11223344556677FF11223344556677FF
As can be seen, instead of 88 it returns FF.
How can I adjust this code to get 88 instead of FF?
There are multiple issues all coalescing into your problem.
The first issue is that it is implementation-defined whether char is a signed or unsigned integer type. Your compiler seems to have a signed char type.
The second issue is that on most systems today, signed integers are represented using two's complement, where the most significant bit indicates the sign.
The third issue is that vararg functions like printf will do default argument promotion of its arguments. That means types smaller than int will be promoted to int. And that promotion will keep the value of the converted integer, which means negative values will be sign-extended. Sign-extension means that the most significant bit will be copied all the way to the "top" when extending the value. That means the signed byte 0xff will be extended to 0xffffffff when promoted to an int.
Now when your code tries to convert the byte 0x88 it will be treated as the negative number -120, not 136 as you might expect.
There are two possible solutions to this:
Explicitly use unsigned char for the input string:
void str2hex(const unsigned char* inputStr, char* outputStr);
Use the hh prefix in the printf format:
sprintf((char*)(outputStr+i),"%02hhX", inputStr[counter]);
This tells sprintf that the argument is a single byte, and will mask out the upper bits of the (promoted) integer.
This question already has answers here:
Printing hexadecimal characters in C
(7 answers)
Is char signed or unsigned by default?
(6 answers)
Closed 3 years ago.
I have a simple program.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    for (int i = 0; i < strlen(argv[1]); ++i)
        printf("%x ", argv[1][i]);
    printf("\n");
}
I run it like
$ ./program 111
31 31 31
But when I run it like
$ ./program ●●●
ffffffe2 ffffff97 ffffff8f ffffffe2 ffffff97 ffffff8f ffffffe2 ffffff97 ffffff8f
Here each ● should be encoded as 3 bytes in UTF-8 (e2 97 8f), but it looks like each byte is printed as a full int.
I don't understand where the ffffff comes from if sizeof(char) is always 1 byte.
printf() is a function accepting a variable number of arguments.
Any integer argument of a type shorter than int is automatically converted to type int.
Apparently, in your implementation, the little round "character" is composed of 3 chars, all with negative values.
Try these
printf("%x ", (unsigned char)argv[1][i]);
printf("%hhx ", argv[1][i]); // thanks to Jonathan Leffler
UTF-8 code units for multi-unit codepoints (everything but ASCII) all lie in the range 128 to 255, i.e. outside the ASCII range.
printf() is a vararg function, and all the arguments passed to the vararg part (all but the format string) are subject to the standard promotions.
As your implementation's plain char is an 8-bit signed two's-complement type, each UTF-8 code-unit value is negative (between -1 and -128), so after promotion you have an int with that value.
Then you lie to printf() by asserting the argument is unsigned (%x is for unsigned int), and with two's complement this undefined behavior prints a very big unsigned int.
You can get the right result by using %hhx, though strictly speaking you should cast the argument to unsigned char.
I don't understand where the ffffff comes from if sizeof(char) is always 1 byte.
By definition sizeof(char) is 1, but '●' is not a char in the C sense and produces 3 chars.
Your chars are visibly signed (plain char is signed by default in your case). Each input ● produces 3 negative values; because each char is converted to an int (32-bit in your case) and the %x format treats the argument as unsigned, you get this output.
You will get the same output from printf("%x", -30); -> ffffffe2
Note that for (int i = 0; i < strlen(argv[1]); ++i) is needlessly expensive: the length doesn't change, so better to save it, or just write for (int i = 0; argv[1][i] != 0; ++i)
It would also be better to check that argc is at least 2 before looking into argv[1].
This question already has answers here:
Why does printf not print out just one byte when printing hex?
(5 answers)
Printing hexadecimal characters in C
(7 answers)
How to print 1 byte with printf?
(4 answers)
Closed 5 years ago.
Code
char a;
a = 0xf1;
printf("%x\n", a);
Output
fffffff1
printf() shows 4 bytes, even though a holds exactly one byte.
What is the reason for this misbehavior?
How can I correct it?
What is the reason for this misbehavior?
This question looks strangely similar to another I have answered; it even contains a similar value (0xfffffff1). In that answer, I provide some information required to understand what conversion happens when you pass a small value (such as a char) to a variadic function such as printf. There's no point repeating that information here.
If you inspect CHAR_MIN and CHAR_MAX from <limits.h>, you're likely to find that your char type is signed, and so 0xf1 does not fit as an integer value inside of a char.
Instead, it ends up being converted in an implementation-defined manner, which for the majority of us means it's likely to end up with one of the high-order bits becoming the sign bit. When these values are promoted to int (in order to pass to printf), sign extension occurs to preserve the value (that is, a char that has a value of -1 should be converted to an int that has a value of -1 as an int, so too is the underlying representation for your example likely to be transformed from 0xf1 to 0xfffffff1).
printf("CHAR_MIN .. CHAR_MAX: %d .. %d\n", CHAR_MIN, CHAR_MAX);
printf("Does %d fit? %s\n", '\xFF',
       '\xFF' >= CHAR_MIN && '\xFF' <= CHAR_MAX ? "Yes!" : "No!");
printf("%d %X\n", (char) -1, (char) -1); // Both of these get converted to int
printf("%d %X\n", -1, -1); // ... and so are equivalent to these
How can I correct it?
Declare a with a type that can fit the value 0xf1, for example int or unsigned char.
printf is a variable argument function, so the compiler does its best but cannot check strict compliance between format specifier and argument type.
Here you're passing a char with a %x (integer, hex) format specifier.
So the value is promoted to a signed int (0xf1 > 127, so it becomes a negative char; char is signed on most systems, and on yours for sure).
Either:
change a to int (simplest)
change a to unsigned char (as suggested by BLUEPIXY) that takes care of the sign in the promotion
change format to %hhx as stated in the various docs (note that on my gcc 6.2.1 compiler hhx is not recognized, even if hx is)
note that the compiler warns you, before even reaching printf, that you have a problem:
gcc -Wall -Wpedantic test.c
test.c: In function 'main':
test.c:6:5: warning: overflow in implicit constant conversion [-Woverflow]
a = 0xf1;
You should use int a (or unsigned char a) instead of char a, because on your system plain char is signed and can only store values from -128 to 127, so 0xf1 (241) does not fit.
int has a storage size of 2 or 4 bytes, so it comfortably holds such a hex value.
#include <stdio.h>
#include <conio.h>

int main()
{
    char a = -128;
    while (a <= -1)
    {
        printf("%c\n", a);
        a++;
    }
    getch();
    return 0;
}
The output of the above code is same as the output of the code below
#include <stdio.h>
#include <conio.h>

int main()
{
    unsigned char a = +128;
    while (a <= +254)
    {
        printf("%c\n", a);
        a++;
    }
    getch();
    return 0;
}
Then why we use unsigned char and signed char?
K & R, chapter and verse, p. 43 and 44:
There is one subtle point about the conversion of characters to
integers. The language does not specify whether variables of type char
are signed or unsigned quantities. When a char is converted to an int,
can it ever produce a negative integer? The answer varies from machine
to machine, reflecting differences in architecture. On some machines,
a char whose leftmost bit is 1 will be converted to a negative integer
("sign extension"). On others, a char is promoted to an int by adding
zeros at the left end, and thus is always positive. [...] Arbitrary
bit patterns stored in character variables may appear to be negative
on some machines, yet positive on others. For portability, specify
signed or unsigned if non-character data is to be stored in char
variables.
With printing characters - no difference:
With "%c", printf() takes the int argument, converts it to unsigned char, and then prints the corresponding character.
char a;
printf("%c\n",a); // a is converted to int, then passed to printf()
unsigned char ua;
printf("%c\n",ua); // ua is converted to int, then passed to printf()
With printing values (numbers) - difference when system uses a char that is signed:
char a = -1;
printf("%d\n",a); // --> -1
unsigned char ua = -1;
printf("%d\n",ua); // --> 255 (Assume 8-bit unsigned char)
Note: Rare machines will have int the same size as char and other concerns apply.
So if code uses a as a number rather than a character, the printing differences are significant.
The bit representation of a number is what the computer stores, but it doesn't mean anything without someone (or something) imposing a pattern onto it.
The difference between the unsigned char and signed char patterns is how we interpret the set bits. In one case we decide that zero is the smallest number and we can add bits until we get to 0xFF or binary 11111111. In the other case we decide that 0x80 is the smallest number and we can add bits until we get to 0x7F.
The reason we have the funny way of representing signed numbers (the latter pattern) is that it places zero (0x00) roughly in the middle of the sequence, and that 0xFF (which is -1, right before zero) plus 0x01 (which is 1, right after zero) adds up with a carry off the high end, leaving 0x00 (-1 + 1 = 0). Likewise -5 + 5 = 0 by the same mechanism.
For fun, there are a lot of bit patterns that mean different things. For example 0x2a might be what we call a "number" or it might be a * character. It depends on the context we choose to impose on the bit patterns.
Because unsigned char is used for one-byte integers in C89.
Note there are three distinct char related types in C89: char, signed char, unsigned char.
For character type, char is used.
unsigned char and signed char are used for one-byte integers, like short is used for two-byte integers. You should not really use signed char or unsigned char for characters. Nor should you rely on the ordering of those values.
Different types are created to tell the compiler how to "understand" the bit representation of one or more bytes. For example, say I have a byte which contains 0xFF. If it's interpreted as a signed char, it's -1; if it's interpreted as an unsigned char, it's 255.
In your case, a, no matter whether signed or unsigned, is promoted to int and passed to printf(), which then implicitly converts it to unsigned char before printing it as a character.
But let's consider another case:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a = -1;
    unsigned char b;
    memmove(&b, &a, 1);
    printf("%d %u", a, b);
}
It's practically acceptable to simply write printf("%d %u", a, a);. memmove() is used just to avoid undefined behaviour.
It's output on my machine is:
-1 255
Also, think about this ridiculous question:
Suppose sizeof (int) == 4, since arrays of characters (unsigned
char[]){UCHAR_MIN, UCHAR_MIN, UCHAR_MIN, UCHAR_MIN} to (unsigned
char[]){UCHAR_MAX, UCHAR_MAX, UCHAR_MAX, UCHAR_MAX} are same as
unsigned ints from UINT_MIN to UINT_MAX, then what is the point
of using unsigned int?
I was playing around with Unicode characters (without using wchar_t support), just for fun, using only the regular char data type. I noticed that when printing them in hex they showed up as a full 4 bytes instead of just one byte.
For ex. consider this c file:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *s = (char *) malloc(100);
    fgets(s, 100, stdin);
    while (s && *s != '\0') {
        printf("%x\n", *s);
        s++;
    }
    return 0;
}
After compiling with gcc and giving the 'cent' symbol (hex: c2 a2) as input, I get the following output
$ ./a.out
¢
ffffffc2: ?
ffffffa2: ?
a:
So instead of just printing c2 and a2, I got whole 4-byte values as if they were of type int.
Does this mean char is not really 1 byte in length, and ASCII just made it look like 1 byte?
Maybe the reason why the upper three bytes become 0xFFFFFF needs a bit more explanation?
The upper three bytes of the value printed for *s have a value of 0xFF due to sign extension.
The char value passed to printf is extended to an int before the call to printf.
This is due to C's default behaviour.
In the absence of signed or unsigned, the compiler can default to interpreting char as signed char or unsigned char. It is consistently one or the other unless explicitly changed with a command-line option or pragmas. In this case we can see that it is signed char.
In the absence of more information (prototypes or casts), C passes:
int: char, short, unsigned char and unsigned short are converted to int. It never passes a char, unsigned char or signed char as a single byte; it always passes an int.
unsigned int: this is the same size as int, so the value is passed without change.
The compiler needs to decide how to convert the smaller value to an int.
signed values: the upper bytes of the int are sign extended from the smaller value, which effectively copies the top, sign bit, upwards to fill the int. If the top bit of the smaller signed value is 0, the upper bytes are filled with 0. If the top bit of the smaller signed value is 1, the upper bytes are filled with 1. Hence printf("%x ",*s) prints ffffffc2
unsigned values are not sign extended, the upper bytes of the int are 'zero padded'
Hence the reason C can call a function without a prototype (though the compiler will usually warn about that)
So you can write, and expect this to run (though I would hope your compiler issues warnings):
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    signed char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";
    printf("schar[0]=%x schar[1]=%x uchar[0]=%x uchar[1]=%x\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
That prints:
schar[0]=70 schar[1]=ffffff80 uchar[0]=70 uchar[1]=80
The char value is interpreted by my (Mac's gcc) compiler as signed char, so the compiler generates code to sign-extend the char to int before the printf call.
Where the signed char value has its top (sign) bit set (\x80), the conversion to int sign extends the char value. The sign extension fills in the upper bytes (in this case 3 more bytes to make a 4 byte int) with 1's, which get printed by printf as ffffff80
Where the signed char value has its top (sign) bit clear (\x70), the conversion to int still sign extends the char value. In this case the sign is 0, so the sign extension fills in the upper bytes with 0's, which get printed by printf as 70
My example also shows the case where the value is unsigned char. In those two cases the value is not sign-extended, because the value is unsigned; instead it is extended to int with 0 padding. It might look like printf is only printing one byte because the adjacent three bytes of the value are 0, but it is printing the entire int. It just happens that the values are 0x00000070 and 0x00000080, because the unsigned char values were converted to int without sign extension.
You can force printf to only print the low byte of the int, by using suitable formatting (%hhx), so this correctly prints only the value in the original char:
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";
    printf("schar[0]=%hhx schar[1]=%hhx uchar[0]=%hhx uchar[1]=%hhx\n",
           schar[0], schar[1], uchar[0], uchar[1]);
    return 0;
}
This prints:
schar[0]=70 schar[1]=80 uchar[0]=70 uchar[1]=80
because printf interprets the %hhx to treat the int as an unsigned char. This does not change the fact that the char was sign extended to an int before printf was called. It is only a way to tell printf how to interpret the contents of the int.
In a way, for signed char *schar, the meaning of %hhx looks slightly misleading, but the %x format interprets the int as unsigned anyway, and (with my printf) there is no format to print hex for signed values (IMHO it would be confusing).
Sadly, ISO/ANSI/... don't freely publish our programming language standards, so I can't point to the specification, but searching the web might turn up working drafts. I haven't tried to find them. I would recommend "C: A Reference Manual" by Samuel P. Harbison and Guy L. Steele as a cheaper alternative to the ISO document.
HTH
No. printf is a variadic function; arguments in the variable part are promoted to int, and in this case the char was negative, so it gets sign-extended.
%x tells printf to print the value as an unsigned int, so the sign-extended int is reinterpreted as unsigned and printed out in full.