Sign extension with unsigned long long - c

We found some strange values being produced; a small test case is below.
This prints "FFFFFFFFF9A64C2A", meaning the unsigned long long seems to have been sign extended.
But why?
All the types below are unsigned, so what's doing the sign extension? The expected output
would be "F9A64C2A".
#include <stdio.h>
int main(int argc,char *argv[])
{
unsigned char a[] = {42,76,166,249};
unsigned long long ts;
ts = a[0] | a[1] << 8U | a[2] << 16U | a[3] << 24U;
printf("%llX\n",ts);
return 0;
}

In the expression a[3] << 24U, the a[3] has type unsigned char. Now, the "integer promotion" converts it to int because:
The following may be used in an expression wherever an int or unsigned int may
be used:
[...]
If an int can represent all values of the original type, the value is converted to
an int;
otherwise, it is converted to an unsigned int.
((draft) ISO/IEC 9899:1999, 6.3.1.1 2)
Please note also that the shift operators (unlike most other operators) do not perform the "usual arithmetic conversions" that convert both operands to a common type. Instead:
The type of the result is that of the promoted left operand.
(6.5.7 3)
On a 32-bit platform, 249 << 24 = 4177526784 does not fit in an int; interpreted as an int it has its sign bit set, and that negative value is what gets converted to unsigned long long.
Just changing to
ts = a[0] | a[1] << 8 | a[2] << 16 | (unsigned)a[3] << 24;
fixes the issue (the suffix U for the constants has no impact).
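As a minimal sketch of that fix, assuming <stdint.h> provides uint32_t: the helper name load_le32 is hypothetical, and uint32_t is used in place of the plain unsigned cast so the shift by 24 also stays in range where int is only 16 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper: assemble four little-endian bytes into one value.
 * Each byte is converted to uint32_t before shifting, so the arithmetic
 * stays unsigned and nothing can be sign-extended later. */
unsigned long long load_le32(const unsigned char a[4])
{
    uint32_t v = (uint32_t)a[0]
               | (uint32_t)a[1] << 8
               | (uint32_t)a[2] << 16
               | (uint32_t)a[3] << 24;
    return v; /* widening an unsigned value zero-extends it */
}
```

With the bytes from the question, {42, 76, 166, 249}, this should yield 0xF9A64C2A with no leading F digits.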

ts = ((unsigned long long)a[0]) |
((unsigned long long)a[1] << 8U) |
((unsigned long long)a[2] << 16U) |
((unsigned long long)a[3] << 24U);
Casting prevents converting intermediate results to default int type.

Some of the shifted a[i], when automatically converted from unsigned char to int, produce sign-extended values.
This is in accord with section 6.3.1 Arithmetic operands, subsection 6.3.1.1 Boolean, characters, and integers, of C draft standard N1570, which reads, in part, "2. The following may be used in an expression wherever an int or unsigned int may be used: ... — An object or expression with an integer type (other than int or unsigned int)
whose integer conversion rank is less than or equal to the rank of int and unsigned int. ... If an int can represent all values of the original type ..., the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. ... 3. The integer promotions preserve value including sign."
See eg www.open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf
You could use code like the following, which works ok:
int i;
for (i=3, ts=0; i>=0; --i) ts = (ts<<8) | a[i];

Related

C uses different data type for arithmetic in the middle of an expression?

In Go (the language I'm most familiar with), the result of a mathematical operation is always the same data type as the operands, meaning if the operation overflows, the result will be incorrect. For example:
func main() {
var a byte = 100
var b byte = 9
var r byte = (a << b) >> b
fmt.Println(r)
}
This prints 0, as all the bits are shifted out of the bounds of a byte during the initial << 9 operation, then zeroes are shifted back in during the >> 9 operation.
However, this isn't the case in C:
int main() {
unsigned char a = 100;
unsigned char b = 9;
unsigned char r = (a << b) >> b;
printf("%d\n", r);
return 0;
}
This code prints 100. Although this yields the "correct" result, this is unexpected to me, as I'd only expect promotion if one of the operands were larger than a byte, but in this case all operands are bytes. It's as though the temporary variable holding the result of the << 9 operation is larger than the resulting variable, and is only downcast back to a byte after the full RHS is evaluated, and thus after the >> 9 operation restores the bits.
Obviously, if explicitly storing the result of the << 9 into a byte before continuing, you get the same result as in Go:
int main() {
unsigned char a = 100;
unsigned char b = 9;
unsigned char c = a << b;
unsigned char r = c >> b;
printf("%d\n", r);
return 0;
}
This isn't merely the case with bitwise operators. I've tested with multiplication/division too, and it demonstrates the same behaviour.
My question is: is this behaviour of C defined? If so, where? Does it actually use a specific data type for the interim values of a complex expression? Or is this actually undefined behaviour, like an incidental result of the operations being performed in a 32/64 bit CPU register before being saved back to memory?
C 2018 6.5.7 discusses the shift operators. Paragraph 3 says:
The integer promotions are performed on each of the operands…
6.3.1.1 2 specifies the integer promotions:
… If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.
Thus in a << b where a and b are unsigned char, a is promoted to int, which is at least 16 bits. (A C implementation may define unsigned char to be more than eight bits. It could be the same width as int. In this case, the integer promotions would not convert a or b.)
Note that if the integer promotions were not applied, the behavior of evaluating a << b with b equal to 9 would not be defined by the C standard, as the behavior of the shift operators is not defined for shift amounts greater than or equal to the width of the promoted left operand.
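The effect of the promotion on the shift example can be sketched as follows. The function names are hypothetical, and the sketch assumes an 8-bit unsigned char and an int of at least 32 bits, so that 100 << 9 (51200) fits in the promoted type.

```c
/* Assumes CHAR_BIT == 8 and int of at least 32 bits. */

/* The whole expression is evaluated in int: the bits shifted past
 * bit 7 survive until the final conversion back to unsigned char. */
unsigned char shift_in_int(unsigned char a, unsigned char b)
{
    return (unsigned char)((a << b) >> b);
}

/* Storing the intermediate result in an unsigned char discards the
 * bits above bit 7 before the right shift can bring them back. */
unsigned char shift_truncated(unsigned char a, unsigned char b)
{
    unsigned char c = (unsigned char)(a << b);
    return (unsigned char)(c >> b);
}
```

Under those assumptions, shift_in_int(100, 9) gives 100 and shift_truncated(100, 9) gives 0, matching the two C programs in the question.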
6.5.5 specifies the multiplicative operators. Paragraph 3 says:
The usual arithmetic conversions are performed on the operands.
6.3.1.8 specifies the usual arithmetic conversions:
… First, if the corresponding real type of either operand is long double, the other operand is converted, without change of type domain [complex or real], to a type whose corresponding real type is long double.
Otherwise, if the corresponding real type of either operand is double, the other operand is converted, without change of type domain, to a type whose corresponding real type is double.
Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.
Otherwise, the integer promotions are performed on both operands. Then the following rules are applied to the promoted operands:
If both operands have the same type, then no further conversion is needed.
Otherwise, if both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank is converted to the type of the operand with greater rank.
Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.
Rank has a technical definition that largely corresponds to width (number of bits in an integer type).
Thus, in a * b where a and b are unsigned char, they are both promoted to int (with the caveat above about wide unsigned char) and no further conversions are necessary. If one operand were wider than int, say long long int, while the other is unsigned char then both operands would be converted to that wider type.
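One way to check these conversions on a given implementation is C11's _Generic, which selects a branch from the type of an expression without evaluating it. This is only a sketch: the function names are hypothetical, it needs a C11 compiler, and it assumes unsigned char is narrower than int.

```c
/* Requires C11 (_Generic). Assumes unsigned char is narrower than int. */

/* Returns 1 if unsigned char * unsigned char has type int. */
int uchar_product_is_int(void)
{
    unsigned char a = 100, b = 9;
    return _Generic(a * b, int: 1, default: 0);
}

/* Returns 1 if unsigned char * long long has type long long. */
int mixed_product_is_long_long(void)
{
    unsigned char a = 100;
    long long b = 9;
    return _Generic(a * b, long long: 1, default: 0);
}

/* The product 100 * 9 = 900 is held in an int, so dividing by 9
 * recovers 100 even though 900 does not fit in 8 bits. */
unsigned char mul_div_in_int(unsigned char a, unsigned char b)
{
    return (unsigned char)((a * b) / b);
}
```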
Welcome to integer promotions! One behavior of the C language (an often criticized one, I'd add) is that types like char and short are promoted to int before doing any arithmetic operation with them, and the result is also int. What does this mean?
unsigned char foo(unsigned char x) {
return (x << 4) >> 4;
}
int main(void) {
if (foo(0xFF) == 0x0F) {
printf("Yay!\n");
}
else {
printf("... hey, wait a minute!\n");
}
return 0;
}
Needless to say, the above code prints ... hey, wait a minute!. Let's discover why:
// this line of code:
return (x << 4) >> 4;
// is converted to this (because of integer promotion):
return ((int) x << 4) >> 4;
Therefore, this is what happens:
x is unsigned char (8-bit) and its value is 0xFF,
x << 4 needs to be executed, but first x is converted to int (32-bit),
x << 4 becomes 0x000000FF << 4, and the result 0x00000FF0 is also int,
0x00000FF0 >> 4 is executed, yielding 0x000000FF,
finally, 0x000000FF is converted to unsigned char (because that's the return value of foo()), so it becomes 0xFF,
and that's why foo(0xFF) yields 0xFF instead of 0x0F.
How to prevent this? Simple: convert the result of x << 4 to unsigned char. In the previous example, 0x00000FF0 would have become 0xF0.
unsigned char foo(unsigned char x) {
return ((unsigned char) (x << 4)) >> 4;
}
foo(0xFF) == 0x0F
NOTE: in the previous examples, it is assumed that unsigned char is 8 bits and int is 32 bits, but the examples work for basically any situation in which CHAR_BIT == 8 (because C17 requires that sizeof(int) * CHAR_BIT >= 16).
P.S.: this answer is not as exhaustive as the C official standard document, of course. But you can find all the (valid and defined) behavior of C described in the latest draft of the ISO/IEC 9899:2018 standard (a.k.a. C17/C18).
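An alternative to the cast, sketched here under the same CHAR_BIT == 8 assumption, is to mask the promoted value down to its low 8 bits before shifting right. The function name clear_high_nibble is hypothetical.

```c
/* Assumes unsigned char is 8 bits (CHAR_BIT == 8). */

/* x << 4 is computed in int; masking with 0xFF keeps only the bits
 * that would have fit in an unsigned char, then the right shift
 * moves them back down. */
unsigned char clear_high_nibble(unsigned char x)
{
    return (unsigned char)(((x << 4) & 0xFF) >> 4);
}
```

For the input 0xFF this should give 0x0F, the same result as the cast version of foo() above.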

Why does it make a difference if left and right shift are used together in one expression or not?

I have the following code:
unsigned char x = 255;
printf("%x\n", x); // ff
unsigned char tmp = x << 7;
unsigned char y = tmp >> 7;
printf("%x\n", y); // 1
unsigned char z = (x << 7) >> 7;
printf("%x\n", z); // ff
I would have expected y and z to be the same. But they differ depending on whether an intermediate variable is used. It would be interesting to know why this is the case.
This little test is actually more subtle than it looks as the behavior is implementation defined:
unsigned char x = 255; no ambiguity here, x is an unsigned char with value 255, type unsigned char is guaranteed to have enough range to store 255.
printf("%x\n", x); This produces ff on standard output but it would be cleaner to write printf("%hhx\n", x); as printf expects an unsigned int for conversion %x, which x is not. Passing x might actually pass an int or an unsigned int argument.
unsigned char tmp = x << 7; To evaluate the expression x << 7, x being an unsigned char first undergoes the integer promotions defined in the C Standard 6.3.1.1: If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.
So if the number of value bits in unsigned char is smaller or equal to that of int (the most common case currently being 8 vs 31), x is first promoted to an int with the same value, which is then shifted left by 7 positions. The result, 0x7f80, is guaranteed to fit in the int type, so the behavior is well defined and converting this value to type unsigned char will effectively truncate the high order bits of the value. If type unsigned char has 8 bits, the value will be 128 (0x80), but if type unsigned char has more bits, the value in tmp can be 0x180, 0x380, 0x780, 0xf80, 0x1f80, 0x3f80 or even 0x7f80.
If int cannot represent all values of type unsigned char, which can occur on rare systems where sizeof(int) == 1, x is promoted to unsigned int and the left shift is performed on this type. The value is 0x7f80U, which is guaranteed to fit in type unsigned int and storing that to tmp does not actually lose any information since type unsigned char has the same size as unsigned int. So tmp would have the value 0x7f80 in this case.
unsigned char y = tmp >> 7; The evaluation proceeds the same as above, tmp is promoted to int or unsigned int depending on the system, which preserves its value, and this value is shifted right by 7 positions, which is fully defined because 7 is less than the width of the type (int or unsigned int) and the value is positive. Depending on the number of bits of type unsigned char, the value stored in y can be 1, 3, 7, 15, 31, 63, 127 or 255, the most common architecture will have y == 1.
printf("%x\n", y); again, it would be better to write printf("%hhx\n", y); and the output may be 1 (most common case) or 3, 7, f, 1f, 3f, 7f or ff depending on the number of value bits in type unsigned char.
unsigned char z = (x << 7) >> 7; The integer promotion is performed on x as described above, the value (255) is then shifted left 7 bits as an int or an unsigned int, always producing 0x7f80 and then right shifted by 7 positions, with a final value of 0xff. This behavior is fully defined.
printf("%x\n", z); Once more, the format string should be printf("%hhx\n", z); and the output would always be ff.
Systems where bytes have more than 8 bits are becoming rare these days, but some embedded processors, such as specialized DSPs still do that. It would take a perverse system to fail when passed an unsigned char for a %x conversion specifier, but it is cleaner to either use %hhx or more portably write printf("%x\n", (unsigned)z);
Shifting by 8 instead of 7 in this example would be even more contrived. It would have undefined behavior on systems with 16-bit int and 8-bit char.
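The portable form of the printf call suggested above can be sketched with snprintf, which makes the formatted text easy to inspect. The helper name format_hex is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: format an unsigned char as lowercase hex.
 * The explicit cast makes the argument an unsigned int, which is
 * exactly the type the %x conversion expects. */
int format_hex(char *buf, size_t size, unsigned char value)
{
    return snprintf(buf, size, "%x", (unsigned)value);
}
```

Called with a buffer of at least three bytes and the value 255, this writes the string ff and returns 2.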
The 'intermediate' values in your last case are (full) integers, so the bits that are shifted 'out of range' of the original unsigned char type are retained, and thus they are still set when the result is converted back to a single byte.
From this C11 Draft Standard:
6.5.7 Bitwise shift operators ... 3 The integer promotions are performed on each of the operands. The type of the
result is that of the promoted left operand ...
However, in your first case, unsigned char tmp = x << 7;, the tmp loses the seven 'high' bits when the resultant 'full' integer is converted (i.e. truncated) back to a single byte, giving a value of 0x80; when this is then right-shifted in unsigned char y = tmp >> 7;, the result is (as expected) 0x01.
The shift operator is not defined for the char types. The value of any char operand is converted to int and the result of the expression is converted to the char type.
So, when you put the left and right shift operators in the same expression, the calculation will be performed as type int (without losing any bits), and the result will be converted to char.

Unsigned char overflow with subtraction

I am trying to work out how unsigned overflow works with subtraction, so I wrote the following test to try it out:
#include<stdio.h>
#include<stdlib.h>
unsigned char minWrap(unsigned char a, unsigned char b) {
return a > b ? a - b : a + (0xff - b) + 1;
}
int main(int argc, char *argv[]) {
unsigned char a = 0x01, b = 0xff;
unsigned char c = a - b;
printf("0x%02x 0x%02x 0x%02x\n", a-b, c, minWrap(a,b));
return EXIT_SUCCESS;
}
Which gave as output:
0xffffff02 0x02 0x02
I would have expected the output to be the same three times. My question is: is it always safe to add/subtract unsigned chars and expect them to wrap around at 0xff?
Or more general, is it safe to compute with uintN_t and expect the result to be modulo 2^N?
is it always safe to add/subtract unsigned chars and expect them to wrap around at 0xff?
No. In C, objects of type char go through the usual integer promotions. So if the range of char fits in int (the usual case), it is converted to int, else to unsigned.
a - b --> 0x01 - 0xFF --> 1 - 255 --> -254.
The below is undefined behavior as %x does not match an int and the value of -254 is not in the unsigned range. A typical behavior is a conversion to unsigned:
printf("0x%02x\n", a-b);
// 0xffffff02
is it safe to compute with uintN_t and expect the result to be modulo 2^N?
Yes. But be sure to make the result of type uintN_t and avoid unexpected usual integer promotions.
#include <inttypes.h>
uint8_t a = 0x01, b = 0xff;
uint8_t diff = a - b;
printf("0x%02x\n", (unsigned) diff);
printf("0x%02" PRIx8 "\n", diff);
a-b in the printf line is evaluated after a and b are promoted to int. Also, the value is being treated as an unsigned int due to the use of %x in the format specifier by your run time environment.
It's equivalent to:
int a1 = a;
int b1 = b;
int x = a1 - b1;
printf("0x%02x 0x%02x 0x%02x\n", x, c, minWrap(a,b));
Section 6.3.1.8 Usual arithmetic conversions of the C99 standard has more details.
In theory use of an int when an unsigned int is expected in printf is cause for undefined behavior. A lenient run time environment, like you have, treats the int as an unsigned int and proceeds to print the value.
From 6.3.1.8/1 of the standard concerning integer conversions:
The integer promotions are performed on both operands. Then the following rules are applied to the promoted operands
If both operands have the same type, then no further conversion is
needed.
Otherwise, if both operands have signed integer types or both have
unsigned integer types, the operand with the type of lesser integer
conversion rank is converted to the type of the operand with greater
rank.
Otherwise, if the operand that has unsigned integer type has rank
greater or equal to the rank of the type of the other operand, then
the operand with signed integer type is converted to the type of the
operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can
represent all of the values of the type of the operand with unsigned
integer type, then the operand with unsigned integer type is converted
to the type of the operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type
corresponding to the type of the operand with signed integer type.
In this case, the wrap-around is well defined. In the expression a-b, because both operands are of type unsigned char, they are first promoted to int and the operation is performed. If this value was assigned to an unsigned char, it would be properly truncated. However, you're passing this value to printf with a %x format specifier which expects an unsigned int. To display it correctly, use %hhx which expects an unsigned char.
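The difference between the promoted result and the wrapped result can be sketched as below. The function names are hypothetical and the fixed-width types are assumed to exist on the target. The last function illustrates a related caveat for wider types: uint16_t operands also promote to int when int is wider than 16 bits, so a product such as 0xFFFF * 0xFFFF can overflow a 32-bit int unless one operand is widened to an unsigned type first.

```c
#include <stdint.h>

/* The subtraction is carried out in int after promotion,
 * so the result can be negative. */
int sub_promoted(uint8_t a, uint8_t b)
{
    return a - b;
}

/* Converting the int result back to uint8_t reduces it modulo 256. */
uint8_t sub_wrapped(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}

/* Widening one operand to uint32_t keeps the multiplication unsigned;
 * the final conversion reduces the product modulo 65536. */
uint16_t mul_wrapped_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);
}
```

For a = 0x01 and b = 0xff, sub_promoted gives -254 while sub_wrapped gives 0x02, which matches the first two columns of the output in the question once the first is reinterpreted as unsigned.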

Bit shifting a byte by more than 8 bit

In here
When converting from bytes buffer back to unsigned long int:
unsigned long int anotherLongInt;
anotherLongInt = ( (byteArray[0] << 24)
+ (byteArray[1] << 16)
+ (byteArray[2] << 8)
+ (byteArray[3] ) );
where byteArray is declared as unsigned char byteArray[4];
Question:
I thought byteArray[1] would be just one unsigned char (8 bit). When left-shifting by 16, wouldn't that shift all the meaningful bits out and fill the entire byte with 0? Apparently it is not 8 bit. Perhaps it's shifting the entire byteArray which is a consecutive 4 byte? But I don't see how that works.
In that arithmetic context byteArray[0] is promoted to either int or unsigned int, so the shift is legal and maybe even sensible (I like to deal only with unsigned types when doing bitwise stuff).
6.5.7 Bitwise shift operators
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand.
And integer promotions:
6.3.1.1
If an int can represent all values of the original type the value is converted to an int;
otherwise, it is converted to an unsigned int. These are called the integer promotions.
The unsigned chars are implicitly converted to int when shifting. I'm not sure exactly which type they are converted to; I think that depends on the platform and the compiler. To get what you intend, it is safer to cast the bytes explicitly, which also makes the code more portable and lets the reader immediately see what you intend to do:
unsigned long int anotherLongInt;
anotherLongInt = ( ((unsigned long)byteArray[0] << 24)
+ ((unsigned long)byteArray[1] << 16)
+ ((unsigned long)byteArray[2] << 8)
+ ((unsigned long)byteArray[3] ) );
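The same idea can be sketched as a pair of helpers that go in both directions, here with uint32_t instead of unsigned long so the width is explicit. The names store_be32 and load_be32 are hypothetical, and uint32_t is assumed to be available.

```c
#include <stdint.h>

/* Split a 32-bit value into four bytes, most significant byte first. */
void store_be32(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Reassemble the value; each byte is widened to uint32_t before the
 * shift so no bits are lost and no signed arithmetic is involved. */
uint32_t load_be32(const unsigned char in[4])
{
    return (uint32_t)in[0] << 24
         | (uint32_t)in[1] << 16
         | (uint32_t)in[2] << 8
         | (uint32_t)in[3];
}
```

Storing a value with store_be32 and reading it back with load_be32 should return the original value.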

c standard and bitshifts

This question was first inspired by the (unexpected) results of this code:
uint16_t t16 = 0;
uint8_t t8 = 0x80;
uint8_t t8_res;
t16 = (t8 << 1);
t8_res = (t8 << 1);
printf("t16: %x\n", t16); // Expect 0, get 0x100
printf(" t8: %x\n", t8_res); // Expect 0, get 0
But it turns out this makes sense:
6.5.7 Bitwise shift operators
Constraints
2 Each of the operands shall have integer type
Thus the originally confused line is equivalent to:
t16 = (uint16_t) (((int) t8) << 1);
A little non-intuitive IMHO, but at least well-defined.
Ok, great, but then we do:
{
uint64_t t64 = 1;
t64 <<= 31;
printf("t64: %lx\n", t64); // Expect 0x80000000, get 0x80000000
t64 <<= 31;
printf("t64: %lx\n", t64); // Expect 0x0, get 0x4000000000000000
}
// edit: following the same literal argument as above, the following should be equivalent:
t64 = (uint64_t) (((int) t64) << 31);
// hence my confusion / expectation [end_edit]
Now, we get the intuitive result, but not what would be derived from my (literal) reading of the standard. When / how does this "further automatic type promotion" take place? Or is there a limitation elsewhere that a type can never be demoted (that would make sense?), in that case, how do the promotion rules apply for:
uint32_t << uint64_t
Since the standard does say both arguments are promoted to int; should both arguments be promoted to the same type here?
// edit:
More specifically, what should the result of:
uint32_t t32 = 1;
uint64_t t64_one = 1;
uint64_t t64_res;
t64_res = t32 << t64_one;
// end edit
The answer to the above question is resolved when we recognize that the spec does not demand a promotion to int specifically, rather to an integer type, which uint64_t qualifies as.
// CLARIFICATION EDIT:
Ok, but now I am confused again. Specifically, if uint8_t is an integer type, then why is it being promoted to int at all? It does not seem to be related to the constant int 1, as the following exercise demonstrates:
{
uint16_t t16 = 0;
uint8_t t8 = 0x80;
uint8_t t8_one = 1;
uint8_t t8_res;
t16 = (t8 << t8_one);
t8_res = (t8 << t8_one);
printf("t16: %x\n", t16);
printf(" t8: %x\n", t8_res);
}
t16: 100
t8: 0
Why is the (t8 << t8_one) expression being promoted if uint8_t is an integer type?
--
For reference, I'm working from ISO/IEC 9899:TC9, WG14/N1124 May 6, 2005. If that's out of date and someone could also provide a link to a more recent copy, that'd be appreciated as well.
I think the source of your confusion might be that the following two statements are not equivalent:
Each of the operands shall have integer type
Each of the operands shall have int type
uint64_t is an integer type.
The constraint in §6.5.7 that "Each of the operands shall have integer type." is a constraint that means you cannot use the bitwise shift operators on non-integer types like floating point values or pointers. It does not cause the effect you are noting.
The part that does cause the effect is in the next paragraph:
3. The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand.
The integer promotions are described in §6.3.1.1:
2. The following may be used in an expression wherever an int
or unsigned int may be used:
An object or expression with an integer type whose integer conversion rank is less than or equal to the rank of int and
unsigned int.
A bit-field of type _Bool, int, signed int, or unsigned int.
If an int can represent all values of the original type, the value
is converted to an int; otherwise, it is converted to an unsigned
int. These are called the integer promotions. All other types are
unchanged by the integer promotions.
uint8_t has a lesser rank than int, so the value is converted to an int (since we know that an int must be able to represent all the values of uint8_t, given the requirements on the ranges of those two types).
The ranking rules are complex, but they guarantee that a type with a higher rank cannot have a lesser precision. This means, in effect, that types cannot be "demoted" to a type with lesser precision by the integer promotions (it is possible for uint64_t to be promoted to int or unsigned int, but only if the range of the type is at least that of uint64_t).
In the case of uint32_t << uint64_t, the rule that kicks in is "The type of the result is that of the promoted left operand". So we have a few possibilities:
If int is at least 33 bits, then uint32_t will be promoted to int and the result will be int;
If int is less than 33 bits and unsigned int is at least 32 bits, then uint32_t will be promoted to unsigned int and the result will be unsigned int;
If unsigned int is less than 32 bits then uint32_t will be unchanged and the result will be uint32_t.
On today's common desktop and server implementations, int and unsigned int are usually 32 bits, and so the second possibility will occur (uint32_t is promoted to unsigned int). In the past it was common for int / unsigned int to be 16 bits, and the third possibility would occur (uint32_t left unpromoted).
The result of your example:
uint32_t t32 = 1;
uint64_t t64_one = 1;
uint64_t t64_res;
t64_res = t32 << t64_one;
Will be the value 2 stored into t64_res. Note though that this is not affected by the fact that the result of the expression is not uint64_t. An example of an expression that would be affected is:
uint32_t t32 = 0xFF000;
uint64_t t64_shift = 16;
uint64_t t64_res;
t64_res = t32 << t64_shift;
The result here is 0xf0000000.
Note that although the details are fairly intricate, you can boil it all down to a fairly simple rule that you should keep in mind:
In C, arithmetic is never done in types narrower than int /
unsigned int.
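The 0xFF000 example can be sketched as two functions, one shifting in the type of the left operand and one widening it first. The function names are hypothetical, and the sketch assumes int and unsigned int are 32 bits, so that uint32_t is not promoted to anything wider.

```c
#include <stdint.h>

/* Assumes 32-bit int: the shift is carried out in 32 bits because the
 * result type follows the (promoted) left operand, not the right one.
 * Bits shifted past bit 31 are lost before the widening to uint64_t. */
uint64_t shift_in_left_type(uint32_t value, uint64_t count)
{
    return value << count;
}

/* Widening the left operand first makes the shift happen in 64 bits. */
uint64_t shift_widened(uint32_t value, uint64_t count)
{
    return (uint64_t)value << count;
}
```

Under that assumption, shift_in_left_type(0xFF000, 16) gives 0xF0000000, while shift_widened(0xFF000, 16) gives 0xFF0000000.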
You found the wrong rule in the standard :( The relevant rule is something like "the usual integer type promotions apply". This is what hits you in the first example. If an integer type like uint8_t has a rank that is smaller than int, it is promoted to int. uint64_t does not have a rank smaller than int or unsigned, so no promotion is performed and the << operator is applied to the uint64_t variable.
Edit: All integer types smaller than int are promoted for arithmetic. This is just a fact of life :) Whether or not uint32_t is promoted depends on the platform, because it might have the same rank or higher than int (not promoted) or a smaller rank (promoted).
Concerning the << operator, the type of the right operand is not really important; what counts for the number of bits is the left one (with the above rules). More important for the right one is its value. It mustn't be negative, and it must be less than the width of the (promoted) left operand.

Resources