Bit shifting a byte by more than 8 bits - C

In the code here, when converting from a byte buffer back to an unsigned long int:
unsigned long int anotherLongInt;
anotherLongInt = ( (byteArray[0] << 24)
                 + (byteArray[1] << 16)
                 + (byteArray[2] << 8)
                 + (byteArray[3] ) );
where byteArray is declared as unsigned char byteArray[4];
Question:
I thought byteArray[1] would be just one unsigned char (8 bit). When left-shifting by 16, wouldn't that shift all the meaningful bits out and fill the entire byte with 0? Apparently it is not 8 bit. Perhaps it's shifting the entire byteArray which is a consecutive 4 byte? But I don't see how that works.

In that arithmetic context byteArray[0] is promoted to either int or unsigned int, so the shift is legal and maybe even sensible (I like to deal only with unsigned types when doing bitwise stuff).
6.5.7 Bitwise shift operators
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand.
And integer promotions:
6.3.1.1
If an int can represent all values of the original type the value is converted to an int;
otherwise, it is converted to an unsigned int. These are called the integer promotions.

The unsigned chars are implicitly converted to ints when shifting. I'm not sure exactly what type they are converted to; I think that depends on the platform and the compiler. To get what you intend, it is safer to cast the bytes explicitly; that also makes the code more portable, and the reader immediately sees what you intend to do:
unsigned long int anotherLongInt;
anotherLongInt = ( ((unsigned long)byteArray[0] << 24)
                 + ((unsigned long)byteArray[1] << 16)
                 + ((unsigned long)byteArray[2] << 8)
                 + ((unsigned long)byteArray[3] ) );
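For instance, a minimal round-trip sketch (the 0xDEADBEEF test value and the big-endian packing order are just assumptions for illustration):

#include <stdio.h>

int main(void)
{
    unsigned long value = 0xDEADBEEFUL;
    unsigned char byteArray[4];

    /* Pack big-endian (most significant byte first), as in the question. */
    byteArray[0] = (value >> 24) & 0xFF;
    byteArray[1] = (value >> 16) & 0xFF;
    byteArray[2] = (value >> 8) & 0xFF;
    byteArray[3] = value & 0xFF;

    unsigned long anotherLongInt =
          ((unsigned long)byteArray[0] << 24)
        + ((unsigned long)byteArray[1] << 16)
        + ((unsigned long)byteArray[2] << 8)
        + ((unsigned long)byteArray[3]);

    printf("0x%lx\n", anotherLongInt);   /* prints 0xdeadbeef */
    return 0;
}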

Related

Is there a generic "isolate a single byte" bit mask for all systems, irrespective of CHAR_BIT?

If CHAR_BIT == 8 on your target system (most cases), it's very easy to mask out a single byte:
unsigned char lsb = foo & 0xFF;
However, there are a few systems and C implementations out there where CHAR_BIT is neither 8 nor a multiple thereof. Since the C standard only mandates a minimum range for char values, there is no guarantee that masking with 0xFF will isolate an entire byte for you.
I've searched around trying to find information about a generic "byte mask", but so far haven't found anything.
There is always the O(n) solution:
unsigned char mask = 1;
size_t i;
for (i = 0; i < CHAR_BIT; i++)
{
    mask |= (mask << i);
}
However, I'm wondering if there is any O(1) macro or line of code somewhere that can accomplish this, given how important this task is in many system-level programming scenarios.
The easiest way to extract an unsigned char from an integer value is simply to cast it to unsigned char:
(unsigned char) SomeInteger
Per C 2018 6.3.1.3 2, the result is the remainder of SomeInteger modulo UCHAR_MAX+1. (This is a non-negative remainder; it is always adjusted to be greater than or equal to zero and less than UCHAR_MAX+1.)
Assigning to an unsigned char has the same effect, as assignment performs a conversion (and initializing works too):
unsigned char x;
…
x = SomeInteger;
If you want an explicit bit mask, UCHAR_MAX is such a mask. This is so because unsigned integers are pure binary in C, and the maximum value of an unsigned integer has all value bits set. (Unsigned integers in general may also have padding bits, but unsigned char may not.)
One difference can occur in very old or esoteric systems: If a signed integer is represented with sign-and-magnitude or one’s complement instead of today’s ubiquitous two’s complement, then the results of extracting an unsigned char from a negative value will differ depending on whether you use the conversion method or the bit-mask method.
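A small demonstration of the two methods (illustrative only; on a two's complement machine with CHAR_BIT == 8 both print 255 for an input of -1):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int SomeInteger = -1;

    unsigned char by_cast = (unsigned char)SomeInteger;  /* conversion method */
    unsigned char by_mask = SomeInteger & UCHAR_MAX;     /* bit-mask method */

    /* On two's complement these agree; on sign-magnitude or one's
       complement machines they could differ, as noted above. */
    printf("cast: %u  mask: %u\n", (unsigned)by_cast, (unsigned)by_mask);
    return 0;
}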
On review (after accept), @Eric Postpischil's point about UCHAR_MAX in his answer makes for a preferable mask.
#define BYTE_MASK UCHAR_MAX
The value UCHAR_MAX shall equal 2^CHAR_BIT − 1. C11dr §5.2.4.2.1 2
An unsigned char cannot have padding, so UCHAR_MAX is always the all-bits-set pattern in a character type, and hence in a C "byte".
some_signed & some_unsigned is a problem on non-2's-complement machines, as some_signed is converted to unsigned before the &, thus changing the bit pattern for negative values. To avoid this, the all-ones mask needs to be signed when masking signed types. This is usually the case with foo & UINT_MAX.
Conclusion
Assume: foo is of some integer type.
If only 2's complement is of concern, use a cast - it does not change the bit pattern.
unsigned char lsb = (unsigned char) foo;
Otherwise, with any integer encoding and UCHAR_MAX <= INT_MAX:
unsigned char lsb = foo & UCHAR_MAX;
Otherwise TBD
Shifting an unsigned 1 by CHAR_BIT and then subtracting 1 will work even on esoteric non-2's-complement systems (@Some programmer dude). Be sure to use unsigned math.
On such systems, this preserves the bit pattern, unlike an (unsigned char) cast on negative integers.
unsigned char mask = (1u << CHAR_BIT) - 1u;
unsigned char lsb = foo & mask;
Or make a define
#define BYTE_MASK ((1u << CHAR_BIT) - 1u)
unsigned char lsb = foo & BYTE_MASK;
To also handle those pesky cases where UINT_MAX == UCHAR_MAX, in which 1u << CHAR_BIT would be UB, shift in 2 steps.
#define BYTE_MASK (((1u << (CHAR_BIT - 1)) << 1u) - 1u)
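A sketch of how that two-step BYTE_MASK might be used (the 0x1234 test value is arbitrary):

#include <limits.h>
#include <stdio.h>

/* Two-step shift avoids UB even when unsigned int is only CHAR_BIT wide. */
#define BYTE_MASK (((1u << (CHAR_BIT - 1)) << 1u) - 1u)

int main(void)
{
    int foo = 0x1234;
    unsigned char lsb = foo & BYTE_MASK;

    printf("lsb = 0x%x\n", (unsigned)lsb);   /* 0x34 when CHAR_BIT == 8 */
    return 0;
}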
UCHAR_MAX does not have to be equal to (1U << CHAR_BIT) - 1U
You actually need to AND with that calculated value, not with UCHAR_MAX:
value & ((1U << CHAR_BIT) - 1U).
Many real implementations (for example TI's) define UCHAR_MAX as 255 and emit code that behaves like code on machines with 8-bit bytes. This is done to preserve compatibility with code written for other targets.
For example
unsigned char x;
x++;
will generate code that checks whether the value of x is larger than UCHAR_MAX and, if so, zeroes x.

Sign extension in C

I'm looking here to understand sign extension:
http://www.shrubbery.net/solaris9ab/SUNWdev/SOL64TRANS/p8.html
#include <stdio.h>

struct foo {
    unsigned int base:19, rehash:13;
};

int main(int argc, char *argv[])
{
    struct foo a;
    unsigned long addr;

    a.base = 0x40000;
    addr = a.base << 13;                  /* Sign extension here! */
    printf("addr 0x%lx\n", addr);
    addr = (unsigned int)(a.base << 13);  /* No sign extension here! */
    printf("addr 0x%lx\n", addr);
    return 0;
}
They claim this:
------------------ 64 bit:
% cc -o test64 -xarch=v9 test.c
% ./test64
addr 0xffffffff80000000
addr 0x80000000
%
------------------ 32 bit:
% cc -o test32 test.c
% ./test32
addr 0x80000000
addr 0x80000000
%
I have 3 questions:
1. What is sign extension? Yes, I read the wiki, but I didn't understand it: when type promotion occurs, what's going on with sign extension?
2. Why the ffff... in the 64-bit case (referring to addr)?
3. When I do the type cast, why is there no sign extension?
EDIT:
4. Why is this not an issue on a 32-bit system?
The left operand of the << operator undergoes standard promotions, so in your case it is promoted to int -- so far so good. Next, the int of value 0x40000 is multiplied by 2^13, which causes overflow and thus undefined behaviour. However, we can see what's happening: the value of the expression is now simply INT_MIN, the smallest representable int. Finally, when you convert that to an unsigned 64-bit integer, the usual modular arithmetic rules entail that the resulting value is 0xffffffff80000000. Similarly, converting to an unsigned 32-bit integer gives the value 0x80000000.
To perform the operation on unsigned values, you need to control the conversions with a cast:
(unsigned int)(a.base) << 13
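A minimal sketch of that fix in context (assuming the same 19-bit bit-field and an LP64 system, as in the question):

#include <stdio.h>

struct foo {
    unsigned int base:19, rehash:13;
};

int main(void)
{
    struct foo a;
    a.base = 0x40000;

    /* The cast makes the left operand unsigned, so the shift and the
       widening to unsigned long both stay in unsigned arithmetic. */
    unsigned long addr = (unsigned int)a.base << 13;
    printf("addr 0x%lx\n", addr);   /* 0x80000000 on LP64 */
    return 0;
}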
a.base << 13
The bitwise shift operator performs the integer promotions on each of its operands.
So this is equivalent to:
(int) a.base << 13
which is a negative value of type int.
Then:
addr = (int) a.base << 13;
converts this signed negative value ((int) a.base << 13) to the type of addr which is unsigned long through integer conversions.
The integer conversion rules (C99, 6.3.1.3p2) say this is the same as doing:
addr = (long) ((int) a.base << 13);
The conversion to long performs the sign extension here because ((int) a.base << 13) is a negative signed number.
On the other case, with a cast you have something equivalent to:
addr = (unsigned long) (unsigned int) ((int) a.base << 13);
so no sign extension is performed in your second case because (unsigned int) ((int) a.base << 13) is an unsigned (and positive of course) value.
EDIT: as KerrekSB mentioned in his answer, a.base << 13 is actually not representable in an int (I assume 32-bit int), so this expression invokes undefined behavior and the implementation has the right to behave in any other way, for example crashing.
For information, this is definitely not portable but if you are using gcc, gcc does not consider a.base << 13 here as undefined behavior. From gcc documentation:
"GCC does not use the latitude given in C99 only to treat certain aspects of signed '<<' as undefined, but this is subject to change."
in http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html
This is more of a question about bit-fields. Note that if you change the struct to
struct foo {
unsigned int base, rehash;
};
you get very different results.
As @Jens Gustedt noted in 'Type of unsigned bit-fields: int or unsigned int', the specification says:
If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int;
Even though you've specified that base is unsigned, the compiler converts it to a signed int when you read it. That's why you don't get sign extension when you cast it to unsigned int.
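A sketch contrasting the two declarations (assuming 32-bit int and 64-bit long; the bit-field line relies on the overflowing signed shift discussed above, which is formally undefined but behaves this way on common compilers):

#include <stdio.h>

struct bf    { unsigned int base:19, rehash:13; };  /* bit-field version */
struct plain { unsigned int base,    rehash;    };  /* plain unsigned version */

int main(void)
{
    struct bf    a = { 0x40000, 0 };
    struct plain b = { 0x40000, 0 };

    /* a.base promotes to (signed) int, so the shifted value is negative
       and sign-extends when widened to unsigned long. */
    unsigned long addr1 = a.base << 13;

    /* b.base stays unsigned int after promotion, so no sign extension. */
    unsigned long addr2 = b.base << 13;

    printf("bit-field: 0x%lx\n", addr1);   /* 0xffffffff80000000 */
    printf("plain:     0x%lx\n", addr2);   /* 0x80000000 */
    return 0;
}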
Sign extension has to do with how negative numbers are represented in binary. The most common scheme is two's complement. In this scheme, -1 is represented in 32 bits as 0xFFFFFFFF, -2 is 0xFFFFFFFE, etc. So what should be done when we want to convert a 32-bit number to a 64-bit number, for example? If we convert 0xFFFFFFFF to 0x00000000FFFFFFFF, the numbers will have the same unsigned value (about 4 billion), but different signed values (-1 vs. 4 billion). On the other hand, if we convert 0xFFFFFFFF to 0xFFFFFFFFFFFFFFFF, the numbers will have the same signed value (-1) but different unsigned values. The former is called zero-extension (and is appropriate for unsigned numbers) and the latter is called sign-extension (and is appropriate for signed numbers). It's called "sign-extension" because the "sign bit" (the most significant, or left-most bit) is extended, or copied, to make the number wider.
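For instance, a small illustration of the two widenings using fixed-width types (not from the original question):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t s = -1;   /* bit pattern 0xFFFFFFFF in two's complement */

    /* Reinterpret as 32-bit unsigned first, then widen: upper bits become 0. */
    uint64_t zero_extended = (uint64_t)(uint32_t)s;

    /* Widen as a signed value first: the sign bit is copied into the upper bits. */
    uint64_t sign_extended = (uint64_t)(int64_t)s;

    printf("zero-extended: 0x%016" PRIX64 "\n", zero_extended);  /* 0x00000000FFFFFFFF */
    printf("sign-extended: 0x%016" PRIX64 "\n", sign_extended);  /* 0xFFFFFFFFFFFFFFFF */
    return 0;
}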
It took me a while and a lot of reading/testing.
Maybe my beginner's way of understanding what's going on will help you (as it helped me).
a.base = 0x40000 (1(0)x18) -> a 19-bit bit-field
addr = a.base << 13.
Any value a.base can hold, an int can hold too, so there is a conversion from the 19-bit unsigned int bit-field to a 32-bit signed integer (a.base is now (0)x13,1,(0)x18).
Now (a.base converted to signed int) << 13 results in 1(0)x31. Remember, it's a signed int now.
addr = (1(0)x31). addr is of unsigned long type (64 bit), so to do the assignment the right-hand value is converted to long int. The conversion from signed int to long int makes addr (1)x33,(0)x31.
And that's what's being printed after all of those conversions you weren't even aware of:
0xffffffff80000000.
The second line prints 0x80000000 because of that cast to (unsigned int) before the conversion to long int. When converting unsigned int to long int there is no sign bit, so the value is just filled with leading 0's to match the size, and that's all.
What's different with 32-bit is that during the conversion from 32-bit signed int to 32-bit unsigned long their sizes match, so no sign bits are added, so:
1(0)x31 will stay 1(0)x31
even after the conversion from int to long int (they have the same size; the value is interpreted differently but the bits are intact).
Quotation from your link:
Any code that makes this assumption must be changed to work for both
ILP32 and LP64. While an int and a long are both 32-bits in the ILP32
data model, in the LP64 data model, a long is 64-bits.

C standard and bitshifts

This question was first inspired by the (unexpected) results of this code:
uint16_t t16 = 0;
uint8_t t8 = 0x80;
uint8_t t8_res;
t16 = (t8 << 1);
t8_res = (t8 << 1);
printf("t16: %x\n", t16); // Expect 0, get 0x100
printf(" t8: %x\n", t8_res); // Expect 0, get 0
But it turns out this makes sense:
6.5.7 Bitwise shift operators
Constraints
2 Each of the operands shall have integer type
Thus the originally confusing line is equivalent to:
t16 = (uint16_t) (((int) t8) << 1);
A little non-intuitive IMHO, but at least well-defined.
Ok, great, but then we do:
{
    uint64_t t64 = 1;
    t64 <<= 31;
    printf("t64: %lx\n", t64); // Expect 0x80000000, get 0x80000000
    t64 <<= 31;
    printf("t64: %lx\n", t64); // Expect 0x0, get 0x4000000000000000
}
// edit: following the same literal argument as above, the following should be equivalent:
t64 = (uint64_t) (((int) t64) << 31);
// hence my confusion / expectation [end_edit]
Now, we get the intuitive result, but not what would be derived from my (literal) reading of the standard. When / how does this "further automatic type promotion" take place? Or is there a limitation elsewhere that a type can never be demoted (that would make sense?), in that case, how do the promotion rules apply for:
uint32_t << uint64_t
Since the standard does say both arguments are promoted to int, should both arguments be promoted to the same type here?
// edit:
More specifically, what should the result of:
uint32_t t32 = 1;
uint64_t t64_one = 1;
uint64_t t64_res;
t64_res = t32 << t64_one;
// end edit
The answer to the above question is resolved when we recognize that the spec does not demand a promotion to int specifically, rather to an integer type, which uint64_t qualifies as.
// CLARIFICATION EDIT:
Ok, but now I am confused again. Specifically, if uint8_t is an integer type, then why is it being promoted to int at all? It does not seem to be related to the constant int 1, as the following exercise demonstrates:
{
    uint16_t t16 = 0;
    uint8_t t8 = 0x80;
    uint8_t t8_one = 1;
    uint8_t t8_res;
    t16 = (t8 << t8_one);
    t8_res = (t8 << t8_one);
    printf("t16: %x\n", t16);
    printf(" t8: %x\n", t8_res);
}
t16: 100
t8: 0
Why is the (t8 << t8_one) expression being promoted if uint8_t is an integer type?
--
For reference, I'm working from ISO/IEC 9899:TC9, WG14/N1124 May 6, 2005. If that's out of date and someone could also provide a link to a more recent copy, that'd be appreciated as well.
I think the source of your confusion might be that the following two statements are not equivalent:
Each of the operands shall have integer type
Each of the operands shall have int type
uint64_t is an integer type.
The constraint in §6.5.7 that "Each of the operands shall have integer type." is a constraint that means you cannot use the bitwise shift operators on non-integer types like floating point values or pointers. It does not cause the effect you are noting.
The part that does cause the effect is in the next paragraph:
3. The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand.
The integer promotions are described in §6.3.1.1:
2. The following may be used in an expression wherever an int
or unsigned int may be used:
An object or expression with an integer type whose integer conversion rank is less than or equal to the rank of int and
unsigned int.
A bit-field of type _Bool, int, signed int, or unsigned int.
If an int can represent all values of the original type, the value
is converted to an int; otherwise, it is converted to an unsigned
int. These are called the integer promotions. All other types are
unchanged by the integer promotions.
uint8_t has a lesser rank than int, so the value is converted to an int (since we know that an int must be able to represent all the values of uint8_t, given the requirements on the ranges of those two types).
The ranking rules are complex, but they guarantee that a type with a higher rank cannot have a lesser precision. This means, in effect, that types cannot be "demoted" to a type with lesser precision by the integer promotions (it is possible for uint64_t to be promoted to int or unsigned int, but only if the range of the type is at least that of uint64_t).
In the case of uint32_t << uint64_t, the rule that kicks in is "The type of the result is that of the promoted left operand". So we have a few possibilities:
If int is at least 33 bits, then uint32_t will be promoted to int and the result will be int;
If int is less than 33 bits and unsigned int is at least 32 bits, then uint32_t will be promoted to unsigned int and the result will be unsigned int;
If unsigned int is less than 32 bits then uint32_t will be unchanged and the result will be uint32_t.
On today's common desktop and server implementations, int and unsigned int are usually 32 bits, and so the second possibility will occur (uint32_t is promoted to unsigned int). In the past it was common for int / unsigned int to be 16 bits, and the third possibility would occur (uint32_t left unpromoted).
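If you want to see which of those cases applies on your implementation, a C11 _Generic probe is one way to check (illustrative only; the thread itself predates C11):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t t32 = 1;

    /* The type of (t32 << 0) is the type of the promoted left operand. */
    puts(_Generic(t32 << 0,
                  int:          "uint32_t promotes to int",
                  unsigned int: "uint32_t promotes to unsigned int",
                  default:      "uint32_t is left unpromoted (or wider)"));
    return 0;
}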
The result of your example:
uint32_t t32 = 1;
uint64_t t64_one = 1;
uint64_t t64_res;
t64_res = t32 << t64_one;
Will be the value 2 stored into t64_res. Note though that this is not affected by the fact that the result of the expression is not uint64_t - an example of an expression that would be affected is:
uint32_t t32 = 0xFF000;
uint64_t t64_shift = 16;
uint64_t t64_res;
t64_res = t32 << t64_shift;
The result here is 0xf0000000.
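If the full 64-bit result of the shift is wanted in that last example, casting the left operand before the shift avoids the truncation (a sketch, not part of the original answer):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t t32 = 0xFF000;
    uint64_t t64_shift = 16;

    uint64_t truncated = t32 << t64_shift;            /* shift done in 32 bits */
    uint64_t widened   = (uint64_t)t32 << t64_shift;  /* shift done in 64 bits */

    printf("truncated: 0x%" PRIX64 "\n", truncated);  /* 0xF0000000 */
    printf("widened:   0x%" PRIX64 "\n", widened);    /* 0xFF0000000 */
    return 0;
}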
Note that although the details are fairly intricate, you can boil it all down to a fairly simple rule that you should keep in mind:
In C, arithmetic is never done in types narrower than int /
unsigned int.
You found the wrong rule in the standard :( The relevant one is something like "the usual integer promotions apply". This is what hits you in the first example. If an integer type like uint8_t has a rank that is smaller than int, it is promoted to int. uint64_t does not have a rank smaller than int or unsigned, so no promotion is performed and the << operator is applied to the uint64_t variable.
Edit: All integer types smaller than int are promoted for arithmetic. This is just a fact of life :) Whether or not uint32_t is promoted depends on the platform, because it might have the same rank as int or a higher rank (not promoted), or a smaller rank (promoted).
Concerning the << operator, the type of the right operand is not really important; what counts for the number of bits is the left one (with the above rules). What matters for the right operand is its value: it mustn't be negative, and it mustn't equal or exceed the width of the (promoted) left operand.

Sign extension with unsigned long long

We found some strange values being produced, a small test case is below.
This prints "FFFFFFFFF9A64C2A" . Meaning the unsigned long long seems to have been sign extended.
But why ?
All the types below are unsigned, so what's doing the sign extension ? The expected output
would be "F9A64C2A".
#include <stdio.h>
int main(int argc, char *argv[])
{
    unsigned char a[] = {42, 76, 166, 249};
    unsigned long long ts;
    ts = a[0] | a[1] << 8U | a[2] << 16U | a[3] << 24U;
    printf("%llX\n", ts);
    return 0;
}
In the expression a[3] << 24U, a[3] has type unsigned char. Now, the "integer promotion" converts it to int because:
The following may be used in an expression wherever an int or unsigned int may
be used:
[...]
If an int can represent all values of the original type, the value is converted to
an int;
otherwise, it is converted to an unsigned int.
((draft) ISO/IEC 9899:1999, 6.3.1.1 2)
Please note also that the shift operators (unlike most other operators) do not perform the "usual arithmetic conversions" converting both operands to a common type. But
The type of the result is that of the promoted left operand.
(6.5.7 3)
On a 32 bit platform, 249 << 24 = 4177526784 interpreted as an int has its sign bit set.
Just changing to
ts = a[0] | a[1] << 8 | a[2] << 16 | (unsigned)a[3] << 24;
fixes the issue (the suffix U for the constants has no impact).
ts = ((unsigned long long)a[0]) |
     ((unsigned long long)a[1] << 8U) |
     ((unsigned long long)a[2] << 16U) |
     ((unsigned long long)a[3] << 24U);
Casting prevents converting intermediate results to default int type.
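For completeness, a sketch of the question's program with those casts applied (same byte values; the expected output is F9A64C2A):

#include <stdio.h>

int main(void)
{
    unsigned char a[] = {42, 76, 166, 249};
    unsigned long long ts;

    /* Each byte is widened to unsigned long long before shifting,
       so no intermediate result is ever a (signed) int. */
    ts = ((unsigned long long)a[0])
       | ((unsigned long long)a[1] << 8)
       | ((unsigned long long)a[2] << 16)
       | ((unsigned long long)a[3] << 24);

    printf("%llX\n", ts);   /* F9A64C2A */
    return 0;
}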
Some of the shifted a[i], when automatically converted from unsigned char to int, produce sign-extended values.
This is in accord with section 6.3.1 Arithmetic operands, subsection 6.3.1.1 Boolean, characters, and integers, of C draft standard N1570, which reads, in part, "2. The following may be used in an expression wherever an int or unsigned int may be used: ... — An object or expression with an integer type (other than int or unsigned int)
whose integer conversion rank is less than or equal to the rank of int and unsigned int. ... If an int can represent all values of the original type ..., the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. ... 3. The integer promotions preserve value including sign."
See eg www.open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf
You could use code like the following, which works ok:
int i;
for (i=3, ts=0; i>=0; --i) ts = (ts<<8) | a[i];

Different result about bitwise operator in C

In order to implement a logical right shift in C, I searched the web and found the following code:
int a, b, c;
int x = -100;
a = (unsigned) x >> 2;
b = (0xffffffff & x) >> 2;
c = (0x0 | x ) >> 2;
Now both a and b hold the logical right shift result (1006632960), but c is still the arithmetic shift result (-25). Could somebody explain why? Thanks.
b = (0xffffffff & x) >> 2;
Assuming that your ints are 32 bits, the type of the literal constant 0xffffffff is unsigned int, because it is too large to fit in a plain int. The &, then, is between an unsigned int and an int, in which case the unsigned type wins by definition. The shift therefore happens on unsigned; thus it shifts in 0 bits from the left.
c = (0x0 | x ) >> 2;
The type of 0x0 defaults to int because it is small enough to fit, so the bitwise or happens on ints, and so does the following shift. It is implementation defined what happens when you shift a signed integer right, but most compilers will produce an arithmetic shift that sign-extends.
(unsigned) x is of type unsigned int, so it gets a logical shift.
0xffffffff (assuming 32-bit int) is of type unsigned int, so (0xffffffff & x) is also of type unsigned int and gets a logical shift.
0x0 is of type int, so (0x0 | x) is of type int and gets an arithmetic shift (well, it is implementation-dependent).
It's all about the operand type of the operator >>. If it's signed, the right shift fills the MSBs with 1 if the operand was negative. If the operand is unsigned, the MSBs are always zero after a right shift.
In your first expression the operand is cast explicitly to unsigned.
In the second expression, (0xffffffff & x) is unsigned, because 0xffffffff definitely represents an unsigned integer (it does not fit in a signed int).
OTOH in the third example 0x0 is signed (this is the default for integer constants), hence the whole operand (0x0 | x) is considered signed.
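Tying the answers together, a sketch of a small helper for a logical right shift on a 32-bit value (the name logical_shr32 is made up for illustration; it simply forces the operand to an unsigned type before shifting):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Shift the 32-bit pattern of x right by n, filling the top bits with zeros. */
static uint32_t logical_shr32(int32_t x, unsigned n)
{
    return (uint32_t)x >> n;
}

int main(void)
{
    int32_t x = -100;

    printf("%" PRIu32 "\n", logical_shr32(x, 2));  /* 1073741799 */
    printf("%" PRId32 "\n", x >> 2);               /* -25 with the usual arithmetic shift */
    return 0;
}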
