Today I have been trying to write a function that rotates a given 64-bit integer n bits to the right, or to the left if n is negative. Of course, bits rotated out of the integer should be rotated back in on the other side.
I have kept the function quite simple.
void rotate(uint64_t *i, int n)
{
    uint64_t one = 1;
    if (n > 0) {
        do {
            int storeBit = *i & one;
            *i = *i >> 1;
            if (storeBit == 1)
                *i |= 0x80000000000000;
            n--;
        } while (n > 0);
    }
}
possible inputs are:
uint64_t num = 0x2;
rotate(&num, 1); // num should be 0x1
rotate(&num, -1); // num should be 0x2, again
rotate(&num, 62); // num should be 0x8
Unfortunately, I could not figure it out. I was hoping someone could help me.
EDIT: The code is online now. Sorry it took a while; I had some difficulties with the editor. I have only done the rotation to the right; the rotation to the left is still missing.
uint64_t rotate(uint64_t v, int n) {
    n = n & 63U;
    if (n)
        v = (v >> n) | (v << (64 - n));
    return v;
}
gcc -O3 produces:
.cfi_startproc
andl $63, %esi
movq %rdi, %rdx
movq %rdi, %rax
movl %esi, %ecx
rorq %cl, %rdx
testl %esi, %esi
cmovne %rdx, %rax
ret
.cfi_endproc
not perfect, but reasonable.
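Note that the n & 63U step also takes care of the left-rotation case from the question: a negative count reduces modulo 64 to the equivalent right-rotation count (for example, -1 & 63 is 63, i.e. a left rotation by one). Here is a quick sketch exercising the inputs from the question; the main wrapper is mine, and since this interface is by value the calls don't chain, so the expected values differ from the in-place example:
#include <stdint.h>
#include <stdio.h>

uint64_t rotate(uint64_t v, int n)
{
    n = n & 63U;                    /* negative n wraps to the equivalent right-rotate count */
    if (n)
        v = (v >> n) | (v << (64 - n));
    return v;
}

int main(void)
{
    printf("%#llx\n", (unsigned long long)rotate(0x2, 1));   /* 0x1 */
    printf("%#llx\n", (unsigned long long)rotate(0x2, -1));  /* 0x4 (rotate left by 1) */
    printf("%#llx\n", (unsigned long long)rotate(0x2, 62));  /* 0x8 */
    return 0;
}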
int storeBit = *i & one;
On this line you are assigning a 64-bit unsigned integer to what is probably a 4-byte int. I think your problem is related to this; mixing widths like that makes the behaviour hard to reason about.
if(n > 0)
This does not handle negative n, so the left rotation never happens.
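One possible way (a sketch in the spirit of the original loop, not the fastest approach) to extend the in-place function so it also rotates left for negative n:
#include <stdint.h>

/* Rotate right for n > 0, left for n < 0, one bit at a time. */
void rotate(uint64_t *i, int n)
{
    while (n > 0) {                       /* rotate right */
        uint64_t low = *i & 1u;
        *i = (*i >> 1) | (low << 63);
        n--;
    }
    while (n < 0) {                       /* rotate left */
        uint64_t high = *i >> 63;
        *i = (*i << 1) | high;
        n++;
    }
}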
Here is a C function that adds an int to another, failing if overflow would happen:
#include <limits.h>

int safe_add(int *value, int delta) {
    if (*value >= 0) {
        if (delta > INT_MAX - *value) {
            return -1;
        }
    } else {
        if (delta < INT_MIN - *value) {
            return -1;
        }
    }
    *value += delta;
    return 0;
}
Unfortunately it is not optimized well by GCC or Clang:
safe_add(int*, int):
movl (%rdi), %eax
testl %eax, %eax
js .L2
movl $2147483647, %edx
subl %eax, %edx
cmpl %esi, %edx
jl .L6
.L4:
addl %esi, %eax
movl %eax, (%rdi)
xorl %eax, %eax
ret
.L2:
movl $-2147483648, %edx
subl %eax, %edx
cmpl %esi, %edx
jle .L4
.L6:
movl $-1, %eax
ret
This version with __builtin_add_overflow()
int safe_add(int *value, int delta) {
    int result;
    if (__builtin_add_overflow(*value, delta, &result)) {
        return -1;
    } else {
        *value = result;
        return 0;
    }
}
is optimized better:
safe_add(int*, int):
xorl %eax, %eax
addl (%rdi), %esi
seto %al
jo .L5
movl %esi, (%rdi)
ret
.L5:
movl $-1, %eax
ret
but I'm curious if there's a way without using builtins that will get pattern-matched by GCC or Clang.
The best one I came up with, if you don't have access to the overflow flag of the architecture, is to do things in unsigned. For all the bit arithmetic here, keep in mind that we are only interested in the highest bit, which is the sign bit when the values are interpreted as signed.
(All of this modulo sign errors; I didn't check it thoroughly, but I hope the idea is clear.)
#include <stdbool.h>
bool overadd(int a[static 1], int b) {
    unsigned A = a[0];
    unsigned B = b;
    // This computation will be done anyhow
    unsigned AB = A + B;
    // See if the sign bits are equal
    unsigned AeB = ~(A^B);
    unsigned AuAB = (A^AB);
    // The function result according to these should be:
    //
    // AeB \ AuAB | false | true
    //------------+-------+------
    //      false | false | false
    //      true  | false | true
    //
    // So the expression to compute from the sign bits is (AeB & AuAB)
    // This is INT_MAX
    unsigned M = -1U/2;
    bool ret = (AeB & AuAB) > M;
    if (!ret) a[0] += b;
    return ret;
}
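A quick way to sanity-check the sign-bit table above is to exercise overadd() at the boundaries; the small test driver below is my own addition (note the asserts have side effects, so compile without NDEBUG):
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

bool overadd(int a[static 1], int b);   /* the function above */

int main(void)
{
    int v = INT_MAX - 1;
    assert(!overadd(&v, 1) && v == INT_MAX);   /* no overflow, sum stored */
    assert(overadd(&v, 1) && v == INT_MAX);    /* overflow, value left untouched */

    v = INT_MIN;
    assert(overadd(&v, -1));                   /* overflow on the negative side */

    puts("overadd boundary checks passed");
    return 0;
}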
If you find a version of the addition that is free of UB, such as an atomic one, the generated assembly is even branch-free (though with a lock prefix):
#include <stdbool.h>
#include <stdatomic.h>
bool overadd(_Atomic(int) a[static 1], int b) {
    unsigned A = a[0];
    atomic_fetch_add_explicit(a, b, memory_order_relaxed);
    unsigned B = b;
    // This computation will be done anyhow
    unsigned AB = A + B;
    // See if the sign bits are equal
    unsigned AeB = ~(A^B);
    unsigned AuAB = (A^AB);
    // The function result according to these should be:
    //
    // AeB \ AuAB | false | true
    //------------+-------+------
    //      false | false | false
    //      true  | false | true
    //
    // So the expression to compute from the sign bits is (AeB & AuAB)
    // This is INT_MAX
    unsigned M = -1U/2;
    bool ret = (AeB & AuAB) > M;
    return ret;
}
So if we had such an operation, but even more "relaxed" this could improve the situation even further.
Take 3: if we use a special "cast" from the unsigned result to the signed one, this is now branch-free:
#include <stdbool.h>
#include <stdatomic.h>
bool overadd(int a[static 1], int b) {
    unsigned A = a[0];
    //atomic_fetch_add_explicit(a, b, memory_order_relaxed);
    unsigned B = b;
    // This computation will be done anyhow
    unsigned AB = A + B;
    // See if the sign bits are equal
    unsigned AeB = ~(A^B);
    unsigned AuAB = (A^AB);
    // The function result according to these should be:
    //
    // AeB \ AuAB | false | true
    //------------+-------+------
    //      false | false | false
    //      true  | false | true
    //
    // So the expression to compute from the sign bits is (AeB & AuAB)
    // This is INT_MAX
    unsigned M = -1U/2;
    unsigned res = (AeB & AuAB);
    signed N = M-1;
    N = -N - 1;
    a[0] = ((AB > M) ? -(int)(-AB) : ((AB != M) ? (int)AB : N));
    return res > M;
}
The situation with signed operations is much worse than with unsigned ones, and I see only one pattern for signed addition, only for clang and only when a wider type is available:
#include <limits.h>

int safe_add(int *value, int delta)
{
    long long result = (long long)*value + delta;
    if (result > INT_MAX || result < INT_MIN) {
        return -1;
    } else {
        *value = result;
        return 0;
    }
}
clang gives exactly the same asm as with __builtin_add_overflow:
safe_add: # #safe_add
addl (%rdi), %esi
movl $-1, %eax
jo .LBB1_2
movl %esi, (%rdi)
xorl %eax, %eax
.LBB1_2:
retq
Otherwise, the simplest solution I can think of is this (with the interface as Jens used):
_Bool overadd(int a[static 1], int b)
{
    // compute the unsigned sum
    unsigned u = (unsigned)a[0] + b;
    // convert it to signed
    int sum = u <= -1u / 2 ? (int)u : -1 - (int)(-1 - u);
    // see if it overflowed or not
    _Bool overflowed = (b > 0) != (sum > a[0]);
    // return the results
    a[0] = sum;
    return overflowed;
}
gcc and clang generate very similar asm. gcc gives this:
overadd:
movl (%rdi), %ecx
testl %esi, %esi
setg %al
leal (%rcx,%rsi), %edx
cmpl %edx, %ecx
movl %edx, (%rdi)
setl %dl
xorl %edx, %eax
ret
We want to compute the sum in unsigned arithmetic, so unsigned has to be able to represent every value of int without any two of them mapping to the same value. To convert the result back from unsigned to int easily, the opposite is useful too. Overall, two's complement is assumed.
On all popular platforms I think we can convert from unsigned to int by a simple assignment like int sum = u; but, as Jens mentioned, even the latest variant of the C2x standard allows it to raise a signal. The next most natural way is to do something like that: *(unsigned *)&sum = u; but non-trap variants of padding apparently could differ for signed and unsigned types. So the example above goes the hard way. Fortunately, both gcc and clang optimize this tricky conversion away.
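In isolation, that tricky conversion amounts to something like the following (my own sketch of the same idea; the helper name is mine):
#include <limits.h>

/* Convert an unsigned value that encodes a two's-complement int back to int
 * without relying on the implementation-defined out-of-range conversion. */
static int unsigned_to_int(unsigned u)
{
    /* values up to INT_MAX convert directly; larger values encode negatives */
    return u <= (unsigned)INT_MAX ? (int)u : -1 - (int)(UINT_MAX - u);
}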
P.S. The two variants above could not be compared directly as they have different behavior. The first one follows the original question and doesn't clobber the *value in case of overflow. The second one follows the answer from Jens and always clobbers the variable pointed to by the first parameter but it's branchless.
The best version I can come up with is:
int safe_add(int *value, int delta) {
    long long t = *value + (long long)delta;
    if (t != ((int)t))
        return -1;
    *value = (int) t;
    return 0;
}
which produces:
safe_add(int*, int):
movslq %esi, %rax
movslq (%rdi), %rsi
addq %rax, %rsi
movslq %esi, %rax
cmpq %rsi, %rax
jne .L3
movl %eax, (%rdi)
xorl %eax, %eax
ret
.L3:
movl $-1, %eax
ret
I could get the compiler to use the sign flag by assuming (and asserting) a two's complement representation without padding bits. Such implementations should yield the required behaviour in the line annotated by a comment, although I can't find a positive formal confirmation of this requirement in the standard (and there probably isn't any).
Note that the following code only handles positive integer addition, but can be extended.
#include <limits.h>

int safe_add(int* lhs, int rhs) {
    _Static_assert(-1 == ~0, "integers are not two's complement");
    _Static_assert(
        1u << (sizeof(int) * CHAR_BIT - 1) == (unsigned) INT_MIN,
        "integers have padding bits"
    );

    unsigned value = *lhs;
    value += rhs;
    if ((int) value < 0) return -1; // impl. def., 6.3.1.3/3
    *lhs = value;
    return 0;
}
This yields on both clang and GCC:
safe_add:
add esi, DWORD PTR [rdi]
js .L3
mov DWORD PTR [rdi], esi
xor eax, eax
ret
.L3:
mov eax, -1
ret
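As noted above, that code only handles positive addition. One way to extend it to both signs (my sketch, under the same two's-complement, no-padding assumptions; the helper name safe_add_any is mine) is the classic sign-bit test done in unsigned arithmetic: signed overflow happened iff the two operands have the same sign while the wrapped sum has the opposite sign.
#include <limits.h>

/* Sketch: detect signed overflow via the sign bits, computed in unsigned.
 * The final conversion back to int is implementation-defined (6.3.1.3/3),
 * exactly as in the answer above. */
int safe_add_any(int *lhs, int rhs) {
    unsigned a = *lhs;
    unsigned b = rhs;
    unsigned sum = a + b;
    /* high bit of ~(a^b) & (a^sum) is set exactly when overflow occurred */
    if ((~(a ^ b) & (a ^ sum)) > (unsigned) INT_MAX) return -1;
    *lhs = sum;
    return 0;
}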
I have the following assembly code from the C function long loop(long x, int n)
with x in %rdi, n in %esi on a 64 bit machine. I've written my comments on what I think the assembly instructions are doing.
loop:
movl %esi, %ecx // store the value of n in register ecx
movl $1, %edx // store the value of 1 in register edx (rdx).initial mask
movl $0, %eax //store the value of 0 in register eax (rax). this is initial return value
jmp .L2
.L3:
movq %rdi, %r8 //store the value of x in register r8
andq %rdx, %r8 //store the value of (x & mask) in r8
orq %r8, %rax //update the return value rax by (x & mask | [rax] )
salq %cl, %rdx //update the mask rdx by ( [rdx] << n)
.L2:
testq %rdx, %rdx //test mask&mask
jne .L3 // if (mask&mask) != 0, jump to L3
rep; ret
I have the following C function which needs to correspond to the assembly code:
long loop(long x, int n){
    long result = _____ ;
    long mask;
    // for (mask = ______; mask ________; mask = ______){   // filled in as:
    for (mask = 1; mask != 0; mask <<n) {
        result |= ________;
    }
    return result;
}
I need some help filling in the blanks. I'm not 100% sure what the assembly instructions are doing, but I've given it my best shot by commenting each line.
You've pretty much got it in your comments.
long loop(long x, int n) {
    long result = 0;
    long mask;
    for (mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);
    }
    return result;
}
Because result is the return value, and the return value is stored in %rax, movl $0, %eax loads 0 into result initially.
Inside the for loop, %r8 holds the value that is or'd with result, which, like you mentioned in your comments, is just x & mask.
The function copies every nth bit of x into result.
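For a concrete feel, here is a small driver of my own (assuming a 64-bit long; note the final mask shift past the sign bit is technically undefined in C, though it matches what the assembly does):
#include <stdio.h>

long loop(long x, int n);   /* the reconstructed function above */

int main(void)
{
    /* keep every 4th bit of an all-ones pattern */
    printf("%#lx\n", (unsigned long)loop(-1L, 4));  /* 0x1111111111111111 on a 64-bit long */
    return 0;
}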
For the record, the implementation is full of missed optimizations, especially if we're tuning for Sandybridge-family where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs)
bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the shift count the same way x86 integer shifts do, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f). i.e. setting bit n in RAX.
(This code is clearly designed to be simple to understand, and doesn't even use the most well-known x86 peephole optimization, xor-zeroing, but it's fun to see how we can implement the same thing efficiently.)
More obviously, doing the and inside the loop is bad. Instead we can just build up a mask with every nth bit set, and return x & mask. This has the added advantage that with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for the ret even if we care about tuning for the branch predictors in AMD Phenom CPUs. (Because it isn't the first byte after a conditional branch.)
# x86-64 System V: x in RDI, n in ESI
mask_bitstride: # give the function a meaningful name
mov $1, %eax # mask = 1
mov %esi, %ecx # unsigned bitidx = n (new tmp var)
# the loop always runs at least 1 iteration, so just fall into it
.Lloop: # do {
bts %rcx, %rax # rax |= 1ULL << bitidx
add %esi, %ecx # bitidx += n
cmp $63, %ecx # sizeof(long)*CHAR_BIT - 1
jbe .Lloop # }while(bitidx <= maxbit); // unsigned condition
and %rdi, %rax # return x & mask
ret # not following a JCC so no need for a REP prefix even on K10
We assume n is in the 0..63 range because otherwise the C would have undefined behaviour. In that case, this implementation differs from the shift-based implementation in the question. The shl version would treat n==64 as an infinite loop, because shift count = 0x40 & 0x3f = 0, so mask would never change. This bitidx += n version would exit after the first iteration, because idx immediately becomes >=63, i.e. out of range.
A less extreme case is n=65: the shl version would copy all the bits (effective shift count of 1), while this version would exit after one iteration and copy only the lowest bits.
Both versions create an infinite loop for n=0. I used an unsigned compare so negative n will exit the loop promptly.
On Intel Sandybridge-family, the inner loop in the original is 7 uops. (mov = 1 + and=1 + or=1 + variable-count-shl=3 + macro-fused test+jcc=1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.
My version is only 3 uops, and can run about twice as fast. (1 iteration per clock.)
Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.
As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:
The sort should run in the same amount of cycles in terms of the length of the array, regardless of the particular values of the array
The sort should not branch or access memory depending on the particular values of the array
Do any such implementations exist? If not, are there any good resources on this type of programming?
To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.
double arr[24]; // some input

double min = DBL_MAX;
int i;
for (i = 0; i < 24; ++i) {
    if (arr[i] < min) {
        min = arr[i];
    }
}
Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?
Use a sorting network, a series of comparisons and swaps.
The swap call must not be dependent on the comparison. It must be implemented in a way to execute the same amount of instructions, regardless of the comparison result.
Like this:
#include <stdbool.h>

void swap( int* a , int* b , bool c )
{
    const int min = c ? *b : *a;
    const int max = c ? *a : *b;
    *a = min;
    *b = max;
}
swap( &array[0] , &array[1] , array[0] > array[1] );
Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html
Example for 4 elements, the numbers are array indices, generated by the above link:
SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);
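For completeness, one way (my sketch, assuming an int array[] in scope and the swap() function above) to map the generated SWAP(i, j) pairs onto the branchless swap:
#define SWAP(i, j) swap(&array[(i)], &array[(j)], array[(i)] > array[(j)])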
This is a very dumb bubble sort that actually works and doesn't branch or change memory access behavior depending on input data. Not sure if this can be plugged into another sorting algorithm; they need their compares separate from the swaps, but maybe it's possible; working on that now.
#include <stdint.h>

static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
    uint32_t a = *ap;
    uint32_t b = *bp;
    int64_t c = (int64_t)a - (int64_t)b;
    uint32_t sign = ((uint64_t)c >> 63);
    uint32_t min = a * sign + b * (sign ^ 1);
    uint32_t max = b * sign + a * (sign ^ 1);
    *ap = min;
    *bp = max;
}

void
timing_sort(uint32_t *arr, int n)
{
    int i, j;
    for (i = n - 1; i >= 0; i--) {
        for (j = 0; j < i; j++) {
            cmp_and_swap(&arr[j], &arr[j + 1]);
        }
    }
}
The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):
_cmp_and_swap:
00000001000009e0 pushq %rbp
00000001000009e1 movq %rsp, %rbp
00000001000009e4 movl (%rdi), %r8d
00000001000009e7 movl (%rsi), %r9d
00000001000009ea movq %r8, %rdx
00000001000009ed subq %r9, %rdx
00000001000009f0 shrq $0x3f, %rdx
00000001000009f4 movl %edx, %r10d
00000001000009f7 negl %r10d
00000001000009fa orl $-0x2, %edx
00000001000009fd incl %edx
00000001000009ff movl %r9d, %ecx
0000000100000a02 andl %edx, %ecx
0000000100000a04 andl %r8d, %edx
0000000100000a07 movl %r8d, %eax
0000000100000a0a andl %r10d, %eax
0000000100000a0d addl %eax, %ecx
0000000100000a0f andl %r9d, %r10d
0000000100000a12 addl %r10d, %edx
0000000100000a15 movl %ecx, (%rdi)
0000000100000a17 movl %edx, (%rsi)
0000000100000a19 popq %rbp
0000000100000a1a retq
0000000100000a1b nopl (%rax,%rax)
The only memory accesses are the reads and writes of the array, and there are no branches. The compiler did figure out what the multiplication actually does, which is quite clever, but it didn't use branches for it.
The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.
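For example, one arguably cleaner variant (my own sketch of the same idea) derives an all-ones/all-zeros mask from the comparison and selects with XOR and AND instead of multiplications. Whether the comparison stays branchless depends on the compiler emitting setcc/cmov rather than a jump, so the generated code should still be checked:
#include <stdint.h>

/* Branchless conditional swap: put the smaller value in *ap, larger in *bp. */
static void cmp_and_swap_masked(uint32_t *ap, uint32_t *bp)
{
    uint32_t a = *ap;
    uint32_t b = *bp;
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b);  /* all ones iff a > b */
    uint32_t diff = (a ^ b) & mask;
    *ap = a ^ diff;   /* min(a, b) */
    *bp = b ^ diff;   /* max(a, b) */
}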
As requested, here's a compare function for doubles:
#include <math.h>

void
cmp_and_swap(double *ap, double *bp)
{
    double a = *ap;
    double b = *bp;
    int sign = !!signbit(a - b);   /* signbit() returns nonzero, not necessarily 1 */
    double min = a * sign + b * (sign ^ 1);
    double max = b * sign + a * (sign ^ 1);
    *ap = min;
    *bp = max;
}
Compiled code is branchless and doesn't change memory access pattern depending on input data.
A very trivial, constant-time (but also highly inefficient) sort is to
have a source and a destination array
for each element in the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly in this position.
No early breaks, (nearly) constant timing, no dependence on even partial sortedness of the source; a sketch follows below.
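Here is a sketch of that idea (my own code, assuming uint32_t keys; ties are broken by index so duplicates land in distinct slots). The loop bounds and the memory access pattern depend only on n, though the branchless behaviour of the == and < comparisons should still be verified in the generated assembly:
#include <stddef.h>
#include <stdint.h>

/* 1 if a < b, else 0, computed without a data-dependent branch. */
static uint32_t ct_less(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a - (uint64_t)b) >> 63);
}

/* For each destination slot k, scan the whole source and pick the element
 * whose rank (number of smaller elements, ties broken by index) equals k. */
void ct_selection_sort(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        uint32_t out = 0;
        for (size_t i = 0; i < n; i++) {
            size_t rank = 0;
            for (size_t j = 0; j < n; j++) {
                uint32_t less  = ct_less(src[j], src[i]);
                uint32_t equal = (uint32_t)(src[j] == src[i]) & (uint32_t)(j < i);
                rank += less | equal;
            }
            /* all-ones mask when rank == k, all-zeros otherwise */
            uint32_t match = (uint32_t)0 - (uint32_t)(rank == k);
            out |= src[i] & match;
        }
        dst[k] = out;
    }
}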
I was asked, as a challenge, to change the endianness of an int. The idea I had was to use bit shifts:
int swap_endianess(int color)
{
    int a;
    int r;
    int g;
    int b;

    a = (color & (255 << 24)) >> 24;
    r = (color & (255 << 16)) >> 16;
    g = (color & (255 << 8)) >> 8;
    b = (color & 255);
    return (b << 24 | g << 16 | r << 8 | a);
}
But someone told me it was easier to use a union containing an int and an array of four chars (assuming an int is stored in 4 chars), fill the int, and then reverse the array.
union u_color
{
    int color;
    char c[4];
};

int swap_endianess(int color)
{
    union u_color ucol;
    char tmp;

    ucol.color = color;
    tmp = ucol.c[0];
    ucol.c[0] = ucol.c[3];
    ucol.c[3] = tmp;
    tmp = ucol.c[1];
    ucol.c[1] = ucol.c[2];
    ucol.c[2] = tmp;
    return (ucol.color);
}
What is the more efficient way of swapping bytes between those two? Are there more efficient ways of doing this?
EDIT
After testing on an i7, the union way takes about 24 seconds (measured with the time command), while the bitshift way takes about 15 seconds, over 2,000,000,000 iterations.
The thing is that if I compile with -O1, both methods take only 1 second, and 0.001 seconds with -O2 or -O3.
The bitshift method compiles to bswap in assembly with -O2 and -O3, but the union way does not; gcc seems to recognize the naive pattern but not the more complicated union approach. To conclude, read the bottom line of user3386109's answer.
Here is the correct code for a byte swap function
uint32_t changeEndianess( uint32_t value )
{
    uint32_t r, g, b, a;

    r = (value >> 24) & 0xff;
    g = (value >> 16) & 0xff;
    b = (value >> 8) & 0xff;
    a = value & 0xff;
    return (a << 24) | (b << 16) | (g << 8) | r;
}
Here's a function that tests the byte swap function
void testEndianess( void )
{
    uint32_t value = arc4random();
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}
Using the LLVM compiler with full optimization, the resulting assembly code for the testEndianess function is
0x93d0: calll 0xc82e ; call `arc4random`
0x93d5: movl %eax, %ecx ; copy `value` into register CX
0x93d7: bswapl %ecx ; <--- this is the `changeEndianess` function
0x93d9: movl %ecx, 0x8(%esp) ; put 'result' on the stack
0x93dd: movl %eax, 0x4(%esp) ; put 'value' on the stack
0x93e1: leal 0x6536(%esi), %eax ; compute address of the format string
0x93e7: movl %eax, (%esp) ; put the format string on the stack
0x93ea: calll 0xc864 ; call 'printf'
In other words, the LLVM compiler recognizes the entire changeEndianess function and implements it as a single bswapl instruction.
Side note for those wondering why the call to arc4random is necessary. Given this code
void testEndianess( void )
{
    uint32_t value = 0x11223344;
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}
the compiler generates this assembly
0x93dc: leal 0x6524(%eax), %eax ; compute address of format string
0x93e2: movl %eax, (%esp) ; put the format string on the stack
0x93e5: movl $0x44332211, 0x8(%esp) ; put 'result' on the stack
0x93ed: movl $0x11223344, 0x4(%esp) ; put 'value' on the stack
0x93f5: calll 0xc868 ; call 'printf'
In other words, given a hardcoded value as input, the compiler precomputes the result of the changeEndianess function, and puts that directly into the assembly code, bypassing the function entirely.
The bottom line. Write your code the way it makes sense to write your code, and let the compiler do the optimizing. Compilers these days are amazing. Using tricky optimizations in source code (e.g. unions) may defeat the optimizations built into the compiler, actually resulting in slower code.
You can also use this code which might be slightly more efficient:
#include <stdint.h>

extern uint32_t
change_endianness(uint32_t x)
{
    x = (x & 0x0000FFFFLU) << 16 | (x & 0xFFFF0000LU) >> 16;
    x = (x & 0x00FF00FFLU) << 8 | (x & 0xFF00FF00LU) >> 8;
    return (x);
}
This is compiled by gcc on amd64 to the following assembly:
change_endianness:
roll $16, %edi
movl %edi, %eax
andl $16711935, %edi
andl $-16711936, %eax
salq $8, %rdi
sarq $8, %rax
orl %edi, %eax
ret
To get an even better result, you might want to employ inline assembly. The i386 and amd64 architectures provide a bswap instruction to do what you want. As user3386109 explained, compilers might recognize the "naive" approach and emit bswap instructions, something that doesn't happen with the approach above. It is, however, better in the case where the compiler is not smart enough to detect that it can use bswap.
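Here is a sketch of what that could look like with GCC/Clang extended inline assembly on i386/amd64 (in practice the __builtin_bswap32() builtin is the more portable route):
#include <stdint.h>

/* Swap the byte order of a 32-bit value with the x86 bswap instruction. */
static inline uint32_t bswap32_asm(uint32_t x)
{
    __asm__ ("bswap %0" : "+r" (x));
    return x;
}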
The following little program behaves very oddly with GCC version 4.2.1 (Apple Inc. build 5664) on a Mac.
#include <stdio.h>

int main(){
    int x = 1 << 32;
    int y = 32;
    int z = 1 << y;
    printf("x:%d, z: %d\n", x, z);
}
The result is x:0, z: 1.
Any idea why the values of x and z are different?
Thanks a lot.
Short answer: the Intel processor masks the shift count to 5 bits (maximum 31). In other words, the shift actually performed is 32 & 31, which is 0 (no change).
The same result appears using gcc on a Linux 32-bit PC.
I assembled a shorter version of this program because I was puzzled by why a left shift of 32 bits should result in a non-zero value at all:
int main(){
    int y = 32;
    unsigned int z = 1 << y;
    unsigned int k = 1;

    k <<= y;
    printf("z: %u, k: %u\n", z, k);
}
...compiled using the command gcc -Wall -o a.s -S deleteme.c (comments are my own):
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $32, -16(%ebp) ; y = 32
movl -16(%ebp), %ecx ; 32 in CX register
movl $1, %eax ; AX = 1
sall %cl, %eax ; AX <<= 32(32)
movl %eax, -12(%ebp) ; z = AX
movl $1, -8(%ebp) ; k = 1
movl -16(%ebp), %ecx ; CX = y = 32
sall %cl, -8(%ebp) ; k <<= CX(32)
movl -8(%ebp), %eax ; AX = k
movl %eax, 8(%esp)
movl -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
call printf
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
Ok so what does this mean? It's this instruction that puzzles me:
sall %cl, -8(%ebp) ; k <<= CX(32)
Clearly k is being shifted left by 32 bits.
You've got me: it's using the sall instruction, which is an arithmetic shift. I don't know why shifting this by 32 results in the bit re-appearing in its initial position. My initial conjecture would be that the processor is optimised to perform this instruction in one clock cycle, which means that any shift by more than 31 would be treated as a don't-care. But I'm curious to find the answer, because I would expect the shift to push all the bits off the left end of the data type.
I found a link to http://faydoc.tripod.com/cpu/sal.htm which explains that the shift count (in the CL register) is masked to 5 bits. This means that if you tried to shift by 32 bits the actual shift performed would be by zero bits (i.e. no change). There's the answer!
If your ints are 32 bits or shorter, the behaviour is undefined ... and undefined behaviour cannot be explained.
The Standard says:
6.5.7/3 [...] If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
You can check the size of your int in bits, for example with:
#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("bits in an int: %d\n", CHAR_BIT * (int)sizeof (int));
    return 0;
}
And you can check your int width (there can be padding bits), for example with:
#include <limits.h>
#include <stdio.h>

int main(void) {
    int width = 0;
    int tmp = INT_MAX;

    while (tmp) {
        tmp >>= 1;
        width++;
    }
    printf("width of an int: %d\n", width + 1 /* for the sign bit */);
    return 0;
}
Standard 6.2.6.2/2: For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; there shall be exactly one sign bit
The C99 standard says that the result of shifting a number by the width in bits (or more) of the operand is undefined. Why?
Well this allows compilers to create the most efficient code for a particular architecture. For instance, the i386 shift instruction uses a five bit wide field for the number of bits to shift a 32 bit operand by. The C99 standard allows the compiler to simply take the bottom five bits of the shift count and put them in the field. Clearly this means that a shift of 32 bits (= 100000 in binary) is therefore identical to a shift of 0 and the result will therefore be the left operand unchanged.
A different CPU architecture might use a wider bit field, say 32 bits. The compiler can still put the shift count directly in the field but this time the result will be 0 because a shift of 32 bits will shift all the bits out of the left operand.
If the C99 defined one or other of these behaviours as correct, either the compiler for Intel has to put special checking in for shift counts that are too big or the compiler for non i386 has to mask the shift count.
The reason why
int x = 1 << 32;
and
int z = 1 << y;
give different results is because the first calculation is a constant expression and can be performed entirely by the compiler. The compiler must be calculating constant expressions by using 64 bit arithmetic. The second expression is calculated by the code generated by the compiler. Since the type of both y and z is int the code generates a calculation using 32 bit wide ints (int is 32 bits on both i386 and x86_64 with gcc on Apple).
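If the goal is an actual 64-bit shift at run time, widening the left operand avoids the undefined behaviour altogether; for instance:
#include <stdio.h>

int main(void)
{
    int y = 32;
    long long z = 1LL << y;    /* 64-bit left operand, so a shift by 32 is well defined */
    printf("z: %lld\n", z);    /* prints 4294967296 */
    return 0;
}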
In my mind "int x = y << 32;" does not make sense if sizeof(int)==4.
But I had a similar issue with:
long y = ...
long x = y << 32;
Where I got a warning "warning: left shift count >= width of type" even though sizeof(long) was 8 on the target in question. I got rid of the warning by doing this instead:
long x = (y << 16) << 16;
And that seemed to work.
On a 64-bit architecture there was no warning. On a 32-bit architecture (where long is typically only 4 bytes wide) there was.