The following little program behaves strangely when compiled with GCC version 4.2.1 (Apple Inc. build 5664) on a Mac.
#include <stdio.h>
int main(){
int x = 1 << 32;
int y = 32;
int z = 1 << y;
printf("x:%d, z: %d\n", x, z);
}
The result is x:0, z: 1.
Any idea why the values of x and z are different?
Thanks a lot.
Short answer: the Intel processor masks the shift count to 5 bits (maximum 31). In other words, the shift actually performed is 32 & 31, which is 0 (no change).
The same result appears using gcc on a Linux 32-bit PC.
I assembled a shorter version of this program because I was puzzled by why a left shift of 32 bits should result in a non-zero value at all:
int main(){
int y = 32;
unsigned int z = 1 << y;
unsigned int k = 1;
k <<= y;
printf("z: %u, k: %u\n", z, k);
}
..using the command gcc -Wall -o a.s -S deleteme.c (comments are my own)
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $32, -16(%ebp) ; y = 32
movl -16(%ebp), %ecx ; 32 in CX register
movl $1, %eax ; AX = 1
sall %cl, %eax ; AX <<= 32(32)
movl %eax, -12(%ebp) ; z = AX
movl $1, -8(%ebp) ; k = 1
movl -16(%ebp), %ecx ; CX = y = 32
sall %cl, -8(%ebp) ; k <<= CX(32)
movl -8(%ebp), %eax ; AX = k
movl %eax, 8(%esp)
movl -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
call printf
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
Ok so what does this mean? It's this instruction that puzzles me:
sall %cl, -8(%ebp) ; k <<= CX(32)
Clearly k is being shifted left by 32 bits.
You've got me: it's using the sall instruction, which is an arithmetic shift (for left shifts it is identical to the logical shift shl). I don't know why shifting by 32 results in the bit re-appearing in its initial position. My initial conjecture is that the processor is optimised to perform this instruction in one clock cycle, which would mean that any shift by more than 31 is treated as a don't-care. But I'm curious to find the answer, because I would expect the shift to push all bits off the left end of the data type.
I found a link to http://faydoc.tripod.com/cpu/sal.htm which explains that the shift count (in the CL register) is masked to 5 bits. This means that if you tried to shift by 32 bits the actual shift performed would be by zero bits (i.e. no change). There's the answer!
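As a rough illustration of the masking described above, the following C sketch models what the 32-bit sall instruction does with its count. The function name x86_shl32 is made up for this example; it is a model of the hardware behaviour, not something the compiler or the C standard provides.

```c
#include <stdint.h>

/* Hypothetical model of the 32-bit x86 left shift: the CPU only looks at
   the low 5 bits of the count, so a count of 32 behaves like a count of 0. */
uint32_t x86_shl32(uint32_t value, unsigned count) {
    /* The masked count is always in 0..31, so this shift is well defined in C. */
    return value << (count & 31u);
}
```

With this model, x86_shl32(1, 32) gives 1 (no change) and x86_shl32(1, 33) gives 2, which matches the z: 1 output seen in the question.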
If your ints are 32 bits or shorter, the behaviour is undefined ... and undefined behaviour cannot be explained.
The Standard says:
6.5.7/3 [...] If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
You can check the size of an int in bits, for example with:
#include <limits.h>
#include <stdio.h>
int main(void) {
printf("bits in an int: %d\n", CHAR_BIT * (int)sizeof (int));
return 0;
}
And you can check your int width (there can be padding bits), for example with:
#include <limits.h>
#include <stdio.h>
int main(void) {
int width = 0;
int tmp = INT_MAX;
while (tmp) {
tmp >>= 1;
width++;
}
printf("width of an int: %d\n", width + 1 /* for the sign bit */);
return 0;
}
Standard 6.2.6.2/2: For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; there shall be exactly one sign bit.
The C99 standard says that the result of shifting a number by the width in bits (or more) of the operand is undefined. Why?
Well, this allows compilers to create the most efficient code for a particular architecture. For instance, the i386 shift instruction uses a five-bit-wide field for the number of bits to shift a 32-bit operand by. The C99 standard allows the compiler to simply take the bottom five bits of the shift count and put them in the field. This means that a shift of 32 bits (= 100000 in binary) is identical to a shift of 0, and the result will therefore be the left operand unchanged.
A different CPU architecture might use a wider bit field, say 32 bits. The compiler can still put the shift count directly in the field, but this time the result will be 0 because a shift of 32 bits will shift all the bits out of the left operand.
If C99 defined one or other of these behaviours as correct, either the compiler for Intel would have to add special checking for shift counts that are too big, or the compiler for non-i386 would have to mask the shift count.
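If a program needs one specific result for oversized counts, it has to ask for it explicitly. A minimal sketch, assuming the "all bits fall off the end" behaviour is the one wanted; the helper name shl32_or_zero is invented for this example:

```c
#include <stdint.h>

/* Hypothetical helper: left shift with a defined result for any count.
   Counts of 32 or more shift every bit out, so the result is 0. */
uint32_t shl32_or_zero(uint32_t value, unsigned count) {
    if (count >= 32u) {
        return 0u;             /* never execute the undefined shift */
    }
    return value << count;     /* count is in 0..31 here */
}
```

This is exactly the "special checking" an i386 compiler would have to emit for every variable shift if the standard mandated this result.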
The reason why
int x = 1 << 32;
and
int z = 1 << y;
give different results is that the first is a constant expression, which the compiler evaluates itself at compile time. Because the behaviour is undefined, the compiler is free to fold it to any value; here it apparently evaluates the shift in wider (for example 64-bit) arithmetic and truncates, which yields 0. The second expression is evaluated at run time by the generated code. Since both y and z have type int, that code uses a 32-bit shift (int is 32 bits on both i386 and x86_64 with gcc on Apple).
In my mind "int x = y << 32;" does not make sense if sizeof(int)==4.
But I had a similar issue with:
long y = ...
long x = y << 32;
Where I got a warning "warning: left shift count >= width of type" even though sizeof(long) was 8 on the target in question. I got rid of the warning by doing this instead:
long x = (y << 16) << 16;
And that seemed to work.
On a 64 bit architecture there was no warning. On a 32 bit architecture there was.
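One way to avoid depending on the width of long is to do the shift in a fixed-width 64-bit type from <stdint.h>. A minimal sketch, where the function name to_high_half is made up for this example:

```c
#include <stdint.h>

/* Hypothetical helper: place a 32-bit value in the upper half of a 64-bit word. */
uint64_t to_high_half(uint32_t low) {
    /* Widen to 64 bits first, so the shift count (32) is less than the width (64). */
    return (uint64_t)low << 32;
}
```

Because uint64_t is 64 bits wide on both 32-bit and 64-bit targets, the shift is well defined on either and should not trigger the warning.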
I have the following assembly code from the C function long loop(long x, int n)
with x in %rdi, n in %esi on a 64 bit machine. I've written my comments on what I think the assembly instructions are doing.
loop:
movl %esi, %ecx // store the value of n in register ecx
movl $1, %edx // store the value of 1 in register edx (rdx).initial mask
movl $0, %eax //store the value of 0 in register eax (rax). this is initial return value
jmp .L2
.L3:
movq %rdi, %r8 //store the value of x in register r8
andq %rdx, %r8 //store the value of (x & mask) in r8
orq %r8, %rax //update the return value rax by (x & mask | [rax] )
salq %cl, %rdx //update the mask rdx by ( [rdx] << n)
.L2:
testq %rdx, %rdx //test mask&mask
jne .L3 // if (mask&mask) != 0, jump to L3
rep; ret
I have the following C function which needs to correspond to the assembly code:
long loop(long x, int n){
long result = _____ ;
long mask;
// for (mask = ______; mask ________; mask = ______){ // filled in as:
for (mask = 1; mask != 0; mask <<n) {
result |= ________;
}
return result;
}
I need some help filling in the blanks. I'm not 100% sure what the assembly instructions do, but I've given it my best shot by commenting each line.
You've pretty much got it in your comments.
long loop(long x, int n) {
long result = 0;
long mask;
for (mask = 1; mask != 0; mask <<= n) {
result |= (x & mask);
}
return result;
}
Because result is the return value, and the return value is stored in %rax, movl $0, %eax loads 0 into result initially.
Inside the for loop, %r8 holds the value that is or'd with result, which, like you mentioned in your comments, is just x & mask.
The function copies every nth bit to result.
For the record, the implementation is full of missed optimizations, especially if we're tuning for Sandybridge-family where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs)
bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the shift count the same way x86 integer shifts do, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f). i.e. setting bit n in RAX.
(This code is clearly designed to be simple to understand, and doesn't even use the most well-known x86 peephole optimization, xor-zeroing, but it's fun to see how we can implement the same thing efficiently.)
More obviously, doing the and inside the loop is bad. Instead we can just build up a mask with every nth bit set, and return x & mask. This has the added advantage that with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for the ret even if we care about tuning for the branch predictors in AMD Phenom CPUs. (Because it isn't the first byte after a conditional branch.)
# x86-64 System V: x in RDI, n in ESI
mask_bitstride: # give the function a meaningful name
mov $1, %eax # mask = 1
mov %esi, %ecx # unsigned bitidx = n (new tmp var)
# the loop always runs at least 1 iteration, so just fall into it
.Lloop: # do {
bts %rcx, %rax # rax |= 1ULL << bitidx
add %esi, %ecx # bitidx += n
cmp $63, %ecx # sizeof(long)*CHAR_BIT - 1
jbe .Lloop # }while(bitidx <= maxbit); // unsigned condition
and %rdi, %rax # return x & mask
ret # not following a JCC so no need for a REP prefix even on K10
We assume n is in the 0..63 range because otherwise the C would have undefined behaviour. Outside that range, this implementation differs from the shift-based implementation in the question. The shl version would treat n==64 as an infinite loop, because shift count = 0x40 & 0x3f = 0, so mask would never change. This bitidx += n version would exit after the first iteration, because bitidx immediately becomes > 63, i.e. out of range.
A less extreme case is that n=65 would copy all the bits (shift count of 1); this would copy only the low bit.
Both versions create an infinite loop for n=0. I used an unsigned compare so negative n will exit the loop promptly.
On Intel Sandybridge-family, the inner loop in the original is 7 uops. (mov = 1 + and=1 + or=1 + variable-count-shl=3 + macro-fused test+jcc=1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.
My version is only 3 uops, and can run about twice as fast. (1 iteration per clock.)
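The "build the mask first, then AND once" idea can also be written in C. This is only a sketch: the names stride_mask and copy_every_nth_bit are invented here, and unlike either assembly version it returns early for n == 0 or n > 63 instead of looping forever or relying on count masking.

```c
#include <stdint.h>

/* Hypothetical helper: mask with bits 0, n, 2n, 3n, ... set (for n in 1..63). */
uint64_t stride_mask(unsigned n) {
    uint64_t mask = 1;                      /* bit 0 is always part of the mask */
    if (n == 0 || n > 63) {
        return mask;                        /* assumption: treat bad n as "low bit only" */
    }
    for (unsigned bit = n; bit <= 63; bit += n) {
        mask |= (uint64_t)1 << bit;         /* same job as bts %rcx, %rax */
    }
    return mask;
}

/* Same result as loop(x, n) from the question for n in 1..63. */
uint64_t copy_every_nth_bit(uint64_t x, unsigned n) {
    return x & stride_mask(n);              /* single AND outside the loop */
}
```

For example, stride_mask(2) is 0x5555555555555555, so copy_every_nth_bit(x, 2) keeps every second bit of x.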
I'm trying to convert the following code into a single line using leal.
movl 4(%esp), %eax
sall $2, %eax
addl 8(%esp), %eax
addl $4, %eax
My question is of 3 parts:
Does the '%' in front of the register simply define the following string as a register?
Does the '$' in front of the integers define the following value type as int?
Is leal 4(%rsi, 4, %rdi), %eax a correct conversion from the above assembly? (ignoring the change from 32-bit to 64-bit)
Edit: Another question. would
unsigned int fun3(unsigned int x, unsigned int y)
{
unsigned int *z = &x;
unsigned int w = 4+y;
return (4*(*z)+w);
}
generate the above code? I'm unfamiliar with pointers.
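1: Yes, % marks a register name.
2: No. There is no int or float or bool or char or... in asm. You are dealing with the machine. The $ means the operand is an immediate (constant) value.
3: Taking the original instructions in order:
1 move the value at (esp + 4) into eax. esp is the stack pointer; eax is the register C functions use to return values.
2 shift left by two bits, which is the same as multiplying by 4
3 add the value at (esp + 8) to the value in eax
4 add 4 to the value in eax
So eax = x*4 + y + 4, where x is at (esp + 4) and y is at (esp + 8).
The leal computes eax = 4 + rsi + 4*rdi. In AT&T syntax the operand order is displacement(base, index, scale), so it is written leal 4(%rsi,%rdi,4), %eax.
So yes, it is the same computation, given x in rdi and y in rsi.
As for the edit: that depends on the compiler, but yes, that is a valid translation: 4*x+y+4.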
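To make the equivalence concrete, here is a small C sketch that mirrors the four original instructions step by step and then the single address-style expression that lea evaluates. Both function names are invented for this illustration.

```c
/* Hypothetical step-by-step mirror of the four-instruction sequence. */
unsigned four_instructions(unsigned x, unsigned y) {
    unsigned eax = x;   /* movl 4(%esp), %eax  */
    eax <<= 2;          /* sall $2, %eax       */
    eax += y;           /* addl 8(%esp), %eax  */
    eax += 4;           /* addl $4, %eax       */
    return eax;
}

/* The same value as one expression: displacement + base + index * scale. */
unsigned single_lea(unsigned x, unsigned y) {
    return 4 + y + 4 * x;
}
```

For x = 3 and y = 5 both functions return 21.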
I am trying to interpret the following IA32 assembler code and write a function in C that will have an equivalent effect.
Let's say that parameters a, b and c are stored at memory locations with offsets 8, 12 and 16 relative to the address in register %ebp, and that an appropriate function prototype in C would be equivFunction(int a, int b, int c);
movl 12(%ebp), %edx // store b into %edx
subl 16(%ebp), %edx // %edx = b - c
movl %edx, %eax // store b - c into %eax
sall $31, %eax // multiply (b-c) * 2^31
sarl $31, %eax // divide ((b-c)*2^31)) / 2^31
imull 8(%ebp), %edx // multiply a * (b - c) into %edx
xorl %edx, %eax // exclusive or? %edx or %eax ? what is going on here?
First, did I interpret the assembly correctly? If so, how would I go about translating this into C?
The sall/sarl combo has the effect of setting all bits of eax to the value of the zeroth bit. First, sal moves the 0th bit to the 31st position, making it a sign bit. Then sar moves it back, filling the rest of the register with its copy. Don't think of it as division/multiplication - think of it as bitwise shift, which "s" actually stands for.
So eax is 0xffffffff (-1) if b-c is odd, 0 if even. The imull instruction places a * (b - c) into edx. The final xor then either inverts all the bits of that product (that's what xor with all ones does) or leaves it unchanged.
This whole snippet has an air of artificiality. Is this homework?
The shifts replicate the low bit of b - c across the whole register, rather than multiplying/dividing, so the code is roughly
int equivFunction(int a, int b, int c) {
int t1 = b - c;
unsigned t2 = (t1 & 1) ? ~0U : 0;
return (a * t1) ^ t2;
}
Alternately:
int equivFunction(int a, int b, int c) {
int t1 = b - c;
int t2 = a * t1;
if (t1 & 1) t2 = -t2 - 1;
return t2;
}
Of course, the C code has undefined behavior on integer overflow, while the assembly code is well-defined, so the C code might not do the same thing in all cases (particularly if you compile it on a different architecture)
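If the wrap-around behaviour of the assembly matters, the arithmetic can be done on unsigned values, where overflow is defined. The sketch below follows the reading that sall $31 / sarl $31 copies bit 0 of b - c into every bit. The name equiv_function is made up, it assumes a 32-bit int as on IA32, and the final conversion back to a signed type is implementation-defined for out-of-range values rather than undefined.

```c
#include <stdint.h>

/* Hypothetical overflow-safe rendering of the assembly snippet. */
int32_t equiv_function(int32_t a, int32_t b, int32_t c) {
    uint32_t diff = (uint32_t)b - (uint32_t)c;        /* subl: wraps like the CPU */
    uint32_t mask = (diff & 1u) ? 0xFFFFFFFFu : 0u;   /* sall $31 then sarl $31 */
    uint32_t prod = (uint32_t)a * diff;               /* imull: low 32 bits of a * (b - c) */
    return (int32_t)(prod ^ mask);                    /* xorl: invert all bits when diff is odd */
}
```

For example, equiv_function(3, 5, 1) is 12 (difference is even), while equiv_function(3, 5, 2) is ~9, i.e. -10 (difference is odd).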
Today, I have been trying to write a function which should rotate a given 64-bit integer n bits to the right, or to the left if n is negative. Of course, bits shifted out of the integer should be rotated in on the other side.
I have kept the function quite simple.
void rotate(uint64_t *i, int n) {
uint64_t one = 1;
if(n > 0) {
do {
int storeBit = *i & one;
*i = *i >> 1;
if(storeBit == 1)
*i |= 0x80000000000000;
n--;
}while(n>0);
}
}
possible inputs are:
uint64_t num = 0x2;
rotate(&num, 1); // num should be 0x1
rotate(&num, -1); // num should be 0x2, again
rotate(&num, 62); // num should be 0x8
Unfortunately, I could not figure it out. I was hoping someone could help me.
EDIT: Now the code is online. Sorry, it took a while; I had some difficulties with the editor. I only implemented the rotation to the right, though. The rotation to the left is still missing.
uint64_t rotate(uint64_t v, int n) {
n = n & 63U;
if (n)
v = (v >> n) | (v << (64-n));
return v; }
gcc -O3 produces:
.cfi_startproc
andl $63, %esi
movq %rdi, %rdx
movq %rdi, %rax
movl %esi, %ecx
rorq %cl, %rdx
testl %esi, %esi
cmovne %rdx, %rax
ret
.cfi_endproc
not perfect, but reasonable.
int storeBit = *i & one;
In this line you are assigning a 64-bit unsigned integer to an int, which is probably 4 bytes. I think your problem is related to this. On little-endian machines things get complicated if you rely on operations that are not well defined.
if(n > 0)
does not handle negative n.
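For completeness, here is a sketch that keeps the in-place signature from the question and handles negative n as well. It uses the same "reduce n modulo 64" idea as the value-returning version above; the parameter name value is a choice made for this example.

```c
#include <stdint.h>

/* Rotate *value right by n bits; a negative n rotates left by -n bits. */
void rotate(uint64_t *value, int n) {
    /* Converting to unsigned and masking maps -1 to 63, -2 to 62, and so on,
       so a left rotation becomes the equivalent right rotation. */
    unsigned r = (unsigned)n & 63u;
    if (r != 0) {   /* r == 0 would need a shift by 64, which is undefined */
        *value = (*value >> r) | (*value << (64u - r));
    }
}
```

With num = 0x2, the calls rotate(&num, 1), rotate(&num, -1) and rotate(&num, 62) leave num at 0x1, 0x2 and 0x8 in turn, matching the expectations listed in the question.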
I am trying to understand how calculations involving numbers greater than 2^32 happen on a 32 bit machine.
C code
$ cat size.c
#include<stdio.h>
#include<math.h>
int main() {
printf ("max unsigned long long = %llu\n",
(unsigned long long)(pow(2, 64) - 1));
}
$
gcc output
$ gcc size.c -o size
$ ./size
max unsigned long long = 18446744073709551615
$
Corresponding assembly code
$ gcc -S size.c -O3
$ cat size.s
.file "size.c"
.section .rodata.str1.4,"aMS",@progbits,1
.align 4
.LC0:
.string "max unsigned long long = %llu\n"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $-1, 8(%esp) #1
movl $-1, 12(%esp) #2
movl $.LC0, 4(%esp) #3
movl $1, (%esp) #4
call __printf_chk
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",@progbits
$
What exactly happens on the lines 1 - 4?
Is this some kind of string concatenation at the assembly level?
__printf_chk is a wrapper around printf which checks for stack overflow, and takes an additional first parameter, a flag.
pow(2, 64) - 1 has been optimised to 0xffffffffffffffff as the arguments are constants.
As per the usual calling conventions, the first argument to __printf_chk() (int flag) is a 32-bit value on the stack (at %esp at the time of the call instruction). The next argument, const char * format, is a 32-bit pointer (the next 32-bit word on the stack, i.e. at %esp+4). And the 64-bit quantity that is being printed occupies the next two 32-bit words (at %esp+8 and %esp+12):
pushl %ebp ; prologue
movl %esp, %ebp ; prologue
andl $-16, %esp ; align stack pointer
subl $16, %esp ; reserve bytes for stack frame
movl $-1, 8(%esp) #1 ; store low half of 64-bit argument (a constant) to stack
movl $-1, 12(%esp) #2 ; store high half of 64-bit argument (a constant) to stack
movl $.LC0, 4(%esp) #3 ; store address of format string to stack
movl $1, (%esp) #4 ; store "flag" argument to __printf_chk to stack
call __printf_chk ; call routine
leave ; epilogue
ret ; epilogue
The compiler has effectively rewritten this:
printf("max unsigned long long = %llu\n", (unsigned long long)(pow(2, 64) - 1));
...into this:
__printf_chk(1, "max unsigned long long = %llu\n", 0xffffffffffffffffULL);
...and, at runtime, the stack layout for the call looks like this (showing the stack as 32-bit words, with addresses increasing from the bottom of the diagram upwards):
: :
: Stack :
: :
+-----------------+
%esp+12 | 0xffffffff | \
+-----------------+ } <-------------------------------------.
%esp+8 | 0xffffffff | / |
+-----------------+ |
%esp+4 |address of string| <---------------. |
+-----------------+ | |
%esp | 1 | <--. | |
+-----------------+ | | |
__printf_chk(1, "max unsigned long long = %llu\n", |
0xffffffffffffffffULL);
It is similar to the way we handle numbers greater than 9 with only the digits 0 - 9 (using positional digits), presuming the question is a conceptual one.
In your case, the compiler knows that 2^64-1 is just 0xffffffffffffffff, so it has pushed -1 (low dword) and -1 (high dword) onto the stack as your argument to printf. It's just an optimization.
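The split into two dwords can be reproduced in C. This is just an illustrative sketch; the type dword_pair and the function split_u64 are names invented here.

```c
#include <stdint.h>

/* Hypothetical pair of 32-bit halves, as they would sit in two stack slots. */
typedef struct {
    uint32_t lo;   /* stored at the lower address on x86 (little-endian) */
    uint32_t hi;   /* stored 4 bytes above lo */
} dword_pair;

dword_pair split_u64(uint64_t v) {
    dword_pair p;
    p.lo = (uint32_t)(v & 0xFFFFFFFFu);   /* low 32 bits */
    p.hi = (uint32_t)(v >> 32);           /* high 32 bits */
    return p;
}
```

For 0xffffffffffffffff both halves come out as 0xffffffff, the bit pattern that the assembler writes as $-1 when it is read as a signed 32-bit value.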
In general, 64-bit numbers (and even greater values) can be stored with multiple words, e.g. an unsigned long long uses two dwords. To add two 64-bit numbers, two additions are performed - one on the low 32 bits, and one on the high 32 bits, plus the carry:
; Add 64-bit number from esi onto edi:
mov eax, [esi] ; get low 32 bits of source
add [edi], eax ; add to low 32 bits of destination
; That add may have overflowed, and if it did, carry flag = 1.
mov eax, [esi+4] ; get high 32 bits of source
adc [edi+4], eax ; add to high 32 bits of destination, then add carry.
You can repeat this sequence of add and adcs as much as you like to add arbitrarily big numbers. The same thing can be done with subtraction - just use sub and sbb (subtract with borrow).
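The add/adc pair can be mimicked in portable C by detecting the carry by hand. A minimal sketch, with the struct words64 and the function add_words64 invented for illustration; a compiler for a 32-bit target would normally do this for you when you add two 64-bit values.

```c
#include <stdint.h>

/* Hypothetical 64-bit number held as two 32-bit words. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} words64;

words64 add_words64(words64 a, words64 b) {
    words64 r;
    r.lo = a.lo + b.lo;                 /* add: low words, may wrap around */
    uint32_t carry = (r.lo < a.lo);     /* wrapped sum is smaller than an operand, so carry = 1 */
    r.hi = a.hi + b.hi + carry;         /* adc: high words plus the carry */
    return r;
}
```

Adding {lo = 0xffffffff, hi = 0} and {lo = 1, hi = 0} gives {lo = 0, hi = 1}, i.e. the carry propagates into the high word.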
Multiplication and division are much more complicated, and the compiler usually produces some small helper functions to deal with these whenever you multiply 64-bit numbers together. Packages like GMP which support very, very large integers use SSE/SSE2 to speed things up. Take a look at this Wikipedia article for more information on multiplication algorithms.
As others have pointed out, all 64-bit arithmetic in your example has been optimised away. This answer focuses on the question in the title.
Basically we treat each 32-bit number as a digit and work in base 4294967296. In this manner we can work on arbitrarily big numbers.
Addition and subtraction are easiest. We work through the digits one at a time, starting from the least significant and moving to the most significant. Generally the first digit is done with a normal add/subtract instruction and later digits are done using a specific "add with carry" or "subtract with borrow" instruction. The carry flag in the status register is used to take the carry/borrow bit from one digit to the next. Thanks to two's complement, signed and unsigned addition and subtraction are the same.
Multiplication is a little trickier: multiplying two 32-bit digits can produce a 64-bit result. Most 32-bit processors have instructions that multiply two 32-bit numbers and produce a 64-bit result in two registers. Addition is then needed to combine the results into a final answer. Thanks to two's complement, signed and unsigned multiplication are the same provided the desired result size is the same as the argument size. If the result is larger than the arguments then special care is needed.
For comparison we start from the most significant digit. If the digits are equal we move down to the next digit, until we find a pair that differs or run out of digits (in which case the numbers are equal).
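The comparison rule can be sketched in C as follows. This is an unsigned comparison only, the function name compare_words is invented here, and the words are assumed to be stored least significant first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical multi-word unsigned compare; words[0] is the least significant.
   Returns -1 if a < b, 0 if a == b, 1 if a > b. */
int compare_words(const uint32_t *a, const uint32_t *b, size_t count) {
    for (size_t i = count; i-- > 0; ) {     /* walk from most to least significant */
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;    /* first differing digit decides */
        }
    }
    return 0;                               /* every digit matched */
}
```

A signed comparison would treat only the most significant word as signed and the remaining words as unsigned.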
Division is too complex for me to describe in this post, but there are plenty of examples out there of algorithms. e.g. http://www.hackersdelight.org/hdcodetxt/divDouble.c.txt
Some real-world examples from gcc https://godbolt.org/g/NclqXC , the assembler is in intel syntax.
First an addition. adding two 64-bit numbers and producing a 64-bit result. The asm is the same for both signed and unsigned versions.
int64_t add64(int64_t a, int64_t b) { return a + b; }
add64:
mov eax, DWORD PTR [esp+12]
mov edx, DWORD PTR [esp+16]
add eax, DWORD PTR [esp+4]
adc edx, DWORD PTR [esp+8]
ret
This is pretty simple, load one argument into eax and edx, then add the other using an add followed by an add with carry. The result is left in eax and edx for return to the caller.
Now a multiplication of two 64-bit numbers to produce a 64-bit result. Again the code doesn't change from signed to unsigned. I've added some comments to make it easier to follow.
Before we look at the code, let's consider the math. a and b are 64-bit numbers. I will use lo() to represent the lower 32 bits of a 64-bit number and hi() to represent the upper 32 bits of a 64-bit number.
(a * b) = (lo(a) * lo(b)) + (hi(a) * lo(b) * 2^32) + (hi(b) * lo(a) * 2^32) + (hi(b) * hi(a) * 2^64)
(a * b) mod 2^64 = (lo(a) * lo(b)) + (lo(hi(a) * lo(b)) * 2^32) + (lo(hi(b) * lo(a)) * 2^32)
lo((a * b) mod 2^64) = lo(lo(a) * lo(b))
hi((a * b) mod 2^64) = hi(lo(a) * lo(b)) + lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
uint64_t mul64(uint64_t a, uint64_t b) { return a*b; }
mul64:
push ebx ;save ebx
mov eax, DWORD PTR [esp+8] ;load lo(a) into eax
mov ebx, DWORD PTR [esp+16] ;load lo(b) into ebx
mov ecx, DWORD PTR [esp+12] ;load hi(a) into ecx
mov edx, DWORD PTR [esp+20] ;load hi(b) into edx
imul ecx, ebx ;ecx = lo(hi(a) * lo(b))
imul edx, eax ;edx = lo(hi(b) * lo(a))
add ecx, edx ;ecx = lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
mul ebx ;eax = lo(lo(a) * lo(b))
;edx = hi(lo(a) * lo(b))
pop ebx ;restore ebx.
add edx, ecx ;edx = hi(lo(a) * lo(b)) + lo(hi(a) * lo(b)) + lo(hi(b) * lo(a))
ret
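The same decomposition can be written out in C to check the math. This is a sketch for illustration, not what the compiler literally does; the function name mul64_from_halves is made up.

```c
#include <stdint.h>

/* Hypothetical 64 x 64 -> 64 multiply built only from 32-bit halves. */
uint64_t mul64_from_halves(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low = (uint64_t)a_lo * b_lo;                  /* mul: full 64-bit lo(a) * lo(b) */
    uint32_t cross = (uint32_t)((uint64_t)a_hi * b_lo)     /* imul: lo(hi(a) * lo(b)) */
                   + (uint32_t)((uint64_t)b_hi * a_lo);    /* imul: lo(hi(b) * lo(a)) */
    uint32_t hi = (uint32_t)(low >> 32) + cross;           /* add: upper 32 bits of the result */

    return ((uint64_t)hi << 32) | (uint32_t)low;           /* hi(a) * hi(b) never reaches the low 64 bits */
}
```

Comparing its result against the built-in a * b for a few values is an easy way to convince yourself that the hi(a) * hi(b) term really can be dropped.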
Finally when we try a division we see.
int64_t div64(int64_t a, int64_t b) { return a/b; }
div64:
sub esp, 12
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
call __divdi3
add esp, 28
ret
The compiler has decided that division is too complex to implement inline and instead calls a library routine.
The compiler actually made a static optimization of your code.
lines #1 #2 #3 are parameters for printf()
As @Pafy mentions, the compiler has evaluated this as a constant.
2 to the 64th minus 1 is 0xffffffffffffffff.
As 2 32-bit integers this is: 0xffffffff and 0xffffffff, which if you take that as a pair of 32-bit signed types, ends up as: -1, and -1.
Thus for your compiler the code generated happens to be equivalent to:
printf("max unsigned long long = %llu\n", -1, -1);
In the assembly it's written like this:
movl $-1, 8(%esp) #Second -1 parameter
movl $-1, 12(%esp) #First -1 parameter
movl $.LC0, 4(%esp) #Format string
movl $1, (%esp) #A one. Kind of odd, perhaps __printf_chk
#in your C library expects this.
call __printf_chk
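By the way, a better way to calculate powers of 2 is to shift 1 left, e.g. 1ULL << 63 for 2^63. That does not work for 2^64 itself, though: 1ULL << 64 shifts by the full width of the type, which is undefined behaviour. For the largest unsigned long long value, use ULLONG_MAX from <limits.h> or ~0ULL.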
No one in this thread noticed that the OP asked to explain the first 4 lines, not lines 11-14.
The first 4 lines are:
.file "size.c"
.section .rodata.str1.4,"aMS",@progbits,1
.align 4
.LC0:
Here's what happens in first 4 lines:
.file "size.c"
This is an assembler directive that says that we are about to start a new logical file called "size.c".
.section .rodata.str1.4,"aMS",@progbits,1
This is also a directive. It switches to a read-only data section that holds the program's string constants.
.align 4
This directive advances the location counter to the next multiple of 4.
.LC0:
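This is a label, .LC0, which marks the address of what follows so that other code can refer to it (here, the format string whose address is passed to printf).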
I hope I provided the right answer to the question as I answered exactly what OP asked.