I'm looking over an example on assembly in CSAPP (Computer Systems: A Programmer's Perspective, 2nd edition) and I just want to know if my understanding of the assembly code is correct.
Practice problem 3.23
int fun_b(unsigned x) {
int val = 0;
int i;
for ( ____;_____;_____) {
}
return val;
}
The gcc C compiler generates the following assembly code:
x at %ebp+8
// what I've gotten so far
1 movl 8(%ebp), %ebx // ebx: x
2 movl $0, %eax // eax: val, set to 0 since eax is where the return
// value is stored and val is being returned at the end
3 movl $0, %ecx // ecx: i, set to 0
4 .L13: // loop
5 leal (%eax,%eax), %edx // edx = val+val
6 movl %ebx, %eax // val = x (?)
7 andl $1, %eax // x = x & 1
8 orl %edx, %eax // x = (val+val) | (x & 1)
9 shrl %ebx // x = x >> 1 (shift right by 1)
10 addl $1, %ecx // i++
11 cmpl $32, %ecx // if i < 32 jump back to loop
12 jne .L13
There was a similar post on the same problem with the solution, but I'm looking for more of a walk-through and explanation of the assembly code line by line.
You already seem to have the meaning of the instructions figured out. The comments on lines 7-8 are slightly wrong, however, because those instructions assign to eax, which is val, not x:
7 andl $1, %eax // val = val & 1 = x & 1
8 orl %edx, %eax // val = (val+val) | (x & 1)
Putting this into the C template could be:
for(i = 0; i < 32; i++, x >>= 1) {
val = (val + val) | (x & 1);
}
Note that (val + val) is just a left shift by one, so this function shifts bits out of x on the right and shifts them into val from the right. As such, it mirrors (reverses) the bits of x.
PS: if the body of the for must be empty you can of course merge it into the third expression.
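As a quick check of the "mirroring" claim, here is a minimal sketch of the recovered loop. The name reverse_bits32 and the use of uint32_t are my assumptions (the book's template uses int for val); unsigned arithmetic is used so that the shift never overflows a signed value.

```c
#include <stdint.h>

/* Reverse the 32 bits of x, following the loop recovered from the assembly.
   Hypothetical name; uint32_t is an assumption, the book's template returns int. */
uint32_t reverse_bits32(uint32_t x) {
    uint32_t val = 0;
    for (int i = 0; i < 32; i++, x >>= 1) {
        /* val + val == val << 1: make room on the right, then copy x's lowest bit in */
        val = (val << 1) | (x & 1u);
    }
    return val;
}
```

For example, an input with only the lowest bit set should come back with only the highest bit set.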
I'm having trouble understanding how the following equivalents work:
x86-64:
/*long loop(long x, int n)*/
/*x in %rdi, n in %esi*/
loop:
movl %esi, %ecx
movl $1, %edx
movl $0, %eax
jmp .L2
.L3:
movq %rdi,%r8
andq %rdx,%r8
orq %r8,%rax
salq %cl,%rdx
.L2:
testq %rdx,%rdx
jne .L3
rep; ret
C:
long loop(long x, int n)
{
long result = 0;
long mask;
for (mask = 1; mask != 0; mask = mask << n) {
result |= (x & mask);
}
return result;
}
From what I see,
n = %esi and is copied into %ecx.
1 is copied into mask.
0 is copied into result.
I would like to know why 1 is copied to mask when the first variable in the C code is result? Wouldn't result = 1 and mask = 0 since that is the correct order in the C program? Furthermore, when I convert the C code to assembly language, I get:
loop:
movl %rsi, %rcx
movl $1, %eax
movl $0, %edx
jmp .L2
...
So are the registers %eax and %edx interchangeable?
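To see what the C loop actually computes, independent of which register the compiler picks for result or mask, here is a small sketch. The name select_bits and the unsigned 64-bit types are assumptions of mine; the original uses long, where shifting a 1 into the sign bit is not well defined in C.

```c
#include <stdint.h>

/* Sketch of the loop from the question using unsigned 64-bit values.
   Hypothetical name. Assumes 1 <= n <= 63: n == 0 would never terminate,
   and a shift count of 64 or more is undefined in C. */
uint64_t select_bits(uint64_t x, unsigned n) {
    uint64_t result = 0;
    /* mask starts at 1 and moves left by n each time until the bit falls off the top */
    for (uint64_t mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);   /* keep only the bit of x that mask selects */
    }
    return result;
}
```

With n = 8, for instance, only bits 0, 8, 16, ... of x survive in the result.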
I'm to convert the following AT&T x86 assembly into C:
movl 8(%ebp), %edx
movl $0, %eax
movl $0, %ecx
jmp .L2
.L1:
shll $1, %eax
movl %edx, %ebx
andl $1, %ebx
orl %ebx, %eax
shrl $1, %edx
addl $1, %ecx
.L2:
cmpl $32, %ecx
jl .L1
leave
But must adhere to the following skeleton code:
int f(unsigned int x) {
int val = 0, i = 0;
while(________) {
val = ________________;
x = ________________;
i++;
}
return val;
}
I can tell that the snippet
.L2:
cmpl $32, %ecx
jl .L1
can be interpreted as while(i<32). I also know that x is stored in %edx, val in %eax, and i in %ecx. However, I'm having a hard time converting the assembly within the while/.L1 loop into condensed high-level language that fits into the provided skeleton code. For example, can shll, shrl, orl, and andl simply be written using their direct C equivalents (<<,>>,|,&), or is there some more nuance to it?
Is there a standardized guide/"cheat sheet" for Assembly-to-C conversions?
I understand assembly to high-level conversion is not always clear-cut, but there are certainly patterns in assembly code that can be consistently interpreted as certain C operations.
For example, can shll, shrl, orl, and andl simply be written using
their direct C equivalents (<<,>>,|,&), or is there some more nuance
to it?
They can. Let's examine the loop body step by step:
shll $1, %eax // shift left eax by 1, same as "eax<<1" or even "eax*=2"
movl %edx, %ebx
andl $1, %ebx // ebx &= 1
orl %ebx, %eax // eax |= ebx
shrl $1, %edx // shift right edx by 1, same as "edx>>1" = "edx/=2"
gets us to
%eax *=2
%ebx = %edx
%ebx = %ebx & 1
%eax |= %ebx
%edx /= 2
The calling convention tells us that %edx is x (it is loaded from 8(%ebp), the first argument), and %eax (the return value) is val:
val *=2
%ebx = x // a
%ebx = %ebx & 1 // b
val |= %ebx // c
x /= 2
combine a,b,c: #1 insert a into b:
val *=2
%ebx = (x & 1) // b
val |= %ebx // c
x /= 2
combine a,b,c: #2 insert b into c:
val *=2
val |= (x & 1)
x /= 2
final step: combine both 'val =' into one
val = 2*val | (x & 1)
x /= 2
Filled into the skeleton, this gives: while (i < 32) { val = (val << 1) | (x & 1); x = x >> 1; i++; }. Note, however, that val and the return value should be unsigned, and they aren't in your template. The function returns the bits of x reversed.
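One way to convince yourself that the condensing is valid is to write the literal, instruction-by-instruction translation next to the condensed loop and compare their results. The sketch below does that; the names f_steps and f_condensed are hypothetical, unsigned is used throughout (unlike the template), and a 32-bit unsigned int is assumed.

```c
/* Literal translation: one C statement per instruction of the loop body.
   Variables are named after the registers they stand for. */
unsigned f_steps(unsigned x) {
    unsigned eax = 0, ebx, edx = x;
    for (int ecx = 0; ecx < 32; ecx++) {
        eax <<= 1;      /* shll $1, %eax   */
        ebx = edx;      /* movl %edx, %ebx */
        ebx &= 1;       /* andl $1, %ebx   */
        eax |= ebx;     /* orl %ebx, %eax  */
        edx >>= 1;      /* shrl $1, %edx   */
    }
    return eax;
}

/* Condensed form that fits the while-loop skeleton. */
unsigned f_condensed(unsigned x) {
    unsigned val = 0;
    int i = 0;
    while (i < 32) {
        val = (val << 1) | (x & 1);
        x = x >> 1;
        i++;
    }
    return val;
}
```

Both functions should agree on every input, since the condensed version only removes the temporary that stood in for %ebx.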
The actual answer to your question is more complicated and is pretty much: no, there is no such guide, and it can't exist, because compilation loses information and you can't recreate that lost information from the assembly. But you can often make a good educated guess.
My task was to print all whole numbers from 2 to N whose binary representation contains more '1' bits than '0' bits.
int CountOnes(unsigned int x)
{
unsigned int iPassedNumber = x; // number to be modified
unsigned int iOriginalNumber = iPassedNumber;
unsigned int iNumbOfOnes = 0;
while (iPassedNumber > 0)
{
iPassedNumber = iPassedNumber >> 1 << 1; //if LSB was '1', it turns to '0'
if (iOriginalNumber - iPassedNumber == 1) //if difference == 1, then we increment numb of '1'
{
++iNumbOfOnes;
}
iOriginalNumber = iPassedNumber >> 1; //do this to operate with the next bit
iPassedNumber = iOriginalNumber;
}
return (iNumbOfOnes);
}
Here is my function to calculate the number of '1' in binary. It was my homework in college. However, my teacher said that it would be more efficient to
{
if(n%2==1)
++CountOnes;
else(n%2==0)
++CountZeros;
}
In the end, I just got confused and don't know which is better. What do you think about this?
I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to find the most optimized way of doing something, you want to see what kind of code the compiler produces. Look at the CPU's manual to see which operations are fast and which are slow on that particular architecture, although there are general guidelines too. And of course, look for ways to reduce the number of instructions the CPU has to execute.
I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions, however this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
On -fprofile-generate
The double compile really lets gcc do its work (although -O99 most likely does that already); however, mileage may vary depending on which version of gcc you are using.
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
unsigned int one_count = 0;
while(x) {
if(x%2)
++one_count;
x >>= 1;
}
return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x) {
unsigned int count;
for(count = 0; x; count++) {
x &= x - 1; // clear the least significant set bit
}
return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see, the compiler doesn't really find the best way when you count by testing the lowest bit on every iteration (with % in your teacher's version, or with shifts in yours).
Kernighan's method is pretty well optimized (it has the fewest operations in the loop). It is instructive to compare Kernighan's method to the naive method of counting; while on the surface they may look the same, they're really not!
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
This method is much worse than Kernighan's. If, say, only the 32nd bit is set, this loop will run 32 times, whereas Kernighan's loop runs once per set bit, so just once!
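The difference is easy to measure by counting loop iterations instead of bits. The two helpers below are a sketch for illustration only (the names are hypothetical); each returns how many times its loop body runs for a given input.

```c
/* Iterations of the naive loop: one per bit position up to the highest set bit. */
int naive_iterations(unsigned int v) {
    int iterations = 0;
    for (; v; v >>= 1) {
        iterations++;
    }
    return iterations;
}

/* Iterations of Kernighan's loop: one per set bit,
   because x &= x - 1 clears the lowest set bit each time. */
int kernighan_iterations(unsigned int x) {
    int iterations = 0;
    for (; x; x &= x - 1) {
        iterations++;
    }
    return iterations;
}
```

For 0x80000000 (assuming a 32-bit unsigned int), the naive loop runs 32 times and Kernighan's runs once; for a value with all low bits set, such as 0xFF, both run 8 times.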
But all these methods are still rather sub-par because they loop.
If we bring a couple of other pieces of (implicit) knowledge into our algorithms, we can get rid of loops altogether. Those are the size of our number in bits and the size of a character in bits. With these pieces, and given that we have a 64-bit register, we can filter out bits in chunks of 14, 24 or 32 bits.
So for instance, if we look at a 14-bit number then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
This uses % but only once, and it works for all numbers between 0x0 and 0x3fff.
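A sketch of why this works, as far as I can tell: the multiplication places four copies of n at bit offsets 0, 15, 30 and 45, the mask keeps one bit per hex digit so that every bit of n lands in its own digit, and % 0xf adds up the hex digits (because 16 is congruent to 1 modulo 15). The wrapper below checks the expression against a plain loop for every 14-bit value; the function names are hypothetical.

```c
#include <stdint.h>

/* Count the set bits of a value in the range 0..0x3fff with one multiply, one mask, one modulo. */
unsigned popcount14(uint64_t n) {
    return (unsigned)((n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}

/* Returns 1 if popcount14 agrees with a simple loop for all 14-bit inputs, 0 otherwise. */
int popcount14_matches_naive(void) {
    for (unsigned n = 0; n <= 0x3fff; n++) {
        unsigned expected = 0;
        for (unsigned v = n; v; v >>= 1) {
            expected += v & 1u;
        }
        if (popcount14(n) != expected) {
            return 0;
        }
    }
    return 1;
}
```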
For 24 bits we split the number into two 12-bit halves and apply a similar trick to each:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
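Wrapped in a function, the expression above looks like the sketch below. The name popcount24 is hypothetical. Here the mask keeps one bit in every group of 5 bits and % 0x1f sums those groups, once for the low 12 bits and once for the high 12 bits.

```c
#include <stdint.h>

/* Count the set bits of a value in the range 0..0xffffff, 12 bits at a time. */
unsigned popcount24(uint64_t n) {
    uint64_t low  = ((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
    uint64_t high = (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
    return (unsigned)(low + high);
}
```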
But we can generalize this concept by looking at the patterns in the numbers above and realizing that the magic numbers are actually just complements (look at the hex numbers closely: 0x8000 + 0x400 + 0x200 + 0x1), shifted.
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits) (no loops) O(1):
int CountOnes_best(unsigned int n) {
const unsigned char_bits = sizeof(unsigned char) << 3;
typedef __typeof__(n) T; // T is unsigned int in this case;
n = n - ((n >> 1) & (T)~(T)0/3); // reuse n as a temporary
n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
n = (n + (n >> 4)) & (T)~(T)0/255*15;
return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
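It may help to read the disassembly with the constants written in hex: 1431655765 is 0x55555555, 858993459 is 0x33333333, 252645135 is 0x0F0F0F0F and 16843009 is 0x01010101. The sketch below is the same algorithm specialised to 32 bits with those constants spelled out (the name popcount32 and the fixed uint32_t type are my assumptions). Each step adds neighbouring fields: first pairs of bits, then nibbles, then bytes, and the final multiply sums the four bytes into the top byte.

```c
#include <stdint.h>

/* 32-bit specialisation of the loop-free bit count, with the constants written out. */
uint32_t popcount32(uint32_t n) {
    n = n - ((n >> 1) & 0x55555555u);                  /* each 2-bit field holds its own bit count */
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);  /* each 4-bit field holds a count 0..4 */
    n = (n + (n >> 4)) & 0x0F0F0F0Fu;                  /* each byte holds a count 0..8 */
    return (n * 0x01010101u) >> 24;                    /* sum of the four bytes ends up in the top byte */
}
```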
This may be a bit of a jump (how the heck did we get from the previous methods to here?), but just take your time going over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for it is broken. It is also well explained on the very excellent C bit twiddling page.
I highly recommend going over the content of that page; it really is a fantastic read.
Even better than your teacher's suggestion:
if( n & 1 ) {
++ CountOnes;
}
else {
++ CountZeros;
}
n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover there are only two options 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.
Your original code is overcomplex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, and the number of iterations. Also the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted to count the zeros so you'd need four variables to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0 ;
int zeroes = 0 ;
while( n > 0 )
{
if( (n & 1) != 0 )
{
++ones ;
}
else
{
++zeroes ;
}
n >>= 1 ;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.
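Packaged as functions, the loop above could look like the following sketch. The names count_bits and has_more_ones are hypothetical, and I am assuming that "zeros" means the zero bits below the highest set bit (leading zeros are not counted), which is what the loop naturally produces.

```c
/* Count the 1 bits and 0 bits of x, looking only at positions up to the highest set bit. */
void count_bits(unsigned int x, int *ones, int *zeroes) {
    *ones = 0;
    *zeroes = 0;
    while (x > 0) {
        if ((x & 1) != 0) {
            ++*ones;
        } else {
            ++*zeroes;
        }
        x >>= 1;
    }
}

/* Returns 1 if the binary representation of x has more 1s than 0s, else 0. */
int has_more_ones(unsigned int x) {
    int ones, zeroes;
    count_bits(x, &ones, &zeroes);
    return ones > zeroes;
}
```

The original task then reduces to calling has_more_ones for each number from 2 to N and printing those for which it returns 1.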
The point of this problem is to reverse-engineer C code from the assembly produced by compiling with level 2 optimization. The original C code is as follows (it computes the greatest common divisor):
int gcd(int a, int b){
int returnValue = 0;
if (a != 0 && b != 0){
int r;
int flag = 0;
while (flag == 0){
r = a % b;
if (r ==0){
flag = 1;
} else {
a = b;
b = r;
}
}
returnValue = b;
}
return(returnValue);
}
To compile with optimization, I ran this from the command line:
gcc -O2 -S Problem04b.c
to get the assembly file for this optimized code
.gcd:
.LFB12:
.cfi_startproc
testl %esi, %esi
je .L2
testl %edi, %edi
je .L2
.L7:
movl %edi, %edx
movl %edi, %eax
movl %esi, %edi
sarl $31, %edx
idivl %esi
testl %edx, %edx
jne .L9
movl %esi, %eax
ret
.p2align 4,,10
.p2align 3
.L2:
xorl %esi, %esi
movl %esi, %eax
ret
.p2align 4,,10
.p2align 3
.L9:
movl %edx, %esi
jmp .L7
.cfi_endproc
I need to convert this assembly code back to C code. Here is where I am right now:
int gcd(int a int b){
/*
testl %esi %esi
sets zero flag if a is 0 (ZF) but doesn't store anything
*/
if (a == 0){
/*
xorl %esi %esi
sets the value of a variable to 0. More compact than movl
*/
int returnValue = 0;
/*
movl %esi %eax
ret
return the value just assigned
*/
return(returnValue);
}
/*
testl %edi %edi
sets zero flag if b is 0 (ZF) but doesn't store anything
*/
if (b == 0){
/*
xorl %esi %esi
sets the value of a variable to 0. More compact than movl
*/
int returnValue = 0;
/*
movl %esi %eax
ret
return the value just assigned
*/
return(returnValue);
}
do{
int r = b;
int returnValue = b;
}while();
}
Can anyone help me write this back into C code? I'm pretty much lost.
First of all, you have the values mixed in your code. %esi begins with the value b and %edi begins with the value a.
You can infer from the testl %edx, %edx line that %edx is used as the condition variable for the loop beginning with .L7 (if %edx is different from 0 then control is transferred to the .L9 block and then returned to .L7). We'll refer to %edx as remainder in our reverse-engineered code.
Let's begin reverse-engineering the main loop:
movl %edi, %edx
Since %edi stores a, this is equivalent to initializing the value of remainder with a: int remainder = a;.
movl %edi, %eax
Store int temp = a;
movl %esi, %edi
Perform a = b; (remember that %edi is a and %esi is b).
sarl $31, %edx
This arithmetic shift moves remainder 31 bits to the right while preserving the sign of the number, so remainder becomes 0 if a is non-negative and -1 (all bits set) if a is negative. Together with the copy of a in %eax, this sign-extends a into the 64-bit register pair %edx:%eax, which is the dividend that idivl expects.
idivl %esi
Divide the 64-bit value in %edx:%eax by %esi, or in our case, divide temp (sign-extended) by b (the variable). The quotient is stored in %eax and the remainder in %edx, or in our code, remainder. Combined with the previous instruction, this is simply remainder = temp % b.
testl %edx, %edx
jne .L9
Check whether remainder is equal to 0; if it's not, jump to .L9. The code there simply sets b = remainder; before returning to .L7. To implement this in C, we'll keep a count variable that stores the number of times the loop has iterated. We'll perform b = remainder at the beginning of the loop, but only after the first iteration, meaning when count != 0.
We're now ready to build our full C loop:
int count = 0;
do {
if (count != 0)
b = remainder;
temp = a;
a = b;
remainder = temp % b;
count++;
} while (remainder != 0);
And after the loop terminates,
movl %esi, %eax
ret
Will return the GCD that the program computed (in our code it'll be stored in the b variable).
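Putting the pieces together, a complete function that follows the control flow of the assembly might look like the sketch below. The name gcd_reconstructed is hypothetical, and the loop is written without the count variable: it simply does the division, copies b into a, and either returns b or moves the remainder into b, which is the order the instructions execute in.

```c
/* Reconstruction that follows the assembly's control flow:
   .L2 returns 0 if either argument is 0, .L7 is the loop, .L9 is "b = r". */
int gcd_reconstructed(int a, int b) {
    if (b == 0 || a == 0) {
        return 0;               /* .L2: xorl %esi, %esi; movl %esi, %eax; ret */
    }
    for (;;) {
        int r = a % b;          /* sign-extend a into %edx:%eax, then idivl %esi */
        a = b;                  /* movl %esi, %edi */
        if (r == 0) {
            return b;           /* movl %esi, %eax; ret */
        }
        b = r;                  /* .L9: movl %edx, %esi; jmp .L7 */
    }
}
```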
C version:
int arith(int x, int y, int z)
{
int t1 = x+y;
int t2 = z*48;
int t3 = t1 & 0xFFFF;
int t4 = t2 * t3;
return t4;
}
ATT Assembly version of the same program:
x at %ebp+8, y at %ebp+12, z at %ebp+16
movl 16(%ebp), %eax
leal (%eax, %eax, 2), %eax
sall $4, %eax // t2 = z* 48... This is where I get confused
movl 12(%ebp), %edx
addl 8(%ebp), %edx
andl $65535, %edx
imull %edx, %eax
I understand everything it is doing at all points of the program besides the shift left.
I assume it is going to shift left 4 times. Why is that?
Thank you!
Edit: I also understand that the part I'm confused on is equivalent to the z*48 part of the C version.
What I'm not understanding is how does shifting left 4 times equate to z*48.
You missed the leal (%eax, %eax, 2), %eax line. Applying some maths, the assembly code reads:
a := z
a := a + 2*a // a = 3*z
a := a * 2^4 // a = z * 3*16 = z * 48
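The same two steps can be written in C to check the arithmetic. This is only a sketch: the name times48_like_asm is hypothetical, and the shift is written as a multiplication by 16 because left-shifting a negative int is not well defined in C, whereas sall on the machine simply shifts the bits.

```c
/* z * 48 computed the way the compiler did it: (z + 2*z) * 2^4. */
int times48_like_asm(int z) {
    int a = z;          /* movl 16(%ebp), %eax */
    a = a + 2 * a;      /* leal (%eax,%eax,2), %eax  -> 3*z */
    a = a * 16;         /* sall $4, %eax             -> 3*z * 2^4 = 48*z */
    return a;
}
```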