GCC optimization missed opportunity - c

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
movl mode(%rip), %ecx
testl %ecx, %ecx
je .L10
movl aa+4(%rip), %eax
movl aa(%rip), %edx
subl %edx, %eax
ret
.L10:
movl bb+4(%rip), %eax
movl bb(%rip), %edx
subl %edx, %eax
ret
fast():
movl mode(%rip), %esi
testl %esi, %esi
jne .L18
movl bb+4(%rip), %eax
subl bb(%rip), %eax
ret
.L18:
movl aa+4(%rip), %eax
subl aa(%rip), %eax
ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
movq _aa@GOTPCREL(%rip), %rax ; rax = aa (both ints at once)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
movq _mode@GOTPCREL(%rip), %rdx ; rdx = mode
cmpl $0, (%rdx) ; mode == 0 ?
leaq 4(%rcx), %rdx ; rdx = bb[1]
cmovneq %rax, %rcx ; if (mode != 0) rcx = aa
leaq 4(%rax), %rax ; rax = aa[1]
cmoveq %rdx, %rax ; if (mode == 0) rax = bb
movl (%rax), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
fast():
movq _mode@GOTPCREL(%rip), %rax ; rax = mode
cmpl $0, (%rax) ; mode == 0 ?
je LBB1_2 ; if (mode != 0) {
movq _aa@GOTPCREL(%rip), %rcx ; rcx = aa
jmp LBB1_3 ; } else {
LBB1_2: ; // (mode == 0)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
LBB1_3: ; }
movl 4(%rcx), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.

The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
int D.1405;
int mode.3;
int D.1402;
int D.1379;
# BLOCK 2 freq:10000
mode.3_5 = mode;
if (mode.3_5 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:5000
D.1402_6 = aa[1];
D.1405_10 = aa[0];
goto <bb 5>;
# BLOCK 4 freq:5000
D.1402_7 = bb[1];
D.1405_11 = bb[0];
# BLOCK 5 freq:10000
D.1379_3 = D.1402_17 - D.1405_12;
return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
int D.1377;
int D.1376;
int D.1374;
int D.1373;
int mode.1;
int D.1368;
# BLOCK 2 freq:10000
mode.1_2 = mode;
if (mode.1_2 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:3900
D.1373_3 = aa[1];
D.1374_4 = aa[0];
D.1368_5 = D.1373_3 - D.1374_4;
goto <bb 5>;
# BLOCK 4 freq:6100
D.1376_6 = bb[1];
D.1377_7 = bb[0];
D.1368_8 = D.1376_6 - D.1377_7;
# BLOCK 5 freq:10000
return D.1368_1;
}
GCC relies on the instruction combining pass to handle cases like this (apparently not on the peephole optimization pass), and combining works within the scope of a basic block. That's why the subtraction and the load are combined into a single insn in fast(), while in slow() they aren't even considered for combining.
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks that contain the loads. Now there is an opportunity for the combiner to merge the load and the subtraction, but unfortunately the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process, with hard registers already allocated).

I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to re-organize your code to achieve similar performance. Instead of organizing your code the way you have done so in slow() or fast(), I would recommend that you define an inline function that returns either aa or bb based on mode without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
movl mode, %edx
xorl %eax, %eax
testl %edx, %edx
setne %al
movl xx.1369(,%eax,4), %edx
movl 4(%edx), %eax
subl (%edx), %eax
ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().
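A self-contained sketch of the table-lookup variant, useful for checking that it agrees with fast() in both modes. The `static inline` qualifiers, the `(void)` parameter lists and the name `slow_via_table` are my additions, not part of the answer above (plain `inline` without `static` can fail to link in C99 when a call is not inlined). Whether the generated assembly matches kwik() depends on the compiler version and flags.

```c
int mode;      /* use aa if nonzero, else bb */
int aa[2];
int bb[2];

/* Select the active array once, without a branch in the source. */
static inline int *xx(void)
{
    static int *const table[] = { bb, aa };
    return table[!!mode];
}

static inline int auto0(void) { return xx()[0]; }
static inline int auto1(void) { return xx()[1]; }

/* Accessor-based version, shaped like the original slow(). */
int slow_via_table(void) { return auto1() - auto0(); }

/* Reference version with a single explicit branch. */
int fast(void) { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
```

Compiling this with -O2 -S and comparing slow_via_table with fast is a quick way to see what a particular GCC version does with it.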

Have you tried modifying the compiler's internal parameters (--param name=value in the man page)? They are not changed by any optimization level (with three minor exceptions).
Some of them control code reduction/deduplication.
For some optimizations in this section you can read things like « larger values can exponentially increase compilation time ».

Related

Counting '1' in number in C

My task was to print all whole numbers from 2 to N whose binary representation contains more '1's than '0's.
int CountOnes(unsigned int x)
{
unsigned int iPassedNumber = x; // number to be modified
unsigned int iOriginalNumber = iPassedNumber;
unsigned int iNumbOfOnes = 0;
while (iPassedNumber > 0)
{
iPassedNumber = iPassedNumber >> 1 << 1; //if LSB was '1', it turns to '0'
if (iOriginalNumber - iPassedNumber == 1) //if difference == 1, then we increment numb of '1'
{
++iNumbOfOnes;
}
iOriginalNumber = iPassedNumber >> 1; //do this to operate with the next bit
iPassedNumber = iOriginalNumber;
}
return (iNumbOfOnes);
}
Here is my function to calculate the number of '1's in binary. It was my homework in college. However, my teacher said that it would be more efficient to do this:
{
if(n%2==1)
++CountOnes;
else if(n%2==0)
++CountZeros;
}
In the end, I just got confused and don't know which is better. What do you think about this?
I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to figure out the most optimized method for doing something, you want to see what kind of code the compiler produces. Look at the CPU's manual and see which operations are fast and which are slow on that particular architecture, although there are general guidelines as well. And, of course, look for ways to reduce the number of instructions the CPU has to perform.
I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at the optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions; however, this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
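On -fprofile-generate
The double compile really lets gcc go to work (although -O99, which gcc treats as its highest level, most likely does that already). The instrumented binary has to be run between the two compiles so that profile data exists for -fprofile-use. Mileage may vary based on which version of gcc you are using.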
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
unsigned int one_count = 0;
while(x) {
if(x%2)
++one_count;
x >>= 1;
}
return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x) {
unsigned int count;
for(count = 0; x; count++) {
x &= x - 1; // clear least sig bit
}
return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see, the compiler doesn't produce the shortest loop when you examine one bit position per iteration, with % or with shifts (which is what both you and your teacher do).
Kernighan's method is pretty optimized (it has the fewest operations in the loop). It is instructive to compare Kernighan's method to the naive method of counting; on the surface they may look the same, but they really are not!
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
This method is much worse than Kernighan's. If, say, only the 32nd bit is set, this loop will run 32 times, whereas Kernighan's runs once per set bit, so just once here!
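To make the difference concrete, here is a small sketch that counts loop iterations rather than bits. The function names are hypothetical, and the figure of 32 iterations assumes a 32-bit unsigned int.

```c
/* Naive loop: one iteration per bit position up to the highest set bit. */
unsigned naive_iterations(unsigned int v)
{
    unsigned iterations = 0;
    for (; v; v >>= 1)
        ++iterations;
    return iterations;
}

/* Kernighan's loop: each pass clears the lowest set bit,
   so there is one iteration per set bit. */
unsigned kernighan_iterations(unsigned int v)
{
    unsigned iterations = 0;
    for (; v; v &= v - 1)
        ++iterations;
    return iterations;
}
```

For 0x80000000 the naive loop runs 32 times and Kernighan's runs once; for a value with every bit set, both run the same number of times.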
But all these methods are still rather sub-par because they loop.
If we bring a couple of other pieces of (implicit) knowledge into our algorithms, we can get rid of loops altogether: the size of our number in bits and the size of a character in bits. Given a 64-bit register, we can then count bits in chunks of 14, 24 or 32 bits at a time.
So for instance, if we look at a 14-bit number then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
uses % but only once for all numbers between 0x0 and 0x3fff
For 24 bits we split the value into two 12-bit halves and apply a similar trick to each:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
But we can generalize this concept by noticing the pattern in the numbers above: the magic multipliers are just a single bit repeated at a fixed stride (0x200040008001 has bits 0, 15, 30 and 45 set), so the multiplication lays shifted copies of n side by side and the mask then picks each bit of n exactly once.
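The two expressions above can be wrapped into functions and checked against a plain loop. This is only a sketch: the names count_loop, count14 and count24 are hypothetical, and each function is valid only for the input range stated in its comment.

```c
#include <stdint.h>

/* Reference: plain loop over the bits. */
unsigned count_loop(uint32_t v)
{
    unsigned c = 0;
    for (; v; v >>= 1)
        c += v & 1u;
    return c;
}

/* Valid only for v <= 0x3fff (14 bits): each bit of v ends up alone
   in its own 4-bit field, and % 0xf adds the fields together. */
unsigned count14(uint32_t v)
{
    return (unsigned)((v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}

/* Valid only for v <= 0xffffff (24 bits): two 12-bit halves,
   each spread over 5-bit fields and summed with % 0x1f. */
unsigned count24(uint32_t v)
{
    unsigned c;
    c  = (unsigned)(((v & 0xfff) * 0x1001001001001ULL
                     & 0x84210842108421ULL) % 0x1f);
    c += (unsigned)((((v & 0xfff000) >> 12) * 0x1001001001001ULL
                     & 0x84210842108421ULL) % 0x1f);
    return c;
}
```

The modulo works because every field holds 0 or 1, so the value modulo (field base - 1) equals the sum of the fields, which never exceeds 14 or 12 respectively.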
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits) (no loops) O(1):
unsigned int CountOnes_best(unsigned int n) {
const unsigned char_bits = sizeof(unsigned char) << 3;
typedef __typeof__(n) T; // T is unsigned int in this case;
n = n - ((n >> 1) & (T)~(T)0/3); // reuse n as a temporary
n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
n = (n + (n >> 4)) & (T)~(T)0/255*15;
return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
This may be a bit of a jump from the previous methods (how the heck did we get from there to here?), but just take your time going over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for it is broken. It is also well explained on the excellent Bit Twiddling Hacks page.
I highly recommend going over the content of that page; it really is a fantastic read.
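For readers who find the type-generic constants hard to parse, here is the same computation written out for a fixed 32-bit width. It is a sketch with a hypothetical name, popcount32; the constants are the ones that appear in decimal in the disassembly above (1431655765 = 0x55555555, 858993459 = 0x33333333, 252645135 = 0x0f0f0f0f, 16843009 = 0x01010101).

```c
#include <stdint.h>

unsigned popcount32(uint32_t n)
{
    /* Each 2-bit field now holds the number of ones it contained. */
    n = n - ((n >> 1) & 0x55555555u);
    /* Add neighbouring 2-bit sums into 4-bit fields. */
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
    /* Add neighbouring 4-bit sums into 8-bit fields. */
    n = (n + (n >> 4)) & 0x0f0f0f0fu;
    /* The multiplication sums the four bytes into the top byte. */
    return (unsigned)((n * 0x01010101u) >> 24);
}
```

There is no loop and no data-dependent branch, so the running time does not depend on how many bits are set.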
Even better than your teacher's suggestion:
if( n & 1 ) {
++ CountOnes;
}
else {
++ CountZeros;
}
n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover there are only two options 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.
Your original code is overly complex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, the number of iterations, and the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted counting the zeros, so you'd need four variables to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0 ;
int zeroes = 0 ;
while( n > 0 )
{
if( (n & 1) != 0 )
{
++ones ;
}
else
{
++zeroes ;
}
n >>= 1 ;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.
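Putting the loop above to work on the actual assignment might look like the sketch below. The names has_more_ones and print_matching are hypothetical, and the code assumes that leading zeros are not counted, which is how the loop above behaves since it stops once n reaches 0.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if the binary form of x (without leading zeros)
   has more 1s than 0s. */
bool has_more_ones(unsigned int x)
{
    unsigned ones = 0, zeroes = 0;
    while (x > 0) {
        if (x & 1u)
            ++ones;
        else
            ++zeroes;
        x >>= 1;
    }
    return ones > zeroes;
}

/* Prints every number from 2 to n that qualifies.
   Assumes n is smaller than UINT_MAX so the counter cannot wrap. */
void print_matching(unsigned int n)
{
    for (unsigned int v = 2; v <= n; ++v)
        if (has_more_ones(v))
            printf("%u\n", v);
}
```

For example, 5 (binary 101) qualifies, while 2 (binary 10) and 9 (binary 1001) do not.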

Why -O1 is faster than -O2 for 10000 times?

Below is a C function to evaluate a polynomial:
/* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */
/* from CSAPP Ex.5.5, modified to integer version */
int poly(int a[], int x, int degree) {
long int i;
int result = a[0];
int xpwr = x;
for (i = 1; i <= degree; ++i) {
result += a[i]*xpwr;
xpwr *= x;
}
return result;
}
And a main function:
#define TIMES 100000ll
int main(void) {
long long int i;
unsigned long long int result = 0;
for (i = 0; i < TIMES; ++i) {
/* g_a is an int[10000] global variable with all elements equals to 1 */
/* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */
result += poly(g_a, 2, 9999);
}
printf("%lld\n", result);
return 0;
}
When I compile the program with GCC using -O1 and -O2 separately, I find that -O1 is MUCH faster than -O2.
Platform details:
i5-4600
Arch Linux x86_64 with kernel 3.18
GCC 4.9.2
gcc -O1 -o /tmp/a.out test.c
gcc -O2 -o /tmp/a.out test.c
Result:
When TIMES = 100000ll, -O1 prints the result instantly, while -O2 needs 0.36s
When TIMES = 1000000000ll, -O1 prints the result in 0.28s, -O2 takes so long that I didn't finish the test
It seems that -O1 is approximately 10000 times faster than -O2.
When I test it on Mac (clang-600.0.56), the result is even more weird: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll
I have tested the following changes:
makes g_a random (elements are from 1 to 10)
x = 19234 (or some other number)
use int instead of long long int
And the results are the same.
I tried to look at the assembly code; it seems that -O1 calls the poly function while -O2 inlines it. But inlining should make the performance better, shouldn't it?
What makes these huge differences? Why can -O1 on clang make the program so fast? Is -O1 doing something wrong? (I cannot check the result, as it is too slow without optimization.)
Here is the assembly code of main for -O1: (you may get it by adding -S option to gcc)
main:
.LFB12:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $9999, %edx
movl $2, %esi
movl $g_a, %edi
call poly
movslq %eax, %rdx
movl $100000, %eax
.L6:
subq $1, %rax
jne .L6
imulq $100000, %rdx, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
And for -O2:
main:
.LFB12:
.cfi_startproc
movl g_a(%rip), %r9d
movl $100000, %r8d
xorl %esi, %esi
.p2align 4,,10
.p2align 3
.L8:
movl $g_a+4, %eax
movl %r9d, %ecx
movl $2, %edx
.p2align 4,,10
.p2align 3
.L7:
movl (%rax), %edi
addq $4, %rax
imull %edx, %edi
addl %edx, %edx
addl %edi, %ecx
cmpq $g_a+40000, %rax
jne .L7
movslq %ecx, %rcx
addq %rcx, %rsi
subq $1, %r8
jne .L8
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC1, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Although I don't know much about assembly, it is obvious that -O1 just calls poly once and multiplies the result by 100000 (imulq $100000, %rdx, %rsi). This is the reason it is so fast.
It seems that gcc can detect that poly is a pure function with no side effects. (It would be interesting to see what happens if another thread modified g_a while poly is running...)
On the other hand, -O2 has inlined the poly function, so it has no chance to treat poly as a pure function.
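The transformation visible in the -O1 assembly can be written out by hand. The sketch below uses hypothetical helper names (sum_by_loop, sum_by_multiply) and small inputs chosen so that nothing overflows; it only illustrates why one call plus one multiplication gives the same total as the loop when poly has no side effects and its arguments never change.

```c
int poly(const int a[], int x, int degree)
{
    int result = a[0];
    int xpwr = x;
    for (long i = 1; i <= degree; ++i) {
        result += a[i] * xpwr;
        xpwr *= x;
    }
    return result;
}

/* What the source says: call poly() on every iteration. */
unsigned long long sum_by_loop(const int a[], int x, int degree,
                               long long times)
{
    unsigned long long result = 0;
    for (long long i = 0; i < times; ++i)
        result += poly(a, x, degree);
    return result;
}

/* What the -O1 code effectively does: one call, one multiplication. */
unsigned long long sum_by_multiply(const int a[], int x, int degree,
                                   long long times)
{
    return (unsigned long long)times
         * (unsigned long long)poly(a, x, degree);
}
```

Both versions convert the int result to unsigned long long before accumulating, so they agree modulo 2^64 even when poly returns a negative value.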
I have further done some research:
I cannot find the actual flag used by -O1 that does the pure function checking.
I have tried all the flags listed by gcc -Q -O1 --help=optimizers individually, but none of them has the effect.
Maybe it needs a combination of flags to get the effect, but it is very hard to try all the combinations.
But I have found the flag used by -O2 that makes the effect disappear: -finline-small-functions. The name of the flag explains itself.
One thing that jumps out at me is that you're overflowing signed integers. The behaviour of this is undefined in C. Specifically, int result won't be able to hold pow(2,9999). I don't see what the point is of benchmarking code with undefined behaviour?
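One way to keep the benchmark while avoiding the undefined behaviour is to do the arithmetic in an unsigned type, where overflow wraps around by definition. The sketch below is an assumption about how one might restructure it, not the original poster's code; poly_u32 is a hypothetical name and the result is the polynomial value modulo 2^32.

```c
#include <stdint.h>

/* Same loop as poly(), but in uint32_t so overflow wraps modulo 2^32
   instead of being undefined. */
uint32_t poly_u32(const uint32_t a[], uint32_t x, long degree)
{
    uint32_t result = a[0];
    uint32_t xpwr = x;
    for (long i = 1; i <= degree; ++i) {
        result += a[i] * xpwr;
        xpwr *= x;
    }
    return result;
}
```

With 10000 coefficients equal to 1 and x = 2, the exact sum is 2^10000 - 1, so the wrapped result is 0xFFFFFFFF.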

Faster technique for integer equality in C

I am writing a function which checks if two integers are the same. I wrote it in two different ways. I want to know if there is any performance difference.
Technique 1
int checkEqual(int a ,int b)
{
if (a == b)
return 1; //it means they were equal
else
return 0;
}
Technique 2
int checkEqual(int a ,int b)
{
if (!(a - b))
return 1; //it means they are equal
else
return 0;
}
In short, there is no difference in performance.
I compiled each technique using gcc-4.8.2 with the -O2 -S options (-S generates assembly code).
Technique 1
checkEqual1:
.LFB24:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
Technique 2
checkEqual2:
.LFB25:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
These are exactly the same assembly code.
So these two codes will provide the same performance.
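One caveat about Technique 2 that the assembly does not show: a - b can overflow for some inputs, and signed overflow is undefined behaviour in C, so the two techniques are not strictly equivalent at the source level. The sketch below, with hypothetical names, shows the plain comparison next to a bitwise variant that cannot overflow. Whether the XOR version compiles to the same cmpl/sete sequence is an assumption to check with your own compiler.

```c
#include <limits.h>

/* Technique 1: defined for every pair of ints. */
int check_equal_cmp(int a, int b)
{
    return a == b;
}

/* Bitwise variant: a ^ b is zero exactly when a == b, and unlike
   a - b it cannot overflow (e.g. for a = INT_MIN, b = 1). */
int check_equal_xor(int a, int b)
{
    return !(a ^ b);
}
```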
Appendix
bool checkEqual3(int a, int b) { return a == b; }
provides
checkEqual3:
.LFB26:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
exactly the same assembly code too!
It doesn't make any sense whatsoever to discuss manual code optimization without a specific system in mind.
That being said, you should always leave optimizations like these to the compiler and focus on writing as readable code as possible.
Your code can be made more readable by using only one return statement. Also, indent your code.
int checkEqual (int a, int b)
{
return a == b;
}

Why is my more complicated C loop faster?

I am looking at the performance of memchr-like functions and made an interesting observation.
This is check.c with 4 implementations to find the offset of a \n character in a string:
#include <stdlib.h>
size_t mem1(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n') return (p - s);
p++;
}
}
size_t mem2(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
p++;
}
}
size_t mem3(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n' || x == '\0') return (p - s);
p++;
}
}
size_t mem4(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n')) return (p - s);
p++;
}
}
I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is 10000 times abcd, then the newline to find, and then hello. Of course all 4 implementations return the same offset: 40000 as expected.
Interestingly, when compiled with gcc -O2, the run times on that string are:
mem1: 16 us
mem2: 12 us
mem3: 25 us
mem4: 16 us
(I'm using the criterion library to measure these times with statistical accuracy.)
I cannot explain this to myself. Why is mem2 so much faster than the others?
--
The assembly as generated by gcc -S -O2 -o check.asm check.c:
mem1:
.LFB14:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L9
.L6:
addq $1, %rax
cmpb $10, (%rax)
jne .L6
subq %rdi, %rax
ret
.L9:
xorl %eax, %eax
ret
mem2:
.LFB15:
movq %rdi, %rax
jmp .L13
.L19:
cmpb $10, %dl
je .L14
.L11:
addq $1, %rax
.L13:
movzbl (%rax), %edx
cmpb $36, %dl
jg .L11
testb %dl, %dl
jne .L19
.L14:
subq %rdi, %rax
ret
mem3:
.LFB16:
movzbl (%rdi), %edx
testb %dl, %dl
je .L26
cmpb $10, %dl
movq %rdi, %rax
jne .L27
jmp .L26
.L30:
cmpb $10, %dl
je .L23
.L27:
addq $1, %rax
movzbl (%rax), %edx
testb %dl, %dl
jne .L30
.L23:
subq %rdi, %rax
ret
.L26:
xorl %eax, %eax
ret
mem4:
.LFB17:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L38
.L36:
addq $1, %rax
cmpb $10, (%rax)
jne .L36
subq %rdi, %rax
ret
.L38:
xorl %eax, %eax
ret
Any explanation is very appreciated!
My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1, you have a circular dependency on rax. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.
In mem2 the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax and edx/dl - is probably giving the out-of-order execution hardware an easier time thus it ends up pipelining more efficiently.
I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.
Edit: Out of interest, I've tested mem1 and mem2 on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time, I get ~19s for mem1 and ~18.8s for mem2 - much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...
Your input is what makes mem2 faster. Every letter in the input apart from '\n' has a value larger than '$', so the condition is already false after the first part of the expression (x <= '$'), and the second part (x == '\n' || x == '\0') is never evaluated. If you used "####" instead of "abcd", I suspect the execution would become slower.
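The effect of the input on the short-circuit can be counted directly. The sketch below uses a hypothetical helper, second_test_count, that walks the string like mem2 does and reports how many characters get past the first test. It says nothing about timing, only about how often the second comparison is reached.

```c
#include <stddef.h>

/* Counts how many characters, up to and including the first '\n' or
   '\0', pass the cheap first test (x <= '$') and therefore reach the
   second comparison. */
size_t second_test_count(const char *s)
{
    size_t reached = 0;
    for (const char *p = s; ; p++) {
        const char x = *p;
        if (x <= '$') {
            ++reached;
            if (x == '\n' || x == '\0')
                return reached;
        }
    }
}
```

For "abcdabcd\nhello" only the newline reaches the second test, whereas for "####\nhello" every character does, because '#' is smaller than '$'.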
With a cache, the test of mem1() takes the brunt of filling the cache.
Run the mem1() test first and then again last, and use the second time, as it reflects a primed cache like the other tests. I am confident it will be faster and a fairer time comparison.

Assembly language to C

So I have the following assembly language code which I need to convert into C. I am confused on a few lines of the code.
I understand that this is a for loop. I have added my comments on each line.
I think the for loop goes like this
for (int i = 1; i > 0; i << what?) {
//Calculate result
}
What is the test condition? And how do I change it?
Looking at the assembly code, what does the variable 'n' do?
This is x86 in AT&T syntax, so the operand order is movl source, dest
movl 8(%ebp), %esi //Get x
movl 12(%ebp), %ebx //Get n
movl $-1, %edi //This should be result
movl $1, %edx //The i of the loop
.L2:
movl %edx, %eax
andl %esi, %eax
xorl %eax, %edi //result = result ^ (i & x)
movl %ebx, %ecx //Why do we do this? As we never use %ebx or %ecx again
sall %cl, %edx //Where did %cl come from?
testl %edx, %edx //Tests if i != what? - condition of the for loop
jne .L2 //Loop again
movl %edi, %eax //Otherwise return result.
sall %cl, %edx shifts %edx left by %cl bits. (%cl, for reference, is the low byte of %ecx.) The subsequent testl tests whether that shift zeroed out %edx.
The jne is called that because it's often used in the context of comparisons, which in ASM are often just subtractions. The flags would be set based on the difference; ZF would be set if the items are equal (since x - x == 0). It's also called jnz; the two mnemonics name the same instruction, and the GNU assembler accepts both.
All together, the three instructions translate to i <<= n; if (i != 0) goto L2;. That plus the label seem to make a for loop.
for (i = 1; i != 0; i <<= n) { result ^= i & x; }
Or, more correctly (but achieving the same goal), a do...while loop.
i = 1;
do { result ^= i & x; i <<= n; } while (i != 0);
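Wrapped into a complete function, the do...while form could look like the sketch below. The function name decoded_loop and the use of unsigned types are my assumptions: unsigned arithmetic is used because shifting a signed 1 past the sign bit is undefined in C, and the sketch is only meaningful for 1 <= n <= 31 (with n == 0 the loop never terminates, as in the assembly).

```c
#include <stdint.h>

uint32_t decoded_loop(uint32_t x, unsigned n)
{
    uint32_t result = 0xFFFFFFFFu;   /* movl $-1, %edi */
    uint32_t i = 1;                  /* movl $1, %edx  */
    do {
        result ^= i & x;             /* andl, xorl     */
        i <<= n;                     /* sall %cl, %edx */
    } while (i != 0);                /* testl, jne     */
    return result;
}
```

With n == 1 the mask i visits every bit once, so the function returns the bitwise complement of x; larger n values only flip every n-th bit of x into the result.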

Resources