Does GCC generate suboptimal code for static branch prediction? - c

From my university course I heard that, by convention, it is better to place the more probable condition in the if branch rather than in the else branch, which may help the static branch predictor. For instance:
if (check_collision(player, enemy)) { // very unlikely to be true
doA();
} else {
doB();
}
may be rewritten as:
if (!check_collision(player, enemy)) {
doB();
} else {
doA();
}
I found a blog post Branch Patterns, Using GCC, which explains this phenomenon in more detail:
Forward branches are generated for if statements. The rationale for
making them not likely to be taken is that the processor can take
advantage of the fact that instructions following the branch
instruction may already be placed in the instruction buffer inside the
Instruction Unit.
Next to it, it says (emphasis mine):
When writing an if-else statement, always make the "then" block more
likely to be executed than the else block, so the processor can take
advantage of instructions already placed in the instruction fetch
buffer.
Finally, there is an article written by Intel, Branch and Loop Reorganization to Prevent Mispredicts, which summarizes this with two rules:
Static branch prediction is used when there is no data collected by the
microprocessor when it encounters a branch, which is typically the
first time a branch is encountered. The rules are simple:
A forward branch defaults to not taken
A backward branch defaults to taken
In order to effectively write your code to take advantage of these
rules, when writing if-else or switch statements, check the most
common cases first and work progressively down to the least common.
As I understand it, the idea is that a pipelined CPU may follow the instructions from the instruction cache without breaking its flow by jumping to another address within the code segment. I am aware, though, that this may be largely oversimplified in the case of modern CPU microarchitectures.
However, it looks like GCC doesn't respect these rules. Given the code:
extern void foo();
extern void bar();
int some_func(int n)
{
if (n) {
foo();
}
else {
bar();
}
return 0;
}
it generates (version 6.3.0 with -O3 -mtune=intel):
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
jne .L6 ; here, forward branch if (n) is (conditionally) taken
call bar
xor eax, eax
lea rsp, [rsp+8]
ret
.L6:
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
The only way I found to force the desired behavior is to rewrite the if condition using __builtin_expect as follows:
if (__builtin_expect(n, 1)) { // force n condition to be treated as true
so the assembly code would become:
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
je .L2 ; here, the unlikely case (n == 0) jumps forward, so the likely call to foo falls through
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
.L2:
call bar
xor eax, eax
lea rsp, [rsp+8]
ret

The short answer: no, it is not.
GCC performs a metric ton of non-trivial optimizations, and one of them is guessing branch probabilities based on the control flow graph.
According to GCC manual:
-fno-guess-branch-probability
Do not guess branch probabilities using heuristics.
GCC uses heuristics to guess branch probabilities if they are not
provided by profiling feedback (-fprofile-arcs). These heuristics are
based on the control flow graph. If some branch probabilities are
specified by __builtin_expect, then the heuristics are used to guess
branch probabilities for the rest of the control flow graph, taking
the __builtin_expect info into account. The interactions between the
heuristics and __builtin_expect can be complex, and in some cases, it
may be useful to disable the heuristics so that the effects of
__builtin_expect are easier to understand.
-freorder-blocks may swap branches as well.
Also, as the OP mentioned, the behavior can be overridden with __builtin_expect.
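By the way, the usual way to make __builtin_expect readable is the Linux-kernel-style likely/unlikely wrapper macros; in terms of the OP's example that would look like:
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(check_collision(player, enemy))) {
    doA();   /* rare path */
} else {
    doB();   /* common path */
}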
Proof
Look at the following listing.
#include <stdio.h>

void doA() { printf("A\n"); }
void doB() { printf("B\n"); }
int check_collision(void* a, void* b)
{ return a == b; }
void some_func (void* player, void* enemy) {
if (check_collision(player, enemy)) {
doA();
} else {
doB();
}
}
int main() {
// warming up gcc statistic
some_func((void*)0x1, NULL);
some_func((void*)0x2, NULL);
some_func((void*)0x3, NULL);
some_func((void*)0x4, NULL);
some_func((void*)0x5, NULL);
some_func(NULL, NULL);
return 0;
}
It is obvious that check_collision will return 0 most of the time. So the doB() branch is likely, and GCC can guess this:
gcc -O main.c -o opt.a
objdump -d opt.a
The asm of some_func is:
sub $0x8,%rsp
cmp %rsi,%rdi
je 6c6 <some_func+0x18>
mov $0x0,%eax
callq 68f <doB>
add $0x8,%rsp
retq
mov $0x0,%eax
callq 67a <doA>
jmp 6c1 <some_func+0x13>
But, of course, we can prevent GCC from being too smart:
gcc -fno-guess-branch-probability main.c -o non-opt.a
objdump -d non-opt.a
And we will get:
push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
mov %rdi,-0x8(%rbp)
mov %rsi,-0x10(%rbp)
mov -0x10(%rbp),%rdx
mov -0x8(%rbp),%rax
mov %rdx,%rsi
mov %rax,%rdi
callq 6a0 <check_collision>
test %eax,%eax
je 6ef <some_func+0x33>
mov $0x0,%eax
callq 67a <doA>
jmp 6f9 <some_func+0x3d>
mov $0x0,%eax
callq 68d <doB>
nop
leaveq
retq
So GCC will leave branches in source order.
I used gcc 7.1.1 for those tests.
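If you want the layout to be driven by real measurements rather than heuristics or __builtin_expect, the profiling feedback mentioned in the manual excerpt is the way to go. A minimal sketch of the usual workflow, with standard GCC flags and placeholder file names:
gcc -O2 -fprofile-generate main.c -o a.out
./a.out                                # run a representative workload; writes *.gcda data
gcc -O2 -fprofile-use main.c -o a.out  # recompile; branches are laid out from measured counts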

I Think That You've Found A "Bug"
The funny thing is that optimizing for space and not optimizing at all are the only cases in which the "optimal" instruction code is generated: gcc -S [-O0 | -Os] source.c
some_func:
LFB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
jmp L3
L2:
call _bar
L3:
movl $0, %eax
# Or, for -Os:
# xorl %eax, %eax
leave
ret
My point is that ...
some_func:
LFB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
... up to & through the call to foo everything is "optimal", in the traditional sense, regardless of the exit strategy.
Optimality is ultimately determined by the processor, of course.

Related

alternative to mangling jmp_buf in c for a context switch

In the setjmp.h library on Linux systems, jmp_buf is encrypted (pointer-mangled); to decrypt it we use a mangle function:
static long int i64_ptr_mangle(long int p) {
long int ret;
asm(" mov %1, %%rax;\n"
" xor %%fs:0x30, %%rax;"
" rol $0x11, %%rax;"
" mov %%rax, %0;"
: "=r"(ret)
: "r"(p)
: "%rax"
);
return ret;
}
I need to save the context and change the stack pointer, base pointer, and program counter in the jmp_buf. Is there an alternative to this function that I can use? I am trying to build a basic thread library and can't get my head around this. I can't use ucontext.h.
You might as well roll your own version of setjmp/longjmp; even if you reverse engineered that mess, your result will be more fragile than a proper version.
You will need to have a peek at the calling conventions for your environment, but mainly something like:
mov 4(%esp), %eax
mov %ebx, _BX(%eax)
mov %esi, _SI(%eax)
mov %edi, _DI(%eax)
mov %ebp, _BP(%eax)
pushf; pop _FL(%eax)
mov %esp, _SP(%eax)
pop _PC(%eax)
xor %eax,%eax
ret
loadctx:
mov 4(%esp), %edx
mov 8(%esp), %eax
mov _BX(%edx), %ebx
...
push _FL(%edx)
popf
mov _SP(%edx), %esp
jmp _PC(%edx)
Then you define your register layout maybe like:
#define _PC 0
#define _SP 4
#define _FL 8
...
This should work with a dated compiler, like gcc 2.x, as is. More modern compilers have been, uh, enhanced to rely on thread-local storage (TLS) and the like. You may have to add bits to your context.
Another enhancement is stack checking, typically layered on TLS. Even if you disable stack checking, it is possible that libraries you use will rely on it, so you will have to swap the appropriate entries.
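To make that concrete, here is a minimal sketch of how the C side could look; the struct layout, the savectx/loadctx names, and the 32-bit offsets are my own illustration matching the defines above, not part of a real library:
/* offsets must agree with the _PC/_SP/_FL/... defines used by the asm */
typedef struct {
    long pc;   /* _PC = 0 */
    long sp;   /* _SP = 4 */
    long fl;   /* _FL = 8 */
    long bx, si, di, bp;
} ctx_t;

int  savectx(ctx_t *c);          /* the first asm block above: returns 0 on the direct call */
void loadctx(ctx_t *c, int val); /* does not return; makes savectx(c) return val instead */

void yield_to(ctx_t *me, ctx_t *next) {
    if (savectx(me) == 0)        /* save our registers, then... */
        loadctx(next, 1);        /* ...resume the other context */
}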

Trouble understanding this assembly code

I have an exam coming up, and I'm struggling with assembly. I have written some simple C code, gotten its assembly code, and am now trying to comment on the assembly code as practice. The C code:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char const *argv[])
{
int x = 10;
char const* y = argv[1];
printf("%s\n",y );
return 0;
}
Its assembly code:
0x00000000000006a0 <+0>: push %rbp # Creating stack
0x00000000000006a1 <+1>: mov %rsp,%rbp # Saving base of stack into base pointer register
0x00000000000006a4 <+4>: sub $0x20,%rsp # Allocate 32 bytes of space on the stack
0x00000000000006a8 <+8>: mov %edi,-0x14(%rbp) # First argument stored in stackframe
0x00000000000006ab <+11>: mov %rsi,-0x20(%rbp) # Second argument stored in stackframe
0x00000000000006af <+15>: movl $0xa,-0xc(%rbp) # Value 10 stored in x's address in the stackframe
0x00000000000006b6 <+22>: mov -0x20(%rbp),%rax # Second argument stored in return value register
0x00000000000006ba <+26>: mov 0x8(%rax),%rax # ??
0x00000000000006be <+30>: mov %rax,-0x8(%rbp) # ??
0x00000000000006c2 <+34>: mov -0x8(%rbp),%rax # ??
0x00000000000006c6 <+38>: mov %rax,%rdi # Return value copied to 1st argument register - why??
0x00000000000006c9 <+41>: callq 0x560 # printf??
0x00000000000006ce <+46>: mov $0x0,%eax # Value 0 is copied to return register
0x00000000000006d3 <+51>: leaveq # Destroying stackframe
0x00000000000006d4 <+52>: retq # Popping return address, and setting instruction pointer equal to it
Can a friendly soul help me out wherever I have "??" (meaning I don't understand what is happening or I'm unsure)?
0x00000000000006ba <+26>: mov 0x8(%rax),%rax # get argv[1] to rax
0x00000000000006be <+30>: mov %rax,-0x8(%rbp) # move argv[1] to local variable
0x00000000000006c2 <+34>: mov -0x8(%rbp),%rax # move local variable to rax (for move to rdi)
0x00000000000006c6 <+38>: mov %rax,%rdi # now rdi has argv[1]
0x00000000000006c9 <+41>: callq 0x560 # it is puts (optimized)
I will try to make a guess:
mov -0x20(%rbp),%rax # retrieve argv[0]
mov 0x8(%rax),%rax # store argv[1] into rax
mov %rax,-0x8(%rbp) # store argv[1] (which now is in rax) into y
mov -0x8(%rbp),%rax # put y back into rax (which might look dumb, but possibly it has its reasons)
mov %rax,%rdi # copy y to rdi, possibly to prepare the context for the printf
When you deal with assembler, please specify which architecture you are using. An Intel processor might use a different set of instructions from an ARM one, the same instructions might be encoded differently, or they might rely on different assumptions. As you might know, optimisations change the sequence of assembler instructions generated by the compiler, so you might want to specify whether you are using them as well (it looks like you are not?) and which compiler you are using, since each one has its own policy for generating assembler.
Maybe we will never know why the compiler prepares the context for printf by copying from rax; it could be the compiler's choice or an obligation imposed by the specific architecture. For all those annoying reasons, most people prefer to use a "high level language" such as C, so that the set of instructions is always correct, even though it might look very dumb to a human (as we know, computers are dumb by design) and is not always the most efficient choice. That's why there are still many compilers around.
I can give you two more tips:
your IDE must have a way to interleave assembler instructions with C code, and to single-step through the assembler. Try to find it out and explore it yourself
the IDE should also have a function to explore the memory of your program. If you find it, try to enter the 0x560 address and look where it leads you. It is very likely that it will be the entry point of your printf
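If you are not working in an IDE, the same kind of inspection is possible from the command line with standard GCC/binutils tools (generic commands, not tied to your project):
gcc -g -O0 -c main.c
objdump -dS main.o               # disassembly interleaved with the C source lines
gcc -O0 -S -fverbose-asm main.c  # writes main.s with comments describing the operands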
I hope that my answer will help you work it out, good luck

Difference in for loops of old and new GCC's generated assembly code

I am reading a chapter about assembly code, which has an example. Here is the C program:
int main()
{
int i;
for(i=0; i < 10; i++)
{
puts("Hello, world!\n");
}
return 0;
}
Here is the assembly code provided in the book:
0x08048384 <main+0>: push ebp
0x08048385 <main+1>: mov ebp,esp
0x08048387 <main+3>: sub esp,0x8
0x0804838a <main+6>: and esp,0xfffffff0
0x0804838d <main+9>: mov eax,0x0
0x08048392 <main+14>: sub esp,eax
0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>: jle 0x80483a3 <main+31>
0x080483a1 <main+29>: jmp 0x80483b6 <main+50>
0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>: call 0x80482a8 <_init+56>
0x080483af <main+43>: lea eax,[ebp-4]
0x080483b2 <main+46>: inc DWORD PTR [eax]
0x080483b4 <main+48>: jmp 0x804839b <main+23>
Here is part of my version:
0x0000000000400538 <+8>: mov DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>: jmp 0x40054f <main+31>
0x0000000000400541 <+17>: mov edi,0x4005f0
0x0000000000400546 <+22>: call 0x400410 <puts@plt>
0x000000000040054b <+27>: add DWORD PTR [rbp-0x4],0x1
0x000000000040054f <+31>: cmp DWORD PTR [rbp-0x4],0x9
0x0000000000400553 <+35>: jle 0x400541 <main+17>
My question is, why does the book's version assign 0 to the variable (mov DWORD PTR [ebp-4],0x0) and compare right after that with cmp, while my version assigns and then does jmp 0x40054f <main+31>, to where the cmp is?
It seems more logical to assign and compare without any jump, because that is the order inside the for loop.
Why did your compiler do something different than the compiler used in the book? Because it's a different compiler. No two compilers compile all code the same way; even very trivial code can be compiled vastly differently by two different compilers, or even two versions of the same compiler. And it's quite obvious both were compiled without any optimization; with optimization the results would be even more different.
Let's reason about what the for loop does.
for (i = 0; i < 10; i++) {
code;
}
Let's write it a little bit closer to the assembler that the first compiler generated.
i = 0;
start: if (i > 9) goto out;
code;
i++;
goto start;
out:
Now the same thing for "my version":
i = 0;
goto cmp;
start: code;
i++;
cmp: if (i < 10) goto start;
The clear difference here is that in "my version" there will only be one jump executed within the loop while the book version has two. It's a quite common way to generate loops in more modern compilers because of how sensitive CPUs are to branches. Many compilers will generate code like this even without any optimizations because it performs better in most cases. Older compilers didn't do this because either they didn't think about it or this trick was performed in an optimization stage which wasn't enabled when compiling the code in the book.
Notice that a compiler with any kind of optimization enabled wouldn't even do that first goto cmp because it would know that it was unnecessary. Try compiling your code with optimization enabled (you say you use gcc, give it the -O2 flag) and see how vastly different it will look after that.
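For instance, something along these lines lets you compare the two outputs side by side (standard GCC flags; file names are just placeholders):
gcc -O0 -S main.c -o main-O0.s
gcc -O2 -S main.c -o main-O2.s
diff main-O0.s main-O2.s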
You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):
# ... establish stack frame ...
mov DWORD PTR [rbp-4],0x0
cmp DWORD PTR [rbp-4],0x9
jle .L1
jmp .L0
.L1:
mov rdi, .Lconst0
call puts
add DWORD PTR [rbp-0x4],0x1
cmp DWORD PTR [rbp-0x4],0x9
jle .L1
.L0:
# ... return from function ...
GCC has noticed that it can eliminate the initial cmp and jle by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial comparison must be false, hoisted out the address load, placed the loop index in a register, and converted to a count-down loop so it could eliminate the cmp altogether; something like this:
# ... establish stack frame ...
mov ebx, 10
mov r14, .Lconst0
.L1:
mov rdi, r14
call puts
dec ebx
jne .L1
# ... return from function ...
(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)

gcc inline assembly - operand type mismatch for `add', trying to create branchless code

I'm trying to do some code optimization to eliminate branches. The original C code is:
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
So I wrote C inline assembly code like below,
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
When I compile the code I get these errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some stuff in your code; now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8, and you should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine.
And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations, so try other source changes first, e.g. ? : often compiles branchlessly; also, use booleans as integers 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower then -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obviously constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline asm destroys the ability of the compiler to reuse any temporary results, or to spread out the instructions to mix with other compiler-generated code. (Instruction scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple of other versions. (I used unsigned in this version because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing into 64-bit registers; xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding. (k<<1) + (a < b); gives us exactly the same xor/cmp/setb/lea sequence as clang above (but with the Windows x86-64 calling convention):
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3@funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3@funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m forms of CMP. I used two alternatives that split things up not by opcode but by which side included the possible memory operand. ("rme" is like "g" (rmi) but limited to 32-bit sign-extended immediates.)
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 5, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
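A sketch of that idea (my own illustration built on the inlineasm function above, not code from this answer): dispatch on __builtin_constant_p so the pure-C version is used whenever the compiler already knows the values and can fold them:
static inline unsigned long next_k(unsigned long a, unsigned long b, unsigned long k)
{
    if (__builtin_constant_p(k) || (__builtin_constant_p(a) && __builtin_constant_p(b)))
        return (k << 1) + (a < b);   /* pure C: the compiler can use immediates freely */
    __asm__("cmpq %[b], %[a]\n\t"
            "adc %[k], %[k]"         /* k = 2*k + (a < b), via the carry flag */
            : [k] "+r" (k)
            : [a] "r" (a), [b] "rme" (b)
            : "cc");
    return k;
}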
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
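That last point could look something like this (again a sketch of mine, not from the answer): let the compiler compute k<<1, so it can pick lea or add, and keep only the cmp/adc pair in the asm:
unsigned long inlineasm2(unsigned long a, unsigned long b, unsigned long k)
{
    unsigned long k2 = k << 1;        /* the compiler chooses add/lea for the shift */
    __asm__("cmpq %[b], %[a]\n\t"
            "adc $0, %[k2]"           /* add just the carry-out of the compare */
            : [k2] "+r" (k2)
            : [a] "r" (a), [b] "rme" (b)
            : "cc");
    return k2;
}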

How come my array index is faster than pointer

Why is the array index faster than the pointer?
Isn't the pointer supposed to be faster than the array index?
I used time.h clock_t to test the two functions, each looping 2 million times.
Pointer time : 0.018995
Index time : 0.017864
void myPointer(int a[], int size)
{
int *p;
for(p = a; p < &a[size]; p++)
{
*p = 0;
}
}
void myIndex(int a[], int size)
{
int i;
for(i = 0; i < size; i++)
{
a[i] = 0;
}
}
No, pointers are never supposed to be faster than array indexing. If one of the two pieces of code is faster than the other, it's mostly because the address computations are different. The question should also provide information about the compiler and optimization flags, as they can heavily affect the performance.
An array index in your context (where the array bound is not known) is exactly identical to the pointer operation. From a compiler's viewpoint, it is just a different expression of pointer arithmetic. Here is an example of the optimized x86 code from Visual Studio 2010 with full optimization and no inlining.
3: void myPointer(int a[], int size)
4: {
013E1800 push edi
013E1801 mov edi,ecx
5: int *p;
6: for(p = a; p < &a[size]; p++)
013E1803 lea ecx,[edi+eax*4]
013E1806 cmp edi,ecx
013E1808 jae myPointer+15h (13E1815h)
013E180A sub ecx,edi
013E180C dec ecx
013E180D shr ecx,2
013E1810 inc ecx
013E1811 xor eax,eax
013E1813 rep stos dword ptr es:[edi]
013E1815 pop edi
7: {
8: *p = 0;
9: }
10: }
013E1816 ret
13: void myIndex(int a[], int size)
14: {
15: int i;
16: for(i = 0; i < size; i++)
013E17F0 test ecx,ecx
013E17F2 jle myIndex+0Ch (13E17FCh)
013E17F4 push edi
013E17F5 xor eax,eax
013E17F7 mov edi,edx
013E17F9 rep stos dword ptr es:[edi]
013E17FB pop edi
17: {
18: a[i] = 0;
19: }
20: }
013E17FC ret
At a glance, myIndex looks faster because the number of instructions is smaller; however, the two pieces of code are essentially the same. Both eventually use rep stos, x86's repeating (loop) store instruction. The only difference is the computation of the loop bound. The for loop in myIndex has the trip count size as it is (i.e., no computation is needed), but myPointer needs some computation to get the trip count of the for loop. This is the only difference. The important loop operations are just the same, so the difference is negligible.
To summarize, the performance of myPointer and myIndex in an optimized code should be identical.
FYI, if the array's bound is known at compile time, e.g., int A[constant_expression], then the accesses on this array may be much faster than the pointer one. This is mostly because the array accesses are free from the pointer analysis problem. Compilers can perfectly compute the dependency information on computations and accesses on a fixed-size array, so it can do advanced optimizations including automatic parallelization.
However, if the computations are pointer based, compilers must perform pointer analysis for further optimization, and pointer analysis is quite limited in C/C++. It generally ends up with conservative results and leaves few optimization opportunities.
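As a small illustration of the fixed-bound case (my own example, not from the answer above):
#define N 1024
int A[N];                      /* bound known at compile time */

void zero_fixed(void)
{
    int i;
    for (i = 0; i < N; i++)    /* trip count and aliasing are fully known */
        A[i] = 0;              /* typically becomes a memset or vector stores */
}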
Array dereference p[i] is *(p + i). Compilers make use of instructions that do math + dereference in 1 or 2 cycles (e.g. x86 LEA instruction) to optimize for speed.
With the pointer loop, the access and the offset are split into two separate parts, and the compiler cannot optimize it.
It may be the comparison in the for loop that is causing the difference. The termination condition is tested on each iteration, and your "pointer" example has a slightly more complicated termination condition (taking the address of &a[size]). Since &a[size] does not change, you could try setting it to a variable to avoid recalculating it on each iteration of the loop.
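A sketch of that change, using the function from the question with the end pointer computed once:
void myPointer(int a[], int size)
{
    int *p;
    int *end = &a[size];       /* hoisted: computed once instead of every iteration */
    for (p = a; p < end; p++)
    {
        *p = 0;
    }
}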
I would suggest running each loop 200 million times, and then run each loop 10 times, and take the fastest measurement. That will factor out effects from OS scheduling and so on.
I would then suggest you disassemble the code for each loop.
Oops, on my 64-bit system the results are quite different. I found that this
int i;
for(i = 0; i < size; i++)
{
*(a+i) = 0;
}
is about 100 times (!) slower than this
int i;
int * p = a;
for(i = 0; i < size; i++)
{
*(p++) = 0;
}
when compiling with -O3. This hints to me that somehow moving to the next address is far easier for a 64-bit CPU than calculating the destination address from an offset. But I'm not sure.
EDIT: This really has something to do with the 64-bit architecture, because the same code with the same compile flags doesn't show any real difference in performance on a 32-bit system.
Compiler optimizations are pattern matching.
When your compiler optimizes, it looks for known code patterns, and then transforms the code according to some rule. Your two code snippets seem to trigger different transforms, and thus produce slightly different code.
This is one of the reasons why we always insist on actually measuring the resulting performance when it comes to optimizations: you can never be sure what your compiler turns your code into unless you test it.
If you are really curious, try compiling your code with gcc -S -Os; this produces the most readable yet still optimized assembler code. On your two functions, I get the following assembler with that:
pointer code:
.L2:
cmpq %rax, %rdi
jnb .L5
movl $0, (%rdi)
addq $4, %rdi
jmp .L2
.L5:
index code:
.L7:
cmpl %eax, %esi
jle .L9
movl $0, (%rdi,%rax,4)
incq %rax
jmp .L7
.L9:
The differences are slight, but may indeed trigger a performance difference, most importantly the difference between using addq and incq could be significant.
The times are so close together that if you did them repeatedly, you may not see much of a difference. Both code segments compile to the exact same assembly. By definition, there is no difference.
It looks like the index solution can save a few instructions with the compare in the for loop.
Accessing the data through an array index or through a pointer is exactly equivalent. Walk through the program below with me...
There is a loop which runs 100 times, and when we look at the disassembled code, the pointer access uses fewer instructions than the array-index access.
But that doesn't mean that accessing data through a pointer is faster; it really depends on which instructions the compiler emits. Both the pointer and the array-index versions work with an address: the array access computes the element's address from the base plus an offset and increments the index, while the pointer already holds the address.
#include <stdio.h>
void fun1(int a[], int n);
void fun2(int *p, int n);
int main()
{
int a[100];
fun1(a, 100);
fun2(&a[0], 5);
}
void fun1(int a[],int n)
{
int i;
for(i=0;i<=99;i++)
{
a[i]=0;
printf("%d\n",a[i]);
}
}
void fun2(int *p,int n)
{
int i;
for(i=0;i<=99;i++)
{
*p=0;
printf("%d\n",*(p+i));
}
}
disass fun1
Dump of assembler code for function fun1:
0x0804841a <+0>: push %ebp
0x0804841b <+1>: mov %esp,%ebp
0x0804841d <+3>: sub $0x28,%esp
0x08048420 <+6>: movl $0x0,-0xc(%ebp)
0x08048427 <+13>: jmp 0x8048458 <fun1+62>
0x08048429 <+15>: mov -0xc(%ebp),%eax
0x0804842c <+18>: shl $0x2,%eax
0x0804842f <+21>: add 0x8(%ebp),%eax
0x08048432 <+24>: movl $0x0,(%eax)
0x08048438 <+30>: mov -0xc(%ebp),%eax
0x0804843b <+33>: shl $0x2,%eax
0x0804843e <+36>: add 0x8(%ebp),%eax
0x08048441 <+39>: mov (%eax),%edx
0x08048443 <+41>: mov $0x8048570,%eax
0x08048448 <+46>: mov %edx,0x4(%esp)
0x0804844c <+50>: mov %eax,(%esp)
0x0804844f <+53>: call 0x8048300 <printf@plt>
0x08048454 <+58>: addl $0x1,-0xc(%ebp)
0x08048458 <+62>: cmpl $0x63,-0xc(%ebp)
0x0804845c <+66>: jle 0x8048429 <fun1+15>
0x0804845e <+68>: leave
0x0804845f <+69>: ret
End of assembler dump.
(gdb) disass fun2
Dump of assembler code for function fun2:
0x08048460 <+0>: push %ebp
0x08048461 <+1>: mov %esp,%ebp
0x08048463 <+3>: sub $0x28,%esp
0x08048466 <+6>: movl $0x0,-0xc(%ebp)
0x0804846d <+13>: jmp 0x8048498 <fun2+56>
0x0804846f <+15>: mov 0x8(%ebp),%eax
0x08048472 <+18>: movl $0x0,(%eax)
0x08048478 <+24>: mov -0xc(%ebp),%eax
0x0804847b <+27>: shl $0x2,%eax
0x0804847e <+30>: add 0x8(%ebp),%eax
0x08048481 <+33>: mov (%eax),%edx
0x08048483 <+35>: mov $0x8048570,%eax
0x08048488 <+40>: mov %edx,0x4(%esp)
0x0804848c <+44>: mov %eax,(%esp)
0x0804848f <+47>: call 0x8048300 <printf@plt>
0x08048494 <+52>: addl $0x1,-0xc(%ebp)
0x08048498 <+56>: cmpl $0x63,-0xc(%ebp)
0x0804849c <+60>: jle 0x804846f <fun2+15>
0x0804849e <+62>: leave
0x0804849f <+63>: ret
End of assembler dump.
(gdb)
This is a very hard thing to time, because compilers are very good at optimising these things. Still, it's better to give the compiler as much information as possible; that's why in this case I'd advise using std::fill and letting the compiler choose.
But... If you want to get into the detail
a) CPUs normally give pointer+offset addressing for free, like: mov r1, r2(r3).
b) This means an index operation requires just one extra instruction: mul r3,r1,size.
That is just one cycle extra per loop iteration.
c) CPUs often provide stall/delay slots, meaning you can often hide single-cycle operations.
All in all, even if your loops are very large, the cost of the access is nothing compared to the cost of even a few cache misses. You are best advised to optimise your structures before you care about loop costs. Try, for example, packing your structures to reduce the memory footprint first.
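As a small illustration of that packing advice (my own example; the sizes assume a typical 64-bit ABI):
struct loose {
    char tag;      /* 1 byte + 7 bytes of padding before 'value' */
    double value;  /* 8 bytes */
    int id;        /* 4 bytes + 4 bytes of tail padding -> 24 bytes total */
};

struct tight {     /* largest members first -> 16 bytes total */
    double value;
    int id;
    char tag;
};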
