Why are more x86 instructions faster than less? [duplicate] - c

This question already has answers here:
Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?
(3 answers)
How to load a single byte from address in assembly
(1 answer)
Why doesn't GCC use partial registers?
(3 answers)
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
(2 answers)
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
(1 answer)
Closed 1 year ago.
So I've been reading up about the what goes on inside x86 processors for about half a year now. So I decided to try my hand at x86 assembly for fun, starting only with 80386 instructions to keep it simple. (I'm trying to learn mostly, not optimize)
I have a game I made a few months ago coded in C, so I went there and rewrote the bitmap blitting function from scratch with assembly code. What I don't get is that the main pixel plotting body of the loop is faster with the C code (which is 18 instructions) than my assembly code (which is only 7 instructions, and I'm almost 100% certain it doesn't straddle cache line boundaries).
So my main question is why do 18 instructions take less time than the 7 instructions?
At the bottom I have the 2 code snippets.
PS. Each color is 8 bit indexed.
C Code:
{
for (x = 0; x < src.w; x++)
00D35712 mov dword ptr [x],0 // Just initial loop setup
00D35719 jmp Renderer_DrawBitmap+174h (0D35724h) // Just initial loop setup
00D3571B mov eax,dword ptr [x]
00D3571E add eax,1
00D35721 mov dword ptr [x],eax
00D35724 mov eax,dword ptr [x]
00D35727 cmp eax,dword ptr [ebp-28h]
00D3572A jge Renderer_DrawBitmap+1BCh (0D3576Ch)
{
*dest_pixel = renderer_trans[renderer_light[*src_pixel][light]][*dest_pixel][trans];
// Start of what I consider the body
00D3572C mov eax,dword ptr [src_pixel]
00D3572F movzx ecx,byte ptr [eax]
00D35732 mov edx,dword ptr [light]
00D35735 movzx eax,byte ptr renderer_light (0EDA650h)[edx+ecx*8]
00D3573D shl eax,0Bh
00D35740 mov ecx,dword ptr [dest_pixel]
00D35743 movzx edx,byte ptr [ecx]
00D35746 lea eax,renderer_trans (0E5A650h)[eax+edx*8]
00D3574D mov ecx,dword ptr [dest_pixel]
00D35750 mov edx,dword ptr [trans]
00D35753 mov al,byte ptr [eax+edx]
00D35756 mov byte ptr [ecx],al
dest_pixel++;
00D35758 mov eax,dword ptr [dest_pixel]
00D3575B add eax,1
00D3575E mov dword ptr [dest_pixel],eax
src_pixel++;
00D35761 mov eax,dword ptr [src_pixel]
00D35764 add eax,1
00D35767 mov dword ptr [src_pixel],eax
// End of what I consider the body
}
00D3576A jmp Renderer_DrawBitmap+16Bh (0D3571Bh)
And the assembly code I wrote:
(esi is the source pixel, edi is the screen buffer, edx is the light level, ebx is the transparency level, and ecx is the width of this row)
drawing_loop:
00C55682 movzx ax,byte ptr [esi]
00C55686 mov ah,byte ptr renderer_light (0DFA650h)[edx+eax*8]
00C5568D mov al,byte ptr [edi]
00C5568F mov al,byte ptr renderer_trans (0D7A650h)[ebx+eax*8]
00C55696 mov byte ptr [edi],al
00C55698 inc esi
00C55699 inc edi
00C5569A loop drawing_loop (0C55682h)
// This isn't just the body this is the full row plotting loop just like the code above there
And for context, the pixel is lighted with a LUT and the transparency is done also with a LUT.
Pseudo C code:
//transparencyLUT[new][old][transparency level (0 = opaque, 7 = full transparency)]
//lightLUT[color][light level (0 = white, 3 = no change, 7 = full black)]
dest_pixel = transparencyLUT[lightLUT[source_pixel][light]]
[screen_pixel]
[transparency];
What gets me is how I use pretty much the same instructions the C code does, but just less of them?
If you need more info I'll be happy to give more, I just don't want this to be a huge question. I'm just genuinely curious because I'm sorta new to x86 assembly programming and want to learn more about how our cpus actually work.
My only guess is that the out of order execution engine doesn't like my code because its all memory accesses moving to the same register.

Not all instructions take the same time, modern implementations of the CPU can execute (parts of) some instructions in parallel (as long as one doesn't read the data written by the previous one and the required units don't collide). Latest versions do translate the "machine" instructions into lower level, very simple ones, that are scheduled on the fly to be executed on the various units in the CPU in parallel as much as possible, using a whole lot of shadow registers (i.e., one instruction can be using the value in one copy of %eax (old value) after another instruction writes a new value into another copy of %eax (new value), thus decoupling instructions even more. The hoops they jump through for performance's sake...

Related

What is happening in this disassembled code, and what would it look like in C?

I've disassembled this c code (using ida), and ran across this bit of code. I believe the second line is an array, as well as the 5th line, but I'm not sure why it uses a sign extend or a zero extend.
I need to convert the code to C, and I'm not sure why the sign/zero extend is used, or what C code would cause that.
mov ecx, [ebp+var_58]
mov dl, byte ptr [ebp+ecx*2+var_28]
mov [ebp+var_59], dl
mov eax, [ebp+var_58]
movsx ecx, [ebp+eax*2+var_20]
movzx edx, [ebp+var_59]
or edx, ecx
mov [ebp+var_59], dl
unsigned integer types will be zero-extended, while signed types will be sign-extended.
I kinda want to downvote this as too trivial. It's not like there's anything going on that the instruction reference manual doesn't cover. I guess it's different from asking for an explanation of a really simple C program because the trick here is understanding why one might string this sequence of instructions together, rather than just what each one does individually. Being familiar with the idioms used by non-optimizing compilers (store and reload from RAM after every statement) helps.
I'm guessing this is a snippet from inside a function that makes a stack frame, so positive offsets from ebp are where local variables are spilled when they're not live in registers.
mov ecx, [ebp+var_58] ; load var58 into ecx
mov dl, byte ptr [ebp+ecx*2+var_28] ; load a byte from var28[2*var58]
mov [ebp+var_59], dl ; store it to var59
mov eax, [ebp+var_58] ; load var58 again for some reason? can var59 alias var58?
; otherwise we still have the value in ecx, right?
; Or is this non-optimizing compiler output that's really annoying to read?
movsx ecx, [ebp+eax*2+var_20] ; load var20[var58*2]
movzx edx, [ebp+var_59] ; load var59 again
or edx, ecx ; edx = var59|var20[var58*2]
mov [ebp+var_59], dl ; spill var59 back to memory
I guess the default operand size for movsx/movzx is byte-to-dword. word-to-dword also exists, and I'm surprised your disassembler didn't disambiguate with a byte ptr on the memory operand. I'm inferring that it's a byte load because the preceding store to that address was byte-wide.
movsx is used when loading signed data that's smaller than 32b. C's integer-promotion rules dictate that operations on integer types smaller than int are automatically promoted to int (or unsigned int if int can't represent all values. e.g. if unsigned short and unsigned int are the same size).
8bit or 32bit operand sizes are available without operand-size prefix bytes. Some only Intel P6/SnB family CPUs track partial-register dependencies, sign-extending to a full register width on loads can make for faster code (avoiding false dependencies on the previous contents of the register on AMD and Silvermont). So sign or zero extending (as appropriate for the data type) on loads is often the best way to handle narrow memory locations.
Looking at the output of non-optimizing compilers is not usually worth bothering with.
If the code had been generated by a proper optimizing compiler, it would probably be more like
mov ecx, [ebp+var_58] ; var58 is live in ecx
mov al, byte ptr [ebp+ecx*2+var_28] ; var59 = var28[2*var58]
or al, [ebp+ecx*2+var_20] ; var59 |= var20[var58*2]
mov [ebp+var_59], al ; spill var59 to memory
Much easier to read, IMO, without the noise of constantly storing/reloading. You can see when a value is used multiple times without having to notice that a load was from an address that was just stored to.
If a false dependency on the upper 24 bits of eax was causing a problem, we could use movzx or movsx loads into two registers, and do an or r32, r32 like the original, but then still store the low 8. (Using a 32bit or with a memory operand would do a 4B load, not a 1B load, which could cross a cache line or even a page and segfault.)

looping over an array in NASM

I want to learn programming in assembly to write fast and efficient code.
How ever I stumble over a problem I can't solve.
I want to loop over an array of double words and add its components like below:
%include "asm_io.inc"
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
string1 db "result: ",0
array dd 1, 2, 3, 4, 5
segment .bss
segment .text
global sum
sum:
prologue
mov rdi, string1
call print_string
mov rbx, array
mov rdx, 0
mov ecx, 5
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
mov rdi, rdx
call print_int
call print_nl
epilogue
Sum is called by a simple C-driver. The functions print_string, print_int and print_nl look like this:
section .rodata
int_format db "%i",0
string_format db "%s",0
section .text
global print_string, print_nl, print_int, read_int
extern printf, scanf, putchar
print_string:
prologue
; string address has to be passed in rdi
mov rsi,rdi
mov rdi,dword string_format
xor rax,rax
call printf
epilogue
print_nl:
prologue
mov rdi,0xA
xor rax,rax
call putchar
epilogue
print_int:
prologue
;integer arg is in rdi
mov rsi, rdi
mov rdi, dword int_format
xor rax,rax
call printf
epilogue
When printing the result after summing all array elements it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array.
Can somebody tell me why th loop skips the first element?
Edit
I forgot to mention that I'm using a x86_64 Linux system
I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.
If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);
Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.
It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.
So, on to your loop:
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).
When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.
;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4] ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi] ; load first element
mov ebx, [rsi+4] ; load 2nd element
add rsi, 8 ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4
ALIGN 16
.loop:
add eax, [ rsi]
add ebx, [4 + rsi]
add rsi, 8
cmp rsi, rdx
jb .loop ; loop while rsi is Below one-past-the-end
; TODO: handle odd-length
add eax, ebx
ret
Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but the execution ports will get behind on keeping up with them. Unrolling to minimize loop overhead would help.)
All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.
segment .rodata ; read-only data
ALIGN 16
array: times 64 dd 1, 2, 3, 4, 5
array_bytes equ $-array
string1 db "result: ",0
segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes] ; array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register. We don't need rdx until later.
movdqu xmm0, [rsi - 32] ; load first element
movdqu xmm1, [rsi - 16] ; load 2nd element
; note the more-efficient loop setup that doesn't need an add rsi, 32.
ALIGN 16
.loop:
paddd xmm0, [ rsi] ; add packed dwords
paddd xmm1, [16 + rsi]
add rsi, 32
cmp rsi, rdx
jb .loop ; loop: 4 fused-domain uops
paddd xmm0, xmm1
phaddd xmm0, xmm0 ; horizontal add: SSSE3 phaddd is simple but not optimal. Better to pshufd/paddd
phaddd xmm0, xmm0
movd eax, xmm0
; TODO: scalar cleanup loop
ret
Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration does two * four packed ints = 32bytes of input data.
It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.
With AVX2 vpadd ymm1, [32+rsi], you get double the throughput (if the data is in the cache, otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0,xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.

A few assembly instructions

Could you please help me understand the purpose of the two assembly instructions in below ? (for more context, assembly + C code at the end). Thanks !
movzx edx,BYTE PTR [edx+0xa]
mov BYTE PTR [eax+0xa],dl
===================================
Assembly code below:
push ebp
mov ebp,esp
and esp,0xfffffff0
sub esp,0x70
mov eax,gs:0x14
mov DWORD PTR [esp+0x6c],eax
xor eax,eax
mov edx,0x8048520
lea eax,[esp+0x8]
mov ecx,DWORD PTR [edx]
mov DWORD PTR [eax],ecx
mov ecx,DWORD PTR [edx+0x4]
mov DWORD PTR [eax+0x4],ecx
movzx ecx,WORD PTR [edx+0x8]
mov WORD PTR [eax+0x8],cx
movzx edx,BYTE PTR [edx+0xa] ; instruction 1
mov BYTE PTR [eax+0xa],dl ; instruction 2
mov edx,DWORD PTR [esp+0x6c]
xor edx,DWORD PTR gs:0x14
je 804844d <main+0x49>
call 8048320 <__stack_chk_fail#plt>
leave
ret
===================================
C source code below (without libraries inclusion):
int main() {
char str_a[100];
strcpy(str_a, "eeeeefffff");
}
It inlined the strcpy() call, the code generator can tell that 11 bytes need to be copied. The string literal "eeeeefffff" has 10 characters, one extra for the zero terminator.
The code optimizer unrolled the copy loop to 4 moves, moving 4 + 4 + 2 + 1 bytes. It needs to be done this way because there is no processor instruction that moves 3 bytes. The instructions you are asking about copy the 11th byte. Using movzx is a bit overkill but it is probably faster than loading the DL register.
Observe the changes in the generated code when you alter the string. Adding an extra letter should unroll to 3 moves, 4 + 4 + 4. When the string gets too long you ought to see it fall back to something like memmove.

What is a (working) example of Conditional mov (cmov) example? [duplicate]

I have this simple binary search member function, where lastIndex, nIter and xi are class members:
uint32 scalar(float z) const
{
uint32 lo = 0;
uint32 hi = lastIndex;
uint32 n = nIter;
while (n--) {
int mid = (hi + lo) >> 1;
// defining this if-else assignment as below cause VS2015
// to generate two cmov instructions instead of a branch
if( z < xi[mid] )
hi = mid;
if ( !(z < xi[mid]) )
lo = mid;
}
return lo;
}
Both gcc and VS 2015 translate the inner loop with a code flow branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?
If not, can anybody suggest how to write a gcc assembler template for this?
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
Thanks
As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, btver2, btver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors compared to a 1-cycle latency on AMD. Additionally, while CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV with a single macro-operation since their design has no such limits on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win—not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason—GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EMT64), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
if (i & 1) power *= xp;
xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
Upgrade to a later version of GCC, like 7.1, that generate conditional moves in 64-bit builds even when targeting Intel processors.
Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
uint32 result = (value1 ^ value2); // get bits that differ between the two values
result &= mask; // select based on condition
result ^= value2; // condition ? value1 : value2
return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
This does fail to result in CMOV instructions back on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.
As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.
The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]\n"
"cmovb %[mid],%[hi]\n"
"cmovae %[mid],%[lo]\n"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
The used constraints are:
hi and lo are into the "write" variables list, with +r constraint as cmov can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
xi[mid] and z are in the "read" list;
z has the special x constraint that means "any SSE register" (required for ucomiss first operand);
xi[mid] has xm, as the second ucomiss operand allows a memory operator; given the choice between z and xi[mid], I chose the last one as a better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;
cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.

Dive into the assembly

Function in c:
PHPAPI char *php_pcre_replace(char *regex, int regex_len,
char *subject, int subject_len,
zval *replace_val, int is_callable_replace,
int *result_len, int limit, int *replace_count TSRMLS_DC)
{
pcre_cache_entry *pce; /* Compiled regular expression */
/* Compile regex or get it from cache. */
if ((pce = pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC)) == NULL) {
return NULL;
}
....
}
Its assembly:
php5ts!php_pcre_replace:
1015db70 8b442408 mov eax,dword ptr [esp+8]
1015db74 8b4c2404 mov ecx,dword ptr [esp+4]
1015db78 56 push esi
1015db79 8b74242c mov esi,dword ptr [esp+2Ch]
1015db7d 56 push esi
1015db7e 50 push eax
1015db7f 51 push ecx
1015db80 e8cbeaffff call php5ts!pcre_get_compiled_regex_cache (1015c650)
1015db85 83c40c add esp,0Ch
1015db88 85c0 test eax,eax
1015db8a 7502 jne php5ts!php_pcre_replace+0x1e (1015db8e)
php5ts!php_pcre_replace+0x1c:
1015db8c 5e pop esi
1015db8d c3 ret
The c function call pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC) corresponds to 1015db7d~1015db80 which pushes the 3 parameters to the stack and call it.
But my doubt is,among so many registers,how does the compiler decide to use eax,ecx and esi(this is special,as it's restored before using,why?) as the intermediate to carry to the stack?
There must be some hidden indication in c that tells the compiler to do it this way,right?
No, there is no hidden indication.
This is a typical strategy for generating 80x86 instructions used by many compiler implementations, C and otherwise. For example, the 1980s Intel Fortran-77 compiler, when optimization was turned on, did the same thing.
That is uses eax and ecx preferentially is probably an artifact of avoiding use of esi and edi since those registers cannot directly be used to load byte operands.
Why not ebx and edx? Well, those are preferred by many code generators for holding intermediate pointers in evaluating complex structure evaluation, which is to say, there isn't much reason at all. The compiler just looked for two available registers to use and overwrote them to buffer the values.
Why not reuse eax like this?:
push esi
mov eax,dword ptr [esp+2Ch]
push eax
mov eax,dword ptr [esp+8]
push eax
mov eax,dword ptr [esp+4]
push eax
Because that causes pipeline stalls waiting for eax to complete previous memory cycles, in 80x86s since the 80586 (maybe 80486—it's too long ago to be sure off the top of my head).
The x86 architecture is a strange beast. Each register, though promoted as being "general purpose" by Intel, has its quirks (cx/ecx is tied to the loop instruction for example, and eax:edx is tied to the multiply instruction). That combined with the peculiar ways to optimize execution to avoid cache misses and pipeline stalls often leads to inscrutable generated code by a code generator which factors all that in.

Resources