I am trying to optimize a search through a very short sorted array of doubles to locate the bucket a given value belongs to. Assuming the size of the array is 8 doubles, I have come up with the following sequence of AVX intrinsics:
_data = _mm256_load_pd(array);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);
_data = _mm256_load_pd(array+4);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);
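For reference, here is what a self-contained version of that idea might look like (just a sketch: the function name, the broadcast of value, and the assumption that the array is 32-byte aligned are mine, not part of the original code):
#include <immintrin.h>
#include <stdint.h>
// Count how many of the 8 sorted doubles are less than value; that count is
// the 1-based bucket position, matching the intrinsic sequence above.
static inline uint32_t find_bucket_avx(const double *array, double value)
{
    __m256d v  = _mm256_set1_pd(value);        // broadcast the search value
    __m256d lo = _mm256_load_pd(array);        // elements 0..3 (array assumed 32-byte aligned)
    __m256d hi = _mm256_load_pd(array + 4);    // elements 4..7
    int m0 = _mm256_movemask_pd(_mm256_cmp_pd(lo, v, _CMP_LT_OQ));
    int m1 = _mm256_movemask_pd(_mm256_cmp_pd(hi, v, _CMP_LT_OQ));
    return _mm_popcnt_u32((unsigned)m0) + _mm_popcnt_u32((unsigned)m1);
}
(Compile with -mavx -mpopcnt or an equivalent -march setting.)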
To my surprise (I do not have the instruction latency specs in my head..), it turned out that gcc generates faster code for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the values are uniformly distributed, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in the general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; i++) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using an additional rcx register for addressing in vmovsd.
I can manually modify the assembly for the loop to look something like in the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I cannot seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do this other than by using inline asm? Or even better - can you suggest a faster implementation of the search?
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results.
I tested on Windows 7 x64, not Linux, but the generated code was very similar.
How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to heat up the CPU - activate turbo.
4) I called Sleep(0) to avoid a context switch
5) I called __rdtscp to start measurement
6) In a loop I called either the AVX find-index function or the simple C version, like you did; the other implementation was commented out and not used. Loop size was 10M calls.
7) I called __rdtscp again to finish the benchmark.
8) I printed ticks/iterations to get the average tick count per call (a sketch of this harness follows below).
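Roughly, steps 5-8 could look like the sketch below; find_index stands for whichever implementation is under test and ITERS for the 10M loop count, both names being placeholders of mine:
#include <stdio.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      // __rdtscp on MSVC
#else
#include <x86intrin.h>   // __rdtscp on gcc/clang
#endif
#define ITERS 10000000ULL            // 10M calls, as in the test description
static volatile uint32_t sink;       // keep each result alive so the call isn't optimized away
static double avg_ticks(uint32_t (*find_index)(const double*, double),
                        const double *array, double value)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);                 // step 5: start measurement
    for (uint64_t i = 0; i < ITERS; ++i)
        sink = find_index(array, value);             // step 6: function under test
    uint64_t end = __rdtscp(&aux);                   // step 7: end measurement
    return (double)(end - start) / (double)ITERS;    // step 8: average ticks per call
}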
Note: I declared both 'find index' functions as inline and I confirmed in the disassembly that they got inlined.
The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a 1-based index.
On my system, it took the AVX function 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower.
You can see that the AVX function works in constant time while the trivial C loop's performance depends on the index you're looking for.
Timing with clock() and running 100M iterations yields similar results; AVX is almost 4x faster for the first test.
Also note that running longer tests reveal different results, but every time AVX holds a similar advantage.
You can try integer comparison. For finite, non-negative doubles, comparing the raw bits as int64_t gives the same ordering as comparing the doubles themselves (NaNs are excluded, and negative values would compare in reverse order). It could turn out to be faster: the CPU has more integer execution units than SIMD ones. Just pass the double* and treat it as int64_t* inside the function.
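As an illustration of that suggestion (my sketch, with the caveat that the bit-pattern trick only preserves ordering for non-negative, non-NaN doubles):
#include <stdint.h>
#include <string.h>
// Count elements of a sorted array of non-negative, non-NaN doubles that are
// strictly less than value, comparing raw IEEE-754 bit patterns as int64_t.
static uint32_t find_bucket_bits(const double *array, uint32_t n, double value)
{
    int64_t v;
    memcpy(&v, &value, sizeof v);          // reinterpret the bits without aliasing issues
    for (uint32_t i = 0; i < n; ++i) {
        int64_t e;
        memcpy(&e, &array[i], sizeof e);
        if (e >= v)                        // same result as array[i] >= value for values >= 0
            return i;
    }
    return n;
}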
Related
I wrote a simple multiplication function in C, and another in assembly code, using GCC's "asm" keyword.
I took the execution time for each of them, and although their times are pretty close, the C function is a little faster than the one in assembly code.
I would like to know why, since I expected the asm one to be faster. Is it because of the extra "call" (I don't know what word to use) to GCC's "asm" keyword?
Here is the C function:
int multiply (int a, int b){return a*b;}
And here is the asm one in the C file:
int asmMultiply(int a, int b){
asm ("imull %1,%0;"
: "+r" (a)
: "r" (b)
);
return a;
}
my main where I take the times:
int main(){
int n = 50000;
clock_t asmClock = clock();
while(n>0){
asmMultiply(4,5);
n--;
}
asmClock = clock() - asmClock;
double asmTime = ((double)asmClock)/CLOCKS_PER_SEC;
clock_t cClock = clock();
n = 50000;
while(n>0){
multiply(4,5);
n--;
}
cClock = clock() - cClock;
double cTime = ((double)cClock)/CLOCKS_PER_SEC;
printf("Asm time: %f\n",asmTime);
printf("C code time: %f\n",cTime);
Thanks!
The assembly function is doing more work than the C function — it's initializing mult, then doing the multiplication and assigning the result to mult, and then pushing the value from mult into the return location.
Compilers are good at optimizing; you won't easily beat them on basic arithmetic.
If you really want improvement, use static inline int multiply(int a, int b) { return a * b; }. Or just write a * b (or the equivalent) in the calling code instead of int x = multiply(a, b);.
This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.
Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than what the compiler generates for *.
(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away. (The asm is not asm volatile, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the clock() overhead.
And even if they weren't, a single imul instruction is basically unmeasurable with a function with as much overhead as clock(). Maybe if you serialized with lfence to force the CPU to wait for imul to retire, before rdtsc... See RDTSCP in NASM always returns the same value
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C * operator vs. inline asm without some kind of context involving a loop. And then it will be for that context, dependent on what optimizations you defeated by using inline asm. (And what if anything you did to stop the compiler from optimizing away work for the pure C version.)
Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar out-of-order pipelined, so the sum of costs for 2 instructions depends on whether they're dependent on each other, and other surrounding context. How many CPU cycles are needed for each assembly instruction?
The stand-alone definitions of the functions are identical, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5*4 = 20 at compile time, so if you did use the result multiply(4,5) could optimize to an immediate 20. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)
So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant and the other is a runtime variable. Many small integer multipliers can be implemented with one or two LEA instructions or a shift (with lower latency than the 3c for imul on modern x86).
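For example (my illustration, not from the question), a multiply by a small constant typically needs no imul at all:
// gcc/clang typically compile this to a single LEA: lea eax, [rdi + rdi*4]
int times5(int x) { return x * 5; }
// Forcing imul via inline asm would pin this to imul's ~3-cycle latency and
// also block constant folding when x happens to be known at compile time.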
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm can't let you use a different template for an immediate vs. a register arg. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like asm("%2, %1, %0" :...) and that can't work for reg,reg.
You could use if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization. (i.e. it's so niche that in practice you wouldn't ever do this.)
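A sketch of that __builtin_constant_p idea (hypothetical and untested; it needs optimization enabled, at least -O1, so the dead branch containing the "i" constraint is discarded when b is not a compile-time constant):
static inline int mul_force_imul_imm(int a, int b)
{
    if (__builtin_constant_p(b)) {
        int result;
        // imul r32, r/m32, imm32: copy-and-multiply in one instruction
        asm("imull %2, %1, %0" : "=r"(result) : "r"(a), "i"(b));
        return result;
    }
    return a * b;   // otherwise let the compiler pick LEA / shifts / imul as it sees fit
}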
Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:
# clang7.0 -O3 (The functions both inline and optimize away)
main: # #main
push rbx
sub rsp, 16
call clock
mov rbx, rax # save the return value
call clock
sub rax, rbx # end - start time
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp + 8], xmm0 # 8-byte Spill
call clock
mov rbx, rax
call clock
sub rax, rbx # same block again for the 2nd group.
xorps xmm0, xmm0
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp], xmm0 # 8-byte Spill
mov edi, offset .L.str
mov al, 1
movsd xmm0, qword ptr [rsp + 8] # 8-byte Reload
call printf
mov edi, offset .L.str.1
mov al, 1
movsd xmm0, qword ptr [rsp] # 8-byte Reload
call printf
xor eax, eax
add rsp, 16
pop rbx
ret
TL:DR: if you want to understand inline asm performance on this fine-grained level of detail, you need to understand how compilers optimize in the first place.
How to remove "noise" from GCC/clang assembly output?
C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I have a counter in an ISR (which is triggered by an external IRQ every 50us). The counter increments and wraps around at MAX_VAL (240).
I have the following code:
if(condition){
counter++;
counter %= MAX_VAL;
doStuff(table[counter]);
}
I am considering an alternative implementation:
if(condition){
//counter++;//probably I would increment before the comparison in production code
if(++counter >= MAX_VAL){
counter=0;
}
doStuff(table[counter]);
}
I know people recommend against trying to optimize like this, but it made me wonder. On x86, what is faster? What value of MAX_VAL would justify the second implementation?
This gets called about every 50us so reducing the instruction count is not a bad idea. The if(++counter >= MAX_VAL) would be predicted false, so it would remove the assignment to 0 in the vast majority of cases. For my purposes I'd prefer the consistency of the %= implementation.
As #RossRidge says, the overhead will mostly be lost in the noise of servicing an interrupt on a modern x86 (probably at least 100 clock cycles, and many many more if this is part of a modern OS with Meltdown + Spectre mitigation set up).
If MAX_VAL is a power of 2, counter %= MAX_VAL is excellent, especially if counter is unsigned: then it's just a simple AND, or in this case a movzx from byte to dword, which can have zero latency on Intel CPUs. (It still has a throughput cost of course: Can x86's MOV really be "free"? Why can't I reproduce this at all?)
Is it possible to fill the last 256-240 = 16 entries with something harmless, or repeats of something?
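If that padding is acceptable, the wrap check disappears entirely; a minimal sketch (the names table256 and isr_pow2 are mine):
#include <stdint.h>
extern int table256[256];           // table[] padded from 240 up to 256 entries
void doStuff(int);
void isr_pow2(int condition)
{
    static uint8_t counter = 0;     // uint8_t wraps 255 -> 0 for free
    if (condition) {
        ++counter;                  // the reload of counter becomes a movzx
        doStuff(table256[counter]);
    }
}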
As long as MAX_VAL is a compile-time constant, though, counter %= MAX_VAL will compile efficiently to just a couple multiplies, shift, and adds. (Again, more efficient for unsigned.) Why does GCC use multiplication by a strange number in implementing integer division?
But a check for wrap-around is even better. A branchless check (using cmov) has lower latency than the remainder using a multiplicative inverse, and costs fewer uops for throughput.
As you say, a branchy check can take the check off the critical path entirely, at the cost of a mispredict sometimes.
// simple version that works exactly like your question
// further optimizations assume that counter isn't used by other code in the function,
// e.g. making it a pointer or incrementing it for the next iteration
void isr_countup(int condition) {
static unsigned int counter = 0;
if(condition){
++counter;
counter = (counter>=MAX_VAL) ? 0 : counter; // gcc uses cmov
//if(counter >= MAX_VAL) counter = 0; // gcc branches
doStuff(table[counter]);
}
}
I compiled many versions of this on the Godbolt compiler explorer, with recent gcc and clang.
(For more about static performance analysis of throughput and latency for short blocks of x86 asm, see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, and other links in the x86 tag wiki, especially Agner Fog's guides.)
clang uses branchless cmov for both versions. I compiled with -fPIE in case you're using that in your kernels. If you can use -fno-pie, then the compiler can save an LEA and use mov edi, [table + 4*rcx], assuming you're on a target where static addresses in position-dependent code fit in 32-bit sign-extended constants (e.g. true in the Linux kernel, but I'm not sure if they compile with -fPIE or do kernel ASLR with relocations when the kernel is loaded.)
# clang7.0 -O3 -march=haswell -fPIE.
# gcc's output is the same (with different registers), but it uses `mov edx, 0` instead of xor-zeroing for no good reason: the zeroing happens before the cmp that sets the flags, so clobbering flags with xor would have been fine
isr_countup: # #isr_countup
test edi, edi
je .LBB1_1 # if condition is false
mov eax, dword ptr [rip + isr_countup.counter]
add eax, 1 # counter++
xor ecx, ecx
cmp eax, 239 # set flags from comparing counter with MAX_VAL-1
cmovbe ecx, eax # ecx = (counter <= MAX_VAL-1) ? counter : 0
mov dword ptr [rip + isr_countup.counter], ecx # store the updated counter
lea rax, [rip + table]
mov edi, dword ptr [rax + 4*rcx] # index the table
jmp doStuff@PLT # TAILCALL
.LBB1_1:
ret
The block of 8 instructions starting at the load of the old counter value is a total of 8 uops (on AMD, or Intel Broadwell and later, where cmov is only 1 uop). The critical-path latency from counter being ready to table[++counter % MAX_VAL] being ready is 1 (add) + 1 (cmp) + 1 (cmov) + load-use latency for the load. i.e. 3 extra cycles. That's the latency of 1 mul instruction. Or 1 extra cycle on older Intel where cmov is 2 uops.
By comparison, the version using modulo is 14 uops for that block with gcc, including a 3-uop mul r32. The latency is at least 8 cycles, I didn't count exactly. (For throughput it's only a little bit worse, though, unless the higher latency reduces the ability of out-of-order execution to overlap execution of stuff that depends on the counter.)
Other optimizations
Use the old value of counter, and prepare a value for next time (taking the calculation off the critical path.)
Use a pointer instead of a counter. Saves a couple instructions, at the cost of using 8 bytes instead of 1 or 4 of cache footprint for the variable. (uint8_t counter compiles nicely with some versions, just using movzx to 64-bit).
This counts upward, so the table can be in order. It increments after loading, taking that logic off the critical path dependency chain for out-of-order execution.
void isr_pointer_inc_after(int condition) {
static int *position = table;
if(condition){
int tmp = *position;
position++;
position = (position >= table + MAX_VAL) ? table : position;
doStuff(tmp);
}
}
This compiles really nicely with both gcc and clang, especially if you're using -fPIE so the compiler needs the table address in a register anyway.
# gcc8.2 -O3 -march=haswell -fPIE
isr_pointer_inc_after(int):
test edi, edi
je .L29
mov rax, QWORD PTR isr_pointer_inc_after(int)::position[rip]
lea rdx, table[rip+960] # table+MAX_VAL
mov edi, DWORD PTR [rax] #
add rax, 4
cmp rax, rdx
lea rdx, -960[rdx] # table, calculated relative to table+MAX_VAL
cmovnb rax, rdx
mov QWORD PTR isr_pointer_inc_after(int)::position[rip], rax
jmp doStuff(int)@PLT
.L29:
ret
Again, 8 uops (assuming cmov is 1 uop). This has even lower latency than the counter version possibly could, because the [rax] addressing mode (with RAX coming from a load) has 1 cycle lower latency than an indexed addressing mode, on Sandybridge-family. With no displacement, it never suffers the penalty described in Is there a penalty when base+offset is in a different page than the base?
Or (with a counter) count down towards zero: potentially saves an instruction if the compiler can use flags set by the decrement to detect the value becoming negative. Or we can always use the decremented value, and do the wrap around after that: so when counter is 1, we'd use table[--counter] (table[0]), but store counter=MAX_VAL. Again, takes the wrap check off the critical path.
If you wanted a branch version, you'd want it to branch on the carry flag, because sub eax,1 / jc can macro-fuse into 1 uop, but sub eax,1 / js can't macro-fuse on Sandybridge-family.
x86_64 - Assembly - loop conditions and out of order. But with branchless, it's fine. cmovs (mov if sign flag set, i.e. if the last result was negative) is just as efficient as cmovc (mov if carry flag is set).
It was tricky to get gcc to use the flag results from dec or sub without also doing a cdqe to sign-extend the index to pointer width. I guess I could use intptr_t counter, but that would be silly; might as well just use a pointer. With an unsigned counter, gcc and clang both want to do another cmp eax, 239 or something after the decrement, even though flags are already set just fine from the decrement. But we can get gcc to use SF by checking (int)counter < 0:
// Counts downward, table[] entries need to be reversed
void isr_branchless_dec_after(int condition) {
static unsigned int counter = MAX_VAL-1;
if(condition){
int tmp = table[counter];
--counter;
counter = ((int)counter < 0) ? MAX_VAL-1 : counter;
//counter = (counter >= MAX_VAL) ? MAX_VAL-1 : counter;
//counter = (counter==0) ? MAX_VAL-1 : counter-1;
doStuff(tmp);
}
}
# gcc8.2 -O3 -march=haswell -fPIE
isr_branchless_dec_after(int):
test edi, edi
je .L20
mov ecx, DWORD PTR isr_branchless_dec_after(int)::counter[rip]
lea rdx, table[rip]
mov rax, rcx # stupid compiler, this copy is unneeded
mov edi, DWORD PTR [rdx+rcx*4] # load the arg for doStuff
mov edx, 239 # calculate the next counter value
dec eax
cmovs eax, edx
mov DWORD PTR isr_branchless_dec_after(int)::counter[rip], eax # and store it
jmp doStuff(int)@PLT
.L20:
ret
still 8 uops (should be 7), but no extra latency on the critical path. So all of the extra decrement and wrap instructions are juicy instruction-level parallelism for out-of-order execution.
I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.
The obvious method is of course:
x = b%6; // 8 insns
y = b/6%6; // 13 insns
z = b/36; // 5 insns
The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.
Another method (using a different encoding) is:
b *= 6
x = b>>8;
b &= 255;
b *= 6
y = b>>8;
b &= 255;
b *= 6
z = b>>8;
This encoding has more than one representation for many tuples (it uses the whole 8-bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.
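For completeness, one way to produce that encoding (my own sketch, not from the question): treat b as an 8-bit fixed-point fraction in base 6, roughly (x/6 + y/36 + z/216) * 256 rounded up, so that each multiply-by-6 step pushes the next digit into the top byte.
#include <stdint.h>
// Pack three base-6 digits x, y, z (each 0..5) so the decode sequence above
// (b *= 6; digit = b >> 8; b &= 255; ...) recovers x, then y, then z.
// Any b with n/216 <= b/256 < (n+1)/216 works, where n = x*36 + y*6 + z.
static uint8_t pack_base6(unsigned x, unsigned y, unsigned z)
{
    unsigned n = x * 36 + y * 6 + z;            // 0..215
    return (uint8_t)((n * 256 + 215) / 216);    // ceil(n * 256 / 216), fits in 0..255
}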
Are there better approaches?
Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.
As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):
struct { uint8_t vals[3]; } LUT[216];
unsigned decode_LUT(uint8_t b, unsigned selector) {
return LUT[b].vals[selector];
}
gcc7 -O3 on Godbolt for x86-64 and AArch64
movzx edi, dil
mov esi, esi # zero-extension to 64-bit: goes away when inlining.
lea rax, LUT[rdi+rdi*2] # multiply by 3 and add the base
movzx eax, BYTE PTR [rax+rsi] # then index by selector
ret
Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).
This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.
In PIC / PIE code, this costs 2 extra instructions, unfortunately:
movzx edi, dil
lea rax, LUT[rip] # RIP-relative LEA instead of absolute as part of another addressing mode
mov esi, esi
lea rdx, [rdi+rdi*2]
add rax, rdx
movzx eax, BYTE PTR [rax+rsi]
ret
But that's still cheap, and all the ALU instructions are single-cycle latency.
Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer: b * ((6ULL*6*6<<32) + (36<<16) + 6).
But the upper byte of each multiply result does depend on masking back to 8-bit after each multiply by 6. (If you can think of a way to not require that, one multiple and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register).
Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.
add eax, eax ; *2
lea eax, [rax + rax*2] ; *3
movzx ecx, al ; 0 cycle latency on Intel
.. repeat for next steps
ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.
As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).
This is in fact what gcc emits if you write it this way:
unsigned decode_ALU(uint8_t b, unsigned selector) {
uint8_t decoded[3];
uint32_t tmp = b * 6;
decoded[0] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[1] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[2] = tmp >> 8;
return decoded[selector];
}
movzx edi, dil
mov esi, esi
lea eax, [rdi+rdi*2]
add eax, eax
mov BYTE PTR -3[rsp], ah # store high half of mul-by-6
movzx eax, al # costs 1 cycle: gcc doesn't know about zero-latency movzx?
lea eax, [rax+rax*2]
add eax, eax
mov BYTE PTR -2[rsp], ah
movzx eax, al
lea eax, [rax+rax*2]
shr eax, 7
mov BYTE PTR -1[rsp], al
movzx eax, BYTE PTR -3[rsp+rsi]
ret
The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.
So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.
Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:
(b * (int[]){2048,342,57}[i] >> 11) % 6;
where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.
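Spelled out as a function (same constants, just wrapped for clarity; the function name is mine):
#include <stdint.h>
// Decode digit i (0..2) of the packed byte b, where b = x*36 + y*6 + z, b in 0..215.
// 2048/2^11 = 1, 342/2^11 ~ 1/6, 57/2^11 ~ 1/36, so one shared shift replaces the divisions;
// the final %6 is only needed for i == 0 and i == 1 but is cheap enough to do unconditionally.
static unsigned decode_digit(uint8_t b, unsigned i)
{
    static const uint16_t mult[3] = { 2048, 342, 57 };
    return ((unsigned)(b * mult[i]) >> 11) % 6;
}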
On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.
How to coax the GCC compiler to emit the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler compiles a C code in a certain way, but I'd like to coax it anyway for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>
int string_equal(const char *s) {
return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
.string "test1"
string_equal:
mov rsi, rdi
mov ecx, 6
mov edi, OFFSET FLAT:.LC0
repz cmpsb
setne al
movzx eax, al
ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
return 0 == memcmp(s, "test1", 6);
}
cmp DWORD PTR [rdi], 1953719668 # 0x74736574
je .L8
.L5:
mov eax, 1
xor eax, 1 # missed optimization here after the memcmp pattern; should just xor eax,eax
ret
.L8:
xor eax, eax
cmp WORD PTR [rdi+4], 49 # check last 2 bytes
jne .L5
xor eax, 1
ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work. No change in asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.
I have this simple binary search member function, where lastIndex, nIter and xi are class members:
uint32 scalar(float z) const
{
uint32 lo = 0;
uint32 hi = lastIndex;
uint32 n = nIter;
while (n--) {
int mid = (hi + lo) >> 1;
// defining this if-else assignment as below causes VS2015
// to generate two cmov instructions instead of a branch
if( z < xi[mid] )
hi = mid;
if ( !(z < xi[mid]) )
lo = mid;
}
return lo;
}
Both gcc and VS 2015 translate the inner loop into code with a branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?
If not, can anybody suggest how to write a gcc assembler template for this?
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
Thanks
As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, bdver2, btver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors before Broadwell, compared to a 1-cycle latency on AMD. Additionally, while CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on those Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV as a single macro-operation since their design has no such limit on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win, but not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason: GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EM64T), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
if (i & 1) power *= xp;
xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
Upgrade to a later version of GCC, like 7.1, that generates conditional moves in 64-bit builds even when targeting Intel processors.
Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
uint32 result = (value1 ^ value2); // get bits that differ between the two values
result &= mask; // select based on condition
result ^= value2; // condition ? value1 : value2
return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
This does fail to produce CMOV instructions on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.
As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.
The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]\n"
"cmovb %[mid],%[hi]\n"
"cmovae %[mid],%[lo]\n"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
The used constraints are:
hi and lo are in the "write" variables list, with the +r constraint, since cmov can only work with registers as target operands and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
xi[mid] and z are in the "read" list;
z has the special x constraint that means "any SSE register" (required because one comiss operand must be an XMM register);
xi[mid] has xm, as the other comiss operand allows a memory operand; given the choice between z and xi[mid], I chose the latter as the better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;
cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.
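Dropped into the original member function, the template would be used roughly like this (untested sketch, using the same types as the question):
uint32 scalar(float z) const
{
    uint32 lo = 0;
    uint32 hi = lastIndex;
    uint32 n = nIter;
    while (n--) {
        int mid = (hi + lo) >> 1;
        __asm__ (
            "comiss %[xi_mid],%[z]\n\t"
            "cmovb  %[mid],%[hi]\n\t"     // z <  xi[mid]  ->  hi = mid
            "cmovae %[mid],%[lo]\n\t"     // z >= xi[mid]  ->  lo = mid
            : [hi] "+r"(hi), [lo] "+r"(lo)
            : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
            : "cc"
        );
    }
    return lo;
}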