I am investigating the effect of vectorization on the performance of the program. In this regard, I have written the following code:
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#define LEN 10000000
int main(){
    struct timeval stTime, endTime;

    double* a = (double*)malloc(LEN*sizeof(*a));
    double* b = (double*)malloc(LEN*sizeof(*b));
    double* c = (double*)malloc(LEN*sizeof(*c));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    gettimeofday(&stTime, NULL);

    for(k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];

    gettimeofday(&endTime, NULL);

    FILE* fh = fopen("dump", "w");

    for(k = 0; k < LEN; k++)
        fprintf(fh, "c[%d] = %f\t", k, c[k]);

    fclose(fh);

    double timeE = (double)(endTime.tv_usec + endTime.tv_sec*1000000 - stTime.tv_usec - stTime.tv_sec*1000000);

    printf("Time elapsed: %f\n", timeE);

    return 0;
}
In this code, I am simply initializing and multiplying two vectors. The results are saved in vector c. What I am mainly interested in is the effect of vectorizing the following loop:
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
I compile the code using the following two commands:
1) icc -O2 TestSMID.c -o TestSMID -no-vec -no-simd
2) icc -O2 TestSMID.c -o TestSMID -vec-report2
I expect to see performance improvement since the second command successfully vectorizes the loop. However, my studies show that there is no performance improvement when the loop is vectorized.
I may have missed something here since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.
Thanks in advance for your help.
PS: I am using Mac OS X, so there is no need to align the data, as all the allocated memory is 16-byte aligned.
Edit:
I would like to first thank you all for your comments and answers.
I thought about the answer proposed by @Mysticial and there are some further points that should be mentioned here.
Firstly, as @Vinska mentioned, c[k] = a[k] * b[k] does not take only one cycle. In addition to the loop index increment and the comparison that ensures k is smaller than LEN, there are other things to be done to perform the operation. Looking at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like:
L_B1.9: # Preds L_B1.8
movq %r13, %rax #25.5
andq $15, %rax #25.5
testl %eax, %eax #25.5
je L_B1.12 # Prob 50% #25.5
# LOE rbx r12 r13 r14 r15 eax
L_B1.10: # Preds L_B1.9
testb $7, %al #25.5
jne L_B1.32 # Prob 10% #25.5
# LOE rbx r12 r13 r14 r15
L_B1.11: # Preds L_B1.10
movsd (%r14), %xmm0 #26.16
movl $1, %eax #25.5
mulsd (%r15), %xmm0 #26.23
movsd %xmm0, (%r13) #26.9
# LOE rbx r12 r13 r14 r15 eax
L_B1.12: # Preds L_B1.11 L_B1.9
movl %eax, %edx #25.5
movl %eax, %eax #26.23
negl %edx #25.5
andl $1, %edx #25.5
negl %edx #25.5
addl $10000000, %edx #25.5
lea (%r15,%rax,8), %rcx #26.23
testq $15, %rcx #25.5
je L_B1.16 # Prob 60% #25.5
# LOE rdx rbx r12 r13 r14 r15 eax
L_B1.13: # Preds L_B1.12
movl %eax, %eax #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.14: # Preds L_B1.14 L_B1.13
movups (%r15,%rax,8), %xmm0 #26.23
movsd (%r14,%rax,8), %xmm1 #26.16
movhpd 8(%r14,%rax,8), %xmm1 #26.16
mulpd %xmm0, %xmm1 #26.23
movntpd %xmm1, (%r13,%rax,8) #26.9
addq $2, %rax #25.5
cmpq %rdx, %rax #25.5
jb L_B1.14 # Prob 99% #25.5
jmp L_B1.20 # Prob 100% #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.16: # Preds L_B1.12
movl %eax, %eax #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.17: # Preds L_B1.17 L_B1.16
movsd (%r14,%rax,8), %xmm0 #26.16
movhpd 8(%r14,%rax,8), %xmm0 #26.16
mulpd (%r15,%rax,8), %xmm0 #26.23
movntpd %xmm0, (%r13,%rax,8) #26.9
addq $2, %rax #25.5
cmpq %rdx, %rax #25.5
jb L_B1.17 # Prob 99% #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.18: # Preds L_B1.17
mfence #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.19: # Preds L_B1.18
mfence #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.20: # Preds L_B1.14 L_B1.19 L_B1.32
cmpq $10000000, %rdx #25.5
jae L_B1.24 # Prob 0% #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.22: # Preds L_B1.20 L_B1.22
movsd (%r14,%rdx,8), %xmm0 #26.16
mulsd (%r15,%rdx,8), %xmm0 #26.23
movsd %xmm0, (%r13,%rdx,8) #26.9
incq %rdx #25.5
cmpq $10000000, %rdx #25.5
jb L_B1.22 # Prob 99% #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.24: # Preds L_B1.22 L_B1.20
And the non-vectorized version is:
L_B1.9: # Preds L_B1.8
xorl %eax, %eax #25.5
# LOE rbx r12 r13 r14 r15 eax
L_B1.10: # Preds L_B1.10 L_B1.9
lea (%rax,%rax), %edx #26.9
incl %eax #25.5
cmpl $5000000, %eax #25.5
movsd (%r15,%rdx,8), %xmm0 #26.16
movsd 8(%r15,%rdx,8), %xmm1 #26.16
mulsd (%r13,%rdx,8), %xmm0 #26.23
mulsd 8(%r13,%rdx,8), %xmm1 #26.23
movsd %xmm0, (%rbx,%rdx,8) #26.9
movsd %xmm1, 8(%rbx,%rdx,8) #26.9
jb L_B1.10 # Prob 99% #25.5
# LOE rbx r12 r13 r14 r15 eax
Besides this, the processor does not load only 24 bytes. In each access to memory, a full line (64 bytes) is loaded. More importantly, since the memory required for a, b, and c is contiguous, the prefetcher definitely helps a lot by loading the next blocks in advance.
Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.
Moreover, using SIMD to improve the performance of a program for a very simple addition is mentioned in the Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this very simple loop.
Edit2:
Thanks again for your comments. Also, thanks to @Mysticial's sample code, I finally saw the effect of SIMD on performance. The problem, as Mysticial mentioned, was memory bandwidth. By choosing small sizes for a, b, and c that fit into the L1 cache, it can be seen that SIMD can improve performance significantly. Here are the results that I got:
icc -O2 -o TestSMIDNoVec -no-vec TestSMID2.c: 17.34 sec
icc -O2 -o TestSMIDVecNoUnroll -vec-report2 TestSMID2.c: 9.33 sec
And unrolling the loop improves the performance even further:
icc -O2 -o TestSMIDVecUnroll -vec-report2 TestSMID2.c -unroll=8: 8.6 sec
Also, I should mention that it takes only one cycle for my processor to complete an iteration when compiled with -O2.
PS: My computer is a MacBook Pro Core i5 @ 2.5 GHz (dual core)
This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date.
See the end of this answer for the 2017 update.
Original Answer (2013):
Because you're bottlenecked by memory bandwidth.
While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.
In your example:
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
You are making a single pass over all the memory doing very little work. This is maxing out your memory bandwidth.
So regardless of how it's optimized (vectorized, unrolled, etc.), it isn't going to get much faster.
A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*. Your loop touches 24 bytes/iteration.
Without vectorization, a modern x64 processor can probably do about 1 iteration a cycle*.
Suppose you're running at 4 GHz:
(4 * 10^9) * 24 bytes/iteration = 96 GB/s
That's almost 10x your memory bandwidth - without vectorization.
*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well, those were off the top of my head from experience. So here are some benchmarks to prove it.
The loop iteration can run as fast as 1 cycle/iteration:
We can get rid of the memory bottleneck if we reduce LEN so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)
#include <iostream>
#include <cstdlib>   // malloc, rand
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 256;

    double *a = (double*)malloc(LEN*sizeof(*a));
    double *b = (double*)malloc(LEN*sizeof(*a));
    double *c = (double*)malloc(LEN*sizeof(*a));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    clock_t time0 = clock();

    for (int i = 0; i < 100000000; i++){
        for(k = 0; k < LEN; k++)
            c[k] = a[k] * b[k];
    }

    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 6.55 seconds
In this test, I ran 25,600,000,000 iterations in only 6.55 seconds.
6.55 * 4.2 GHz = 27,510,000,000 cycles
27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration
Now if you're wondering how it's possible to do:
2 loads
1 store
1 multiply
increment counter
compare + branch
all in one cycle...
It's because modern processors and compilers are awesome.
While each of these operations have latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine is a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput only when using 256b AVX loads/stores, otherwise only two total memory ops per cycle (at most one store)).
Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.
Memory bandwidth is on the order of 10 GB/s:
The easiest way to test this is via a memset():
#include <iostream>
#include <cstdlib>   // calloc
#include <cstring>   // memset
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 1 << 30;    //  1 GB

    char *a = (char*)calloc(LEN,1);

    clock_t time0 = clock();

    for (int i = 0; i < 100; i++){
        memset(a,0xff,LEN);
    }

    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 5.811 seconds
So it takes my machine 5.811 seconds to write to 100 GB of memory. That's about 17.2 GB/s.
And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.
Update March 2017:
As of 2017, things have gotten more complicated.
Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved - and there are more of them.
To put it mathematically:
Each core has a bandwidth limit X.
Main memory has a bandwidth limit of Y.
On older systems, X > Y.
On current high-end systems, X < Y. But X * (# of cores) > Y.
Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz
No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~17 GB/s
Vectorized SSE* (16-byte load/stores): X = 64 GB/s and Y = ~17 GB/s
Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz
No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~70 GB/s
Vectorized AVX* (32-byte load/stores): X = 64 GB/s and Y = ~70 GB/s
(For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)
So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that limit of X. But you will still hit the main memory bandwidth limit of Y with 2 or more threads.
But one thing hasn't changed and probably won't change for a long time: You will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.
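If you want to see that multi-thread effect on your own machine, a minimal way is to spread the original multiply loop over threads, e.g. with OpenMP. This is only a sketch (the function name and the -fopenmp build flag are assumptions, and the scaling you observe depends entirely on your memory subsystem):
#include <omp.h>

// Sketch: the same bandwidth-bound multiply, spread over threads with OpenMP.
// Build with something like: gcc -O2 -fopenmp bw.c
// A single thread may hit its per-core limit X; adding threads should approach
// the main-memory limit Y and then stop scaling.
void multiply(const double *a, const double *b, double *c, long len)
{
    #pragma omp parallel for
    for (long k = 0; k < len; k++)
        c[k] = a[k] * b[k];
}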
As Mysticial already described, main-memory bandwidth limitations are the bottleneck for large buffers here. The way around this is to redesign your processing to work in chunks that fit in the cache. (Instead of multiplying a whole 200MiB of doubles, multiply just 128kiB, then do something with that. So the code that uses the output of the multiply will find it still in L2 cache. L2 is typically 256kiB, and is private to each CPU core, on recent Intel designs.)
This technique is called cache blocking, or loop tiling. It might be tricky for some algorithms, but the payoff is the difference between L2 cache bandwidth vs. main memory bandwidth.
If you do this, make sure the compiler isn't still generating streaming stores (movnt...). Those writes bypass the caches to avoid polluting it with data that won't fit. The next read of that data will need to touch main memory.
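As a rough illustration of what that might look like for this multiply (a sketch only: BLOCK, process(), and consume() are made-up names standing in for whatever your real code does with the products, and the block size is something you would tune to your cache):
#define BLOCK (16 * 1024)   /* 16K doubles = 128 KiB, comfortably under a 256 KiB L2 */

static double consume(const double *c, long n)   /* placeholder for whatever uses the products */
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += c[i];
    return s;
}

double process(const double *a, const double *b, double *c, long len)
{
    double result = 0.0;
    for (long start = 0; start < len; start += BLOCK) {
        long end = (start + BLOCK < len) ? start + BLOCK : len;
        for (long k = start; k < end; k++)
            c[k] = a[k] * b[k];                     /* this block of c[] is now hot in L2 */
        result += consume(c + start, end - start);  /* use it before it gets evicted */
    }
    return result;
}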
EDIT: Modified the answer a lot. Also, please disregard most of what I wrote before about Mystical's answer not being entirely correct.
Though, I still do not agree that it is bottlenecked by memory: despite doing a very wide variety of tests, I couldn't see any signs of the original code being bound by memory speed. Meanwhile it kept showing clear signs of being CPU-bound.
There can be many reasons. And since the reason[s] can be very hardware-dependent, I decided I shouldn't speculate based on guesses.
I am just going to outline the things I encountered during later testing, where I used a much more accurate and reliable CPU-time measuring method and looped the loop 1000 times. I believe this information could be of help. But please take it with a grain of salt, as it's hardware dependent.
When using instructions from the SSE family, the vectorized code I got was over 10% faster than the non-vectorized code.
Vectorized code using the SSE family and vectorized code using AVX ran with more or less the same performance.
When using AVX instructions, non-vectorized code ran the fastest - 25% or more faster than everything else I tried.
Results scaled linearly with CPU clock in all cases.
Results were hardly affected by memory clock.
Results were considerably affected by memory latency - much more than memory clock, but not nearly as much as CPU clock affected the results.
WRT Mystical's example of running nearly 1 iteration per clock - I didn't expect the CPU scheduler to be that efficient and was assuming 1 iteration every 1.5-2 clock ticks. But to my surprise, that is not the case; I sure was wrong, sorry about that. My own CPU ran it even more efficiently - 1.048 cycles/iteration. So I can attest to this part of Mystical's answer to be definitely right.
Just in case a[], b[], and c[] are fighting for the L2 cache:
#include <string.h> /* for memcpy */
...
gettimeofday(&stTime, NULL);
for(k = 0; k < LEN; k += 4) {
    double a4[4], b4[4], c4[4];
    memcpy(a4, a+k, sizeof a4);
    memcpy(b4, b+k, sizeof b4);
    c4[0] = a4[0] * b4[0];
    c4[1] = a4[1] * b4[1];
    c4[2] = a4[2] * b4[2];
    c4[3] = a4[3] * b4[3];
    memcpy(c+k, c4, sizeof c4);
}
gettimeofday(&endTime, NULL);
Reduces the running time from 98429.000000 to 67213.000000;
unrolling the loop 8-fold reduces it to 57157.000000 here.
Related
In C, is indexing an array faster than the ?: operator?
For example, would (const int[]){8, 14}[N > 10] be faster than N > 10 ? 14 : 8?
Stick with the ternary operator:
It is simpler
It is fewer characters to type
It is easier to read and understand
It is more maintainable
It is likely not the main bottleneck in your application
For the CPU it is a simple comparison
Compilers are clever: if the array solution were faster, compilers would already generate the same code for both variants
Mandatory quote (emphasis mine):
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%
— Donald Knuth • https://wiki.c2.com/?PrematureOptimization
Now that's out of the way, let's compare what compilers actually produce.
#include <stdlib.h>
int ternary(int n) { return n > 10 ? 14 : 8; }
int array(int n) { return (const int[]){8, 14}[n > 10]; }
Compile with (g)cc 10.2.1 on Ubuntu with optimizations enabled:
$ cc -O3 -S -fno-stack-protector -fno-asynchronous-unwind-tables ternary.c
-S stops after compilation and does not assemble. You will end up with a .s file which contains the generated assembly code. (The -fno… flags disable additional code generation that is not required for our example.)
ternary.s assembly code, lines unrelated to the methods removed:
ternary:
endbr64
cmpl $10, %edi
movl $8, %edx
movl $14, %eax
cmovle %edx, %eax
ret
array:
endbr64
movq .LC0(%rip), %rax
movq %rax, -8(%rsp)
xorl %eax, %eax
cmpl $10, %edi
setg %al
movl -8(%rsp,%rax,4), %eax
ret
.LC0:
.long 8
.long 14
If you compare them, you will notice a lot more instructions for the array version: 6 instructions vs. 4 instructions.
There is no reason to write the more complicated code which every developer has to read twice; the shorter, straightforward code compiles to more efficient machine code.
Use of the compound literal (and an array in general) will be much less efficient, as the arrays are created (by current real-world compilers) regardless of the optimization level. Worse, they're created on the stack, rather than just indexing static constant data (which would still be slower, or at least higher latency, than an ALU select operation like x86 cmov or AArch64 csel, which most modern ISAs have).
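For reference, the static-constant-data variant alluded to above would look something like this (a sketch; it avoids the stack copy, but still does a load instead of a register select):
/* Indexing a static const table instead of a stack compound literal. */
static const int table[2] = {8, 14};

int array_static(int n) { return table[n > 10]; }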
I have tested it using all compilers I use (including Keil and IAR) and some I don't use (icc and clang).
int foo(int N)
{
return (const int[]){8, 14}[N > 10];
}
int bar(int N)
{
return N > 10? 14 : 8;
}
foo:
mov rax, QWORD PTR .LC0[rip] # load 8 bytes from .rodata
mov QWORD PTR [rsp-8], rax # store both elements to the stack
xor eax, eax # prepare a zeroed reg for setcc
cmp edi, 10
setg al # materialize N>10 as a 0/1 integer
mov eax, DWORD PTR [rsp-8+rax*4] # index the array with it
ret
bar:
cmp edi, 10
mov edx, 8 # set up registers with both constants
mov eax, 14
cmovle eax, edx # ALU select operation on FLAGS from CMP
ret
.LC0:
.long 8
.long 14
https://godbolt.org/z/qK65Gv
So, I'm trying to get familiar with assembly and trying to reverse-engineer some code. My problem lies in trying to decode addq, which I understand performs Source + Destination = Destination.
I am using the assumptions that parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The return value is stored in %rax.
long someFunc(long x, long y, long z){
1. long temp=(x-z)*x;
2. long temp2= (temp<<63)>>63;
3. long temp3= (temp2 ^ x);
4. long answer=y+temp3;
5. return answer;
}
So far, everything above line 4 is exactly what I want. However, line 4 gives me leaq (%rsi,%rdi), %rax rather than addq %rsi, %rax. I'm not sure if this is something I am doing wrong, but I am looking for some insight.
Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.
lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general don't expect C operators to compile to instruction named after them. Especially small left-shifts and add, or multiply by 3, 5, or 9, because those are prime targets for optimization with LEA. e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
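If you want to see this for yourself, a pair of trivial functions is enough to feed to gcc or clang at -O2, e.g. on Godbolt (a sketch; the exact registers and instruction choice will vary by compiler and version):
/* Each of these typically compiles to a single lea at -O2:
 *   y*3  ->  lea (reg,reg,2), dst        a+b  ->  lea (reg,reg), dst   */
long times3(long y)       { return y * 3; }
long sum2(long a, long b) { return a + b; }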
Assuming you meant t3 to be the same variable as temp3, clang does compile the way you were expecting, doing a better job of register allocation so it can use a shorter and more efficient add instruction without any extra mov instructions, instead of needing lea.
Clang chooses to do better register allocation than GCC so it can just use add instead of needing lea for the last instruction. (Godbolt). This saves code-size (because of the indexed addressing mode), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shifts into andl $1, %eax / negq %rax to create the 0 or -1 result of that arithmetic right shift = bit-broadcast. It also optimized to 32-bit operand-size for the first few steps because the shifts throw away all but the low bit of temp1.
# side by side comparison, like the Godbolt diff pane
clang: | gcc:
movl %edi, %eax movq %rdi, %rax
subl %edx, %eax subq %rdx, %rdi
imull %edi, %eax imulq %rax, %rdi # temp1
andl $1, %eax salq $63, %rdi
negq %rax sarq $63, %rdi # temp2
xorq %rdi, %rax xorq %rax, %rdi # temp3
addq %rsi, %rax leaq (%rdi,%rsi), %rax # answer
retq ret
Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.
Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
Other missed optimizations
GCC and clang could have optimized by using and instead of imul to set the low bit only if both inputs are odd.
Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (Odd +- even = odd. Even +- even = even. Odd +- odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.
lea (%rdi,%rdx), %eax
and %edi, %eax # low bit matches (x-z)*x
andl $1, %eax # keep only the low bit
negq %rax # temp2
Lets make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:
# truth table for low bit: input to shifts that broadcasts this to all bits
x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
 0  |  0  |     0     |        0
 0  |  1  |     1     |        0
 1  |  0  |     1     |        1
 1  |  1  |     0     |        0
x & (~z) = BMI1 andn
So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1).
We can rearrange that to -((x&1) & ~z) which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do operations on it and shorten the critical path from x -> answer, at the expense of z.
Or with a BMI1 andn instruction which does (~z) & x, we can do this in one instruction. (Plus another to isolate the low bit)
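In C terms, that rewrite might look like the following (a sketch of a source-level equivalent, not anything a compiler actually emitted; the function name is made up):
/* Same behaviour as the original someFunc, using the identity derived above:
 * the low bit of (x-z)*x equals (x & ~z) & 1. */
long someFunc_bits(long x, long y, long z)
{
    long temp2 = -((x & ~z) & 1);   /* 0 or -1, replacing the <<63 >>63 broadcast */
    return y + (temp2 ^ x);         /* temp3 = x or ~x, then answer = y + temp3 */
}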
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one possibility you should wish your compiler emitted:
# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
not %edx # ~z
and $1, %edx
and %edi, %edx # x&1 & ~z = low bit of temp1
neg %rdx # temp2 = 0 or -1
xor %rdi, %rdx # temp3 = x or ~x
lea (%rsi, %rdx), %rax # answer = y + temp3
ret
So there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z.
Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.
I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov. Possibly by involving the 2's complement identity that -x - 1 = ~x. Maybe improve ILP / latency by doing x+y and then correcting that with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
# return y + x or y + (~x) according to the condition on x and z
someFunc:
lea (%rsi, %rdi), %rax # y + x
andn %edi, %edx, %ecx # ecx = x & (~z)
not %rdi # ~x
add %rsi, %rdi # y + (~x)
test $1, %cl
cmovnz %rdi, %rax # select between y+x and y+~x
retq
This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.
The other function could be 5 uops using BMI andn.
In this version, the first 3 instructions can all run in the first cycle, assuming x,y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking integer inputs from LEA, ADD, and flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version. (Assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain and throughput is all that matters.
vs. 5 (andn) or 6 cycles (without) for the previous attempt. Or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).
I'm doing an experiment of profiling the time it takes to compute a single sqrt in C code. I have two strategies.
One is direct measurement of a single sqrt call and the other is to execute sqrt multiple times in a for loop and then compute the average. The C code is very simple and is shown as follows:
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi, atof */
#include <math.h>    /* sqrt */

long long readTSC(void);

int main(int argc, char** argv)
{
    int n = atoi(argv[1]);
    //v is input of sqrt() making sure compiler won't
    //precompute the result of sqrt(v) if v is constant
    double v = atof(argv[2]);
    long long tm;    //track CPU clock cycles
    double x;        //result of sqrt()

    //-- strategy I ---
    tm = readTSC();  //A function that uses the rdtsc instruction to get the number of clock cycles from an Intel CPU
    x = sqrt(v);
    tm = readTSC() - tm;
    printf("x=%15.6e\n",x);  //make sure compiler won't optimize out the above sqrt()
    printf("%lld clocks\n",tm);

    double sum = 0.0;
    int i;

    //-- strategy II --
    tm = readTSC();
    for ( i = 0; i < n; i++ )
        sum += sqrt((double) i);
    tm = readTSC() - tm;
    printf("%lld clocks\n",tm);
    printf("%15.6e\n",sum);

    return 0;
}
long long readTSC(void)
{
    /* read the time stamp counter on Intel x86 chips */
    union { long long complete; unsigned int part[2]; } ticks;
    __asm__ ("rdtsc; mov %%eax,%0; mov %%edx,%1"
             : "=mr" (ticks.part[0]),
               "=mr" (ticks.part[1])
             : /* no inputs */
             : "eax", "edx");
    return ticks.complete;
}
Before running the code, I expected that the timing result of strategy I might be slightly smaller than the one from strategy II, because strategy II also counts the overhead incurred by the for loop and the sum addition.
I use the following command, without -O3 optimization, to compile my code on an Intel Xeon E5-2680 2.7 GHz machine.
gcc -o timing -lm timing.c
However, the result shows that strategy I takes about 40 clock cycles while strategy II takes an average of 21.8 clock cycles, almost half of the former.
For your reference, I have also pasted the related assembly code below with some comments. It occurs to me, based on the timing result, that each for-loop iteration executes two sqrt()'s. But I can hardly tell from the assembly code how the CPU could actually execute two sqrt() calls in parallel.
call atof
cvtsi2ss %eax, %xmm0
movss %xmm0, -36(%rbp)
//-- timing single sqrt ---
call readTSC
movq %rax, -32(%rbp)
movss -36(%rbp), %xmm1
cvtps2pd %xmm1, %xmm1
//--- sqrtsd instruction
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L8
je .L4
.L8:
movapd %xmm1, %xmm0
//--- C function call sqrt()
call sqrt
.L4:
movsd %xmm0, -72(%rbp)
movq -72(%rbp), %rax
movq %rax, -24(%rbp)
call readTSC
//-- end of timing single sqrt ---
subq -32(%rbp), %rax
movq %rax, -32(%rbp)
movl $.LC0, %eax
movsd -24(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
movl $.LC1, %eax
movq -32(%rbp), %rdx
movq %rdx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
movq %rax, -16(%rbp)
call readTSC
//-- start of for loop----
movq %rax, -32(%rbp)
movl $0, -4(%rbp)
jmp .L5
.L6:
//(double) i
cvtsi2sd -4(%rbp), %xmm0
//-- C function call sqrt()
call sqrt
movsd -16(%rbp), %xmm1
//add sqrt(i) to sum (%xmm0)
addsd %xmm1, %xmm0
movsd %xmm0, -16(%rbp)
//i++
addl $1, -4(%rbp)
.L5:
movl -4(%rbp), %eax
//check i<n
cmpl -40(%rbp), %eax
jl .L6
//-- end of for loop--
//you can skip the rest of the part.
call readTSC
subq -32(%rbp), %rax
movq %rax, -32(%rbp)
movl $.LC1, %eax
movq -32(%rbp), %rdx
movq %rdx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
movl $.LC3, %eax
movsd -16(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
The E5-2680 is a Sandy Bridge CPU, and both the latency and the reciprocal throughput of SQRTSD are in the range of 10 to 21 cycles/instruction. So, in a loop or not, you should measure something close to the observed 21.8 cycles. The sqrt function in GLIBC simply checks the sign of the argument and arranges for the non-negative branch to be executed speculatively via branch prediction; that branch is a call to __ieee754_sqrt, which itself is a simple inline assembly routine that on x86-64 systems emits sqrtsd %xmm0, %xmm0.
The CPU uses register renaming to handle data dependency. Thus it could have two copies of sqrtsd %xmm0, %xmm0 at different stages of execution in the pipeline. Since the result of sqrt is not needed immediately, other instructions could be executed while sqrt is being processed and that's why you measure only 21.8 cycles on average.
As for the larger value in the first case, RDTSC does not have the resolution of a single cycle. It has a certain latency, so you are basically measuring T_code_block + T_rdtsc_latency. In the second scenario, averaging over the iterations gives:
(T_code_block * n_iters + T_rdtsc_latency) / n_iters =
= T_code_block + (T_rdtsc_latency / n_iters)
For large n_iters, the second term vanishes and you get a very accurate measurement of a single iteration.
One has to be very careful when benchmarking with RDTSC. The TSC itself ticks on modern CPUs at the reference clock speed. If the loop runs for long enough, it could trigger the core clock boost mode and the CPU will run faster, therefore one core clock cycle will correspond to less than one reference clock cycle. As a result, it will appear that instructions executed in boosted regions take fewer cycles than instructions executed in regions of nominal clock frequency.
Also, when performing cycle-accurate measurements, always pin the process to a single CPU core, either using the taskset utility or the sched_setaffinity(2) syscall. The OS scheduler usually moves processes around to different cores in order to keep them equally loaded, and that is an expensive process. The probability that this happens during the execution of a small region of several instructions is very low, but for long loops it is much higher. Averaging over many iterations can decrease the severity of the migration, but one would still get skewed results. Pinning the process to a single core prevents this altogether.
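A minimal sketch of those two precautions, assuming Linux and GCC on x86-64 (it uses the __rdtsc()/__rdtscp() intrinsics rather than hand-written asm, and the dummy workload is just a stand-in for the code under test):
#define _GNU_SOURCE
#include <sched.h>      /* CPU_ZERO, CPU_SET, sched_setaffinity */
#include <x86intrin.h>  /* __rdtsc, __rdtscp */
#include <stdio.h>

int main(void)
{
    /* Pin the process to core 0 so the scheduler cannot migrate it mid-measurement. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    unsigned int aux;
    unsigned long long t0 = __rdtsc();
    volatile double x = 42.0;                 /* volatile keeps the dummy work alive */
    for (int i = 0; i < 1000; i++)
        x = x * 0.999 + 1.0;                  /* stand-in for the region being timed */
    unsigned long long t1 = __rdtscp(&aux);   /* waits for earlier instructions to finish */

    printf("%llu reference-clock ticks\n", t1 - t0);
    return 0;
}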
It looks to me that Strategy I uses the sqrtsd instruction, because the ucomisd instruction does not set the parity flag and the code jumps directly to .L4.
The for loop of Strategy II uses call sqrt to compute the square root. This may be an optimized version of sqrt, achieved through approximation, and thus faster than the call to sqrtsd. Some compilers might do this optimization.
Even if the call sqrt uses, under the hood, the sqrtsd instruction, there is no reason it should run faster outside of the loop.
Please take note that measuring the latency of a single instruction executed only once is not deterministic. The rdtsc instruction has a latency of its own, and because modern CPUs are superscalar and out-of-order, you cannot know that rdtsc, sqrtsd, rdtsc get executed completely in program order. They certainly do not execute on the same port, therefore the sqrtsd is not guaranteed to be complete at the time the second rdtsc completes.
Another important thing to consider is that when you execute sqrt in a loop, you decrease the average latency of the instructions. That's because you can have multiple instructions executing in parallel in different stages of the pipeline. Agner Fog's instruction tables show that Ivy Bridge's sqrtsd instruction can have a throughput ranging from 1/8 to 1/14 instructions per cycle. The for loop increases the average throughput, while the single isolated instruction will have the highest latency.
So, because of the pipelined parallel execution, the instructions in the loop will run faster on average than when running isolated.
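One way to see this throughput-versus-latency effect directly is to compare a loop whose sqrt inputs are independent with one where each iteration depends on the previous result (a sketch; function and variable names are made up):
#include <math.h>

/* The independent loop measures roughly the throughput of sqrtsd,
 * the dependent loop measures roughly its latency. */
double independent_sum(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqrt((double)i);   /* inputs don't depend on earlier sqrt results */
    return sum;
}

double dependent_chain(int n)
{
    double x = 2.0;
    for (int i = 0; i < n; i++)
        x = sqrt(x + i);          /* each sqrt must wait for the previous one */
    return x;
}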
The problem is with the readTSC() function. To verify, you may interchange Strategy I with Strategy II; you will then see that Strategy II takes more time. I think the readTSC() function needs more time when it runs for the first time.
I'm trying to do some code optimization to eliminate branches; the original C code is:
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1)
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
so I wrote C inline assembly code like below,
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
When I compile the code, I get the following errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of the compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca. 2001, and I'm sure it has been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some stuff in your code; now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8, and you should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine.
And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations; try other source changes first, e.g. ?: often compiles branchlessly, and booleans can be used as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless is probably a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path, so later work doesn't have to wait for the compare result.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower than -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obvious constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
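A sketch of that __builtin_constant_p() idea (the helper name is made up, and the asm body is the same cmp/adc trick shown further below):
/* Fall back to pure C when the inputs are compile-time constants,
 * so the compiler can constant-propagate; otherwise use the asm. */
static inline unsigned long next_k(unsigned long a, unsigned long b, unsigned long k)
{
    if (__builtin_constant_p(a) && __builtin_constant_p(b))
        return (k << 1) + (a < b);        /* compiler folds this at compile time */

    __asm__("cmpq %[b], %[a] \n\t"
            "adc  %[k], %[k]"
            : [k] "+r" (k)
            : [a] "r" (a), [b] "rme" (b)
            : "cc");
    return k;
}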
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline asm destroys the ability of the compiler to reuse any temporary results, or to spread out the instructions to mix with other compiler-generated code. (Instruction scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple other versions. (I used unsigned in this version because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing use 64-bit registers; xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding: (k<<1) + (a < b); gives us exactly the same xor/cmp/setb/lea sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3@funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3@funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m form of CMP. I used two alternatives that split things up not by opcode but by which side could include the possible memory operand. ("rme" is like "g" (rmi), but limited to 32-bit sign-extended immediates.)
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 5, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
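For example, something along these lines (a sketch only, using the same constraint style as above) hands the shift to the compiler and keeps only the flag-dependent part in asm:
unsigned long inlineasm_shifted(unsigned long a, unsigned long b, unsigned long k)
{
    unsigned long k2 = k << 1;      /* let the compiler pick shl/add/lea or fold it */
    __asm__("cmpq %[b], %[a] \n\t"
            "adcq $0, %[k2]"        /* k2 += (a < b), via the carry flag */
            : [k2] "+r" (k2)
            : [a] "r" (a), [b] "rme" (b)
            : "cc");
    return k2;
}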
Is < cheaper (faster) than <=, and similarly, is > cheaper (faster) than >=?
Disclaimer: I know I could measure but that will be on my machine only and I am not sure if the answer could be "implementation specific" or something like that.
TL;DR
There appears to be little-to-no difference between the four operators, as they all perform in about the same time for me (this may be different on different systems!). So, when in doubt, just use the operator that makes the most sense for the situation (especially when messing with C++).
So, without further ado, here is the long explanation:
Assuming integer comparison:
As far as the assembly generated, the results are platform dependent. On my computer (Apple LLVM Compiler 4.0, x86_64), the generated assembly is as follows:
a < b (uses 'setl'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setl %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a <= b (uses 'setle'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a > b (uses 'setg'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setg %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a >= b (uses 'setge'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
Which isn't really telling me much. So, we skip to a benchmark:
And ladies & gentlemen, the results are in. I created the following test program (I am aware that clock() isn't the best way to calculate results like this, but it'll have to do for now).
#include <time.h>
#include <stdio.h>

#define ITERS 100000000

int v = 0;

void testL()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i < v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testLE()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i <= v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testG()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i > v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testGE()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i >= v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

int main()
{
    testL();
    testLE();
    testG();
    testGE();
}
Which, on my machine (compiled with -O0), gives me this (5 separate runs):
testL: 337848
testLE: 338237
testG: 337888
testGE: 337787
testL: 337768
testLE: 338110
testG: 337406
testGE: 337926
testL: 338958
testLE: 338948
testG: 337705
testGE: 337829
testL: 339805
testLE: 339634
testG: 337413
testGE: 337900
testL: 340490
testLE: 339030
testG: 337298
testGE: 337593
I would argue that the differences between these operators are minor at best, and don't hold much weight in a modern computing world.
It varies. First, start by examining different instruction sets and how compilers use those instruction sets. Take the OpenRISC 32 for example, which is clearly MIPS-inspired but does conditionals differently. For the or32 there are compare-and-set-flag instructions: compare these two registers and, if less than or equal unsigned, set the flag; compare these two registers and, if equal, set the flag. Then there are two conditional branch instructions: branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less than, less than or equal, greater than, etc. are all going to use the same number of instructions, the same execution time for a taken conditional branch, and the same execution time for a not-taken conditional branch.
Now, it is definitely going to be true for most architectures that performing the branch takes longer than not performing the branch, because of having to flush and re-fill the pipe. Some do branch prediction, etc. to help with that problem.
Now, on some architectures the size of the instruction may vary: compare gpr0 and gpr1 vs. compare gpr0 and the immediate number 1234 may require a larger instruction. You will see this a lot with x86, for example. So although both cases may be a branch-if-less-than, how you encode the comparison, depending on which registers happen to hold which values, can make a performance difference (sure, x86 does a lot of pipelining, lots of caching, etc. to make up for these issues). Another similar example: on MIPS and or32, r0 is always zero; it is not really a general-purpose register - if you write to it, it doesn't change; it is hardwired to zero. So a compare-against-some-other-number MIGHT cost you more than a compare-against-0 if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen. The worst case is having to evict a register to the stack or memory to free up a register to put the immediate in, so that the compare can happen.
Some architectures have conditional execution like arm, for the full arm (not thumb) instructions you can on a per instruction basis execute, so if you had code
if(i==7) j=5; else j=9;
the pseudo code for arm would be
cmp i,#7
moveq j,#5
movne j,#9
there is no actual branch, so there are no pipeline issues; you flywheel right on through, very fast.
Comparing one architecture to another is interesting. Some, as mentioned (MIPS, or32), require you to specifically perform some sort of instruction for the comparison; for others, like x86, msp430, and the vast majority, each ALU operation changes the flags; ARM and the like change flags only if you tell them to, and otherwise don't, as shown above. So a
while(--len)
{
//do something
}
loop: the subtract of 1 also sets the flags, and if the stuff in the loop is simple enough you can make the whole thing conditional, so you save on separate compare and branch instructions and you save the pipeline penalty. MIPS solves this a little differently: compare and branch are one instruction, and it executes one instruction after the branch to save a little in the pipe.
The general answer is that you will not see a difference; the number of instructions, execution time, etc. are the same for the various conditionals. Special cases like small immediates vs. big immediates, etc. may have an effect in corner cases, or the compiler may simply choose to do it all differently depending on what comparison you do. If you try to re-write your algorithm to give the same answer but use a less-than instead of a greater-than-or-equal, you could be changing the code enough to get a different instruction stream. Likewise, if you write too simple a performance test, the compiler can/will optimize out the comparison completely and just generate the results, which could vary depending on your test code, causing different execution. The key to all of this is to disassemble the things you want to compare and see how the instructions differ. That will tell you whether you should expect to see any execution differences.
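As a concrete starting point for that kind of comparison, two one-line functions are enough (a sketch; compile each with your toolchain's equivalent of gcc -O2 -S and diff the output):
/* Disassemble these and compare: on most ISAs the only difference is the
 * condition chosen (e.g. setl vs. setle on x86), not the instruction count. */
int less(int a, int b)       { return a <  b; }
int less_equal(int a, int b) { return a <= b; }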