Is < cheaper (faster) than <=, and similarly, is > cheaper (faster) than >=?
Disclaimer: I know I could measure, but that would only apply to my machine, and I'm not sure whether the answer would just be "implementation specific" or something like that.
TL;DR
There appears to be little to no difference between the four operators; they all take about the same time for me (results may differ on other systems!). So, when in doubt, just use the operator that makes the most sense for the situation (especially when messing with C++).
So, without further ado, here is the long explanation:
Assuming integer comparison:
As far as the generated assembly goes, the results are platform dependent. On my computer (Apple LLVM Compiler 4.0, x86_64), the generated assembly is as follows:
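(Each listing below came from compiling a trivial function along these lines at low optimization; this is a reconstruction, since the original source wasn't included:)

int compare(void)
{
    int a = 10;
    int b = 15;
    return a < b;   /* swap in <=, > or >= for the other three variants */
}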
a < b (uses 'setl'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setl %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a <= b (uses 'setle'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a > b (uses 'setg'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setg %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a >= b (uses 'setge'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
Which isn't really telling me much. So, we skip to a benchmark:
And ladies & gentlemen, the results are in. I created the following test program (I am aware that clock() isn't the best way to measure something like this, but it'll have to do for now).
#include <time.h>
#include <stdio.h>

#define ITERS 100000000

int v = 0;

void testL()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i < v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testLE()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i <= v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testG()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i > v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

void testGE()
{
    clock_t start = clock();
    v = 0;
    for (int i = 0; i < ITERS; i++) {
        v = i >= v;
    }
    printf("%s: %lu\n", __FUNCTION__, clock() - start);
}

int main()
{
    testL();
    testLE();
    testG();
    testGE();
}
Which, on my machine (compiled with -O0), gives me this (5 separate runs):
testL: 337848
testLE: 338237
testG: 337888
testGE: 337787
testL: 337768
testLE: 338110
testG: 337406
testGE: 337926
testL: 338958
testLE: 338948
testG: 337705
testGE: 337829
testL: 339805
testLE: 339634
testG: 337413
testGE: 337900
testL: 340490
testLE: 339030
testG: 337298
testGE: 337593
I would argue that the differences between these operators are minor at best, and don't hold much weight in a modern computing world.
It varies. First, start by examining different instruction sets and how the compilers use them. Take the OpenRISC 32 for example, which is clearly MIPS-inspired but does conditionals differently. For the OR32 there are compare-and-set-flag instructions: compare these two registers and set the flag if less than or equal (unsigned); compare these two registers and set the flag if equal; and so on. Then there are two conditional branch instructions: branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less than, less than or equal, greater than, etc. are all going to use the same number of instructions, the same execution time for a taken conditional branch, and the same execution time for a not-taken one.
Now, it is definitely true for most architectures that taking the branch costs more than not taking it, because the pipeline has to be flushed and refilled. Some do branch prediction, etc., to help with that problem.
Now, on some architectures the size of the instruction may vary: compare gpr0 with gpr1 vs. compare gpr0 with the immediate number 1234 may require a larger instruction. You will see this a lot with x86, for example. So although both cases may be a branch-if-less-than, how you encode the comparison, depending on which registers happen to hold which values, can make a performance difference (sure, x86 does a lot of pipelining, lots of caching, etc. to make up for these issues). Another similar example is MIPS and OR32, where r0 is always zero; it is not really a general purpose register (if you write to it, it doesn't change; it is hardwired to zero). So a compare against some other number MIGHT cost you more than a compare against 0, if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen; the worst case is having to evict a register to the stack or memory to free it up for the immediate so that the compare can happen.
Some architectures, like ARM, have conditional execution: with the full ARM (not Thumb) instructions you can predicate execution on a per-instruction basis, so if you had the code
if(i==7) j=5; else j=9;
the pseudo code for arm would be
cmp i,#7
moveq j,#5
movne j,#9
There is no actual branch, so there are no pipeline issues; you flywheel right on through, very fast.
From one architecture to another, if that is an interesting comparison: some, as mentioned (MIPS, OR32), make you explicitly perform some sort of compare instruction; on others, like x86, MSP430 and the vast majority, every ALU operation changes the flags; ARM and the like change the flags only if you tell them to, otherwise they don't, as shown above. So a
while(--len)
{
//do something
}
loop, the subtract of 1 also sets the flags; if the stuff in the loop is simple enough, you could make the whole thing conditional, so you save on separate compare and branch instructions and you save the pipeline penalty. MIPS helps a little here by making compare-and-branch a single instruction, and by executing one instruction after the branch (the delay slot) to save a little in the pipe.
The general answer is that you will not see a difference: the number of instructions, execution time, etc. are the same for the various conditionals. Special cases like small immediates vs. big immediates may have an effect in corner cases, or the compiler may simply choose to do it all differently depending on which comparison you use. If you try to rewrite your algorithm to give the same answer but use a less-than instead of a greater-than-or-equal, you could be changing the code enough to get a different instruction stream. Likewise, if your performance test is too simple, the compiler can/will optimize the comparison away completely and just generate the results, which could vary depending on your test code and cause different execution. The key to all of this is to disassemble the things you want to compare and see how the instructions differ. That will tell you whether you should expect to see any execution differences.
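For example (a sketch with made-up function names), you can write two versions of the same routine that differ only in the comparison, compile each with gcc -S -O2, and diff the .s files to see whether the choice of operator changes the instruction stream at all:

/* Equivalent for any n > INT_MIN; only the comparison operator differs. */
int sum_lt(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

int sum_le(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i <= n - 1; i++)
        s += a[i];
    return s;
}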
Related
Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
    static const uint32_t tbl[] = { 0, 1, 2, 3 };
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r`. */
    return tbl[r];
}

uint32_t
lookup1(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r / 2`. */
    return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks that you need to map cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping for a generic solution that will cover most layouts.
Looking at how modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
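(The case0/case1 functions whose assembly appears below are the switch/case rewrites; they weren't shown in the question, so the following is a hypothetical reconstruction:)

#include <stdint.h>

uint32_t case0(uint32_t r) {
    switch (r) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    }
    __builtin_unreachable();
}

uint32_t case1(uint32_t r) {
    switch (r) {
    case 0:
    case 1: return 0;
    case 2:
    case 3: return 1;
    }
    __builtin_unreachable();
}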
lookup0(unsigned int): # #lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # #case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # #lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # #case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumption is:
All inputs have an index in the LUT.
All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
(Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut> but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The examples lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
The best "generic" solution I am aware of is the following:
int compute(int r)
{
    static const int T[] = {0,0,1,1};
    const int lut_size = sizeof(T) / sizeof(T[0]);
    int result = 0;
    for(int i=0 ; i<lut_size ; ++i)
        result += (r == i) * T[i];
    return result;
}
With -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:
int compute(int r)
{
    return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
GCC/Clang optimizers know that a multiplication like this can be replaced with conditional moves (developers often use this as a trick to guide compilers into generating assembly code without conditional branches).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies to GCC. There are no branches nor memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big, the loop will not be automatically unrolled. You can tell the compiler to use more aggressive unrolling via compilation flags. That being said, a LUT will likely be faster if the table is big, since having a huge amount of code to load and execute is slow in this pathological case.
You could pack the array into a long integer and use bit shifts and masking to extract the result.
For example, the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
    static const uint32_t tbl = (2u << 0)  | (0u << 8) |
                                (3u << 16) | (1u << 24);
    return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # #lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks that allow handling larger arrays, like using longer integers (i.e. uint64_t or the __uint128_t extension).
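As a sketch of that longer-integer variant (the table values here are made up; eight one-byte entries fit in a uint64_t):

#include <stdint.h>

uint32_t lookup8(uint32_t r) {
    /* Hypothetical 8-entry table {0,1,1,2,2,3,3,4}, one byte per entry. */
    static const uint64_t tbl =
        (0ull <<  0) | (1ull <<  8) | (1ull << 16) | (2ull << 24) |
        (2ull << 32) | (3ull << 40) | (3ull << 48) | (4ull << 56);
    return (uint32_t)((tbl >> (8 * r)) & 0xff);
}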
Another approach is to split the bits of the values in the array, e.g. into a high and a low byte, look them up separately, and combine the results using bitwise operations.
In the C language, what is the assembly for b++?
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
    int ret = 0;
    int i = 2;
    ret = i++;
    ret = ++i;
    return ret;
}
The .s file (++i uses the addl instruction, i++ uses the other sequence):
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",#progbits
The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++, since the result of the former can be safely thrown away.
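For instance (a sketch; the function names are made up), with a global b only the second case has to preserve the pre-increment value:

int b;

void discard(void) { b++; }          /* value thrown away: a plain increment will do */
int  keep(void)    { return b++; }   /* old value must be kept and returned */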
You're also likely to see massive differences based on optimisation level.
Bottom line: short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation level), the question can't be answered definitively.
The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then ret = i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand, mixing an opcode from one line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using ret = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)
Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.
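For instance (function names made up), these two loops compile to identical code with any reasonable compiler, because the value of the increment expression is never used:

void tick(int);   /* some external work so the loop isn't optimized away entirely */

void loop_post(int n) { for (int i = 0; i < n; i++) tick(i); }
void loop_pre(int n)  { for (int i = 0; i < n; ++i) tick(i); }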
It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0, set up a basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.
I'm trying to figure out how to convert this x86 assembly code to Y86 form:
Given the C program:
int sum(int x) {
    if (x == 0 || x == 1) {
        return 1;
    } else {
        return x + sum(x - 1);
    }
}
The following x86-64 assembly code is generated:
sum:
cmpl $1, %edi
ja .L8
movl $1, %eax
ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call sum
addl %ebx, %eax
popq %rbx
ret
How can I convert this to Y86-64 assembly code that does the same thing?
Thank you!
In this case, you can convert by replacing each instruction with a short sequence of y86 instructions which does exactly the same thing.
y86 is Turing complete, but very crippled, so in general you can't always easily convert. Some single x86 instructions might need an entire loop or very long function to implement, but that's not the case for any of your instructions. Each of them can be transliterated to one or a few y86 instructions. (Some might need a scratch register; I forget if y86 has compare with immediate or only mov-immediate to register.)
Your code doesn't have any multiplies, shifts, or bsf, or floating-point, or anything else that y86 doesn't have (and would need a loop to emulate).
Look up each x86 instruction in the instruction-set reference manual (like this online version, or this older one where not having AVX/AVX2 instructions means less to wade through. See also the x86 tag wiki for links to Intel and AMD's PDF manuals.) Look at the Operation section where pseudo-code describes the exact effect of the instruction on the architectural state. That's the behaviour you want to implement using y86 instructions.
As an example, I forget if y86 has push / pop, but if not you can always manipulate rsp directly and load/store. e.g. sub $8, %rsp ; movrm %rbx, (%rsp) is push (except it clobbers flags where x86's push doesn't).
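For what it's worth, the standard Y86-64 subset (as in CS:APP) does have pushq/popq and call/ret, but no compare instruction and no immediate ALU operands, so constants have to go through irmovq into a scratch register first. A rough, untested sketch along those lines (it assumes x is non-negative, so the unsigned ja can be replaced by a signed greater-than test):

sum:
    irmovq $1, %rax          # return value for the base case
    rrmovq %rdi, %rcx        # rcx = x (scratch copy)
    irmovq $1, %rdx
    subq %rdx, %rcx          # rcx = x - 1, sets the condition codes
    jg recurse               # x > 1: recursive case
    ret                      # x == 0 or x == 1: return 1
recurse:
    pushq %rbx
    rrmovq %rdi, %rbx        # save x across the call
    irmovq $1, %rdx
    subq %rdx, %rdi          # rdi = x - 1 (argument for the recursive call)
    call sum
    addq %rbx, %rax          # rax = x + sum(x - 1)
    popq %rbx
    ret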
While fiddling with simple C code, I noticed something strange. Why does ICC produce incl %eax in the assembly generated for an increment, instead of addl $1, %eax? GCC behaves as expected, using add.
Example code (-O3 used on both GCC and ICC)
int A, B, C, D, E;

void foo()
{
    A = B + 1;
    B = 0;
    C++;
    D++;
    D++;
    E += 2;
}
Result on ICC
L__routine_start_foo_0:
foo:
movl B(%rip), %eax #5.13
movl D(%rip), %edx #8.9
incl %eax #5.17
movl E(%rip), %ecx #10.9
addl $2, %edx #9.9
addl $2, %ecx #10.9
movl %eax, A(%rip) #5.9
movl $0, B(%rip) #6.9
incl C(%rip) #7.9
movl %edx, D(%rip) #9.9
movl %ecx, E(%rip) #10.9
ret
As such, I'm wondering - is this an intended feature, a bug or some quirk resulting from some specific setting? If add is (supposedly) better due to flags update or efficiency (which is the conclusion based on the links below) - why does ICC use inc?
Related:
Relative performance of x86 inc vs. add instruction
Is ADD 1 really faster than INC ? x86
GCC doesn't make use of inc
Note:
I'm asking this question explicitly because none of the questions I found or was directed to on SO explains this behaviour. My previous question concerning this matter got closed because, supposedly, it's trivial and has been answered. I don't find it trivial. I didn't find an answer in all of the links and answers given. It's not another "how to plug my mouse into my PC" problem. All of the questions explain why add is/could be better on new x86 processors or why GCC uses it, but none concerns ICC.
Any insight on ICC design choices would be also very welcome.
PS I don't consider "it does it because it does" a valid answer.
It is not unreasonable to assume at this point that incl was selected because its encoding is shorter: in 64-bit code, incl %eax is two bytes (0xff 0xc0) versus three for addl $1, %eax (0x83 0xc0 0x01).
I have code that is doing a lot of these comparison operations. I was wondering which is the most efficient one to use. Is it likely that the compiler will correct it if I intentionally choose the "wrong" one?
int a, b;
// Assign a value to a and b.
// Now check whether either is zero.
// The worst?
if (a * b == 0) // ...
// The best?
if (a & b == 0) // ...
// The most obvious?
if (a == 0 || b == 0) // ...
Other ideas?
In general, if there's a fast way of doing a simple thing, you can assume the compiler will do it that fast way. And remember that the compiler is outputting machine language, not C -- the fastest method probably can't be correctly represented as a set of C constructs.
Also, the third method there is the only one that always works. The first one fails if a and b are 1<<16, and the second you already know doesn't work.
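To see why 1<<16 breaks the multiply version, consider this small sketch (with 32-bit int, the product is 2^32, which does not fit):

#include <stdio.h>

int main(void)
{
    int a = 1 << 16;   /* 65536 */
    int b = 1 << 16;
    /* Signed overflow is undefined behaviour; in practice the product
     * typically wraps to 0, so the test "a * b == 0" wrongly concludes
     * that one of the operands is zero. */
    if (a * b == 0)
        puts("multiply test claims one of them is zero (it isn't)");
    return 0;
}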
It's possible to see which variant generates fewer assembly instructions, but it's a separate matter to see which one actually executes in less time.
To help you analyze the first matter, learn to use your C compiler's command-line flags to capture its intermediate output. GCC is a common choice for a C compiler. Let's look at its unoptimized assembly code for two different programs.
#include <stdio.h>

void report_either_zero()
{
    int a = 1;
    int b = 0;
    if (a == 0 || b == 0)
    {
        puts("One of them is zero.");
    }
}
Save that text to a file such as zero-test.c, and run the following command:
gcc -S zero-test.c
GCC will emit a file called zero-test.s, which is the assembly code it would normally submit to the assembler as it generates object code.
Let's look at the relevant fragment of the assembly code. I'm using gcc version 4.2.1 on Mac OS X generating x86 64-bit instructions.
_report_either_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $32, %rsp
Ltmp2:
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $1, -20(%rbp) // a = 1
movl $0, -24(%rbp) // b = 0
movl -20(%rbp), %eax // Get ready to compare a.
cmpl $0, %eax // Does zero equal a?
je LBB1_2 // If so, go to label LBB1_2.
movl -24(%rbp), %eax // Otherwise, get ready to compare b.
cmpl $0, %eax // Does zero equal b?
jne LBB1_3 // If not, go to label LBB1_3.
LBB1_2:
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_3:
addq $32, %rsp
popq %rbp
ret
Leh_func_end1:
You can see where we load the integer values 1 and 0 into registers, then prepare to compare the first to zero, and then again with the second if the first is nonzero.
Now let's try a different approach with the comparison, to see how the assembly code changes. Note that this is not the same predicate; this one checks whether both numbers are zero.
#include <stdio.h>

void report_both_zero()
{
    int a = 1;
    int b = 0;
    if (!(a | b))
    {
        puts("Both of them are zero.");
    }
}
The assembly code is a little different:
_report_both_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
movl $1, -4(%rbp) // a = 1
movl $0, -8(%rbp) // b = 0
movl -4(%rbp), %eax // Get ready to operate on a.
movl -8(%rbp), %ecx // Get ready to operate on b too.
orl %ecx, %eax // Combine a and b via bitwise OR.
cmpl $0, %eax // Does zero equal the result?
jne LBB1_2 // If not, go to label LBB1_2.
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_2:
addq $16, %rsp
popq %rbp
ret
Leh_func_end1:
If the first number is zero, the first variant does less work—in terms of the number of assembly instructions involved—by avoiding a second register move. If the first number is not zero, the second variant does less work by avoiding a second comparison to zero.
The question now is whether "move, move, bitwise or, compare" runs faster than "move, compare, move, compare." The answer could come down to things like whether the processor learns to predict how often the first integer is zero, and whether it is zero consistently or not.
If you ask the compiler to optimize this code, the example is too simple; the compiler decides at compile time that no comparison is necessary, and just condenses that code to an unconditional request to write the string. It's interesting to change the code to operate on parameters rather than constants, and see how the optimizer handles the situation differently.
Variant one:
#include <stdio.h>

void report_either_zero(int a, int b)
{
    if (a == 0 || b == 0)
    {
        puts("One of them is zero.");
    }
}
Variant two (again, a different predicate):
#include <stdio.h>

void report_both_zero(int a, int b)
{
    if (!(a | b))
    {
        puts("Both of them are zero.");
    }
}
Generate the optimized assembly code with this command:
gcc -O -S zero-test.c
Let us know what you find.
This will not likely have much (if any, given modern compiler optimizers) effect on the overall performance of your app. If you really must know, you should write some code to test the performance of each for your compiler. However, as a best guess, I'd say...
if ( !( a && b ) )
This will short-circuit if the first happens to be 0.
If you want to find whether or not one of two integers is zero using one comparison instruction ...
if ((a << b) == a)
If a is zero, then no amount of shifting it to the left will change its value.
If b is zero, then there is no shifting performed.
It is possible (I am too lazy to check) that there is some undefined behaviour should b be negative or really large.
However, due to the non-intuitiveness, it would be strongly recommended to implement this as a macro (with an appropriate comment).
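Something along these lines, say (the macro name is hypothetical; it assumes b is non-negative and smaller than the bit width of a):

/* True if a == 0 or b == 0; see the caveats about shift amounts above. */
#define EITHER_IS_ZERO(a, b)  (((a) << (b)) == (a))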
Hope this helps.
The most efficient is certainly the most obvious, if by efficiency you mean the programmer's time.
If by efficiency you mean the processor's time, profiling your candidate solutions is the best way to answer, for the target machine you profiled.
But this exercise demonstrates a pitfall of programmer optimization: the 3 candidates are not functionally equivalent for all int values.
If you want a functionally equivalent alternative...
I think the last candidate and a 4th one deserve comparison.
if ((a == 0) || (b == 0))
if ((a == 0) | (b == 0))
Due to the variation of compilers, optimization and CPU branch prediction, one should profile, rather than pontificate, to determine relative performance. OTOH, a good optimizing compiler may give you the same code for both.
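If you do decide to profile, a minimal harness in the spirit of the clock()-based test earlier in this thread might look like this (a sketch only, with made-up names, not a rigorous benchmark):

#include <stdio.h>
#include <time.h>

#define ITERS 100000000

volatile int sink;   /* volatile so the loops aren't optimized away */

int main(void)
{
    clock_t t;

    t = clock();
    for (int i = 0; i < ITERS; i++)
        sink = (i == 0) || (sink == 0);
    printf("||: %lu\n", (unsigned long)(clock() - t));

    t = clock();
    for (int i = 0; i < ITERS; i++)
        sink = (i == 0) | (sink == 0);
    printf("| : %lu\n", (unsigned long)(clock() - t));

    return 0;
}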
I recommend the code that is easiest to maintain.
There's no "most efficient way to do it in C", if by "efficiency" you mean the efficiency of the compiled code.
Firstly, even if we assume that the compiler translates C language operators into their "obvious" machine counterparts (i.e. C multiplication into machine multiplication, etc.), the efficiency of each method will differ from one hardware platform to another. Even if we restrict our consideration to a very specific sequence of instructions on a very specific hardware platform, it still can exhibit different performance in different surrounding contexts, depending, for example, on how well the whole thing agrees with the branch prediction heuristic in the given CPU.
Secondly, modern C compilers rarely translate C operators into their "obvious" machine counterparts. Often the instructions used in machine code will have very little in common with the C code. It is possible that many "completely different" methods of performing the check at C level will actually be translated into the same sequence of machine instructions by a smart compiler. At the same time, the same C code might get translated into different sequences of machine instructions when the surrounding contexts are different.
In other words, there's no meaningful answer to your question, unless you really really really localize it to a specific hardware platform, specific compiler version and specific set of compilation settings. And that will make it too localized to be useful.
That usually means that in the general case the best way to do it is to write the most readable code. Just do
if (a == 0 || b == 0)
The readability of the code will not only help the human reader to understand it, but will also increase the probability of the compiler properly interpreting your intent and generating the most optimal code.
But if you really have to squeeze the last CPU cycle out of your performance-critical code, you have to try different versions and compare their relative efficiency manually.