Assembly: CMOVB instruction in Intel x86-64 assembly - c

I'm a little confused about what "cmovb" does in this assembly code
leal (%rsi, %rsi), %eax // %eax <- %rsi + %rsi
cmpl %esi, %edi // compare %edi and %esi
cmovb %edi, %eax
ret
and the C code for this is:
int foo(unsigned int a, unsigned int b)
{
if(a < b)
return a;
else
return 2*b;
}
Can anyone help me understand how cmovb works here?

Like Jester commented to the question, the cmov* family of instructions are conditional moves, paired via the flags register with a previous (comparison) operation.
You can use for example the Intel documentation as a reference for the x86-64/AMD64 instruction set. The conditional move instructions are shown on page 172 of the combined volume.
cmovb, cmovnae, and cmovc all perform the same way: If the carry flag is set, they move the source operand to the destination operand. Otherwise they do nothing.
If we then look at the preceding instructions that affect flags, we'll see that the cmp instruction (the l suffix is part of AT&T syntax, and means the arguments are "longs") changes the set of flags depending on the difference between the two arguments. In particular, if the second is smaller than the first (in AT&T syntax), the carry flag is set, otherwise the carry flag is cleared; just as if a subtraction was performed without storing the result anywhere. (The cmp instruction affects other flags as well, but they are ignored by the code.)
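To make the flag logic concrete, here is a minimal C sketch that models the three instructions one at a time. The name foo_model and the explicit carry variable are only illustrative (the compiler emits nothing like them), and unsigned arithmetic is used throughout because cmovb tests the carry flag, which reflects an unsigned comparison.

```c
#include <stdint.h>

/* Hypothetical step-by-step model of the asm in the question. */
uint32_t foo_model(uint32_t a /* %edi */, uint32_t b /* %esi */)
{
    uint32_t eax = b + b;   /* leal (%rsi,%rsi), %eax : eax = 2*b          */
    int carry = a < b;      /* cmpl %esi, %edi : CF = borrow out of a - b  */
    if (carry) {            /* cmovb %edi, %eax : move only if CF is set   */
        eax = a;
    }
    return eax;             /* ret : the result is whatever is in %eax     */
}
```

Calling foo_model(1, 2) takes the conditional move and returns a, while foo_model(5, 3) leaves 2*b in place.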

C MOV B = Conditional MOVe if Below (Carry Flag Set). It literally does what it says: if the condition is met, then move. The condition is a<b and the value moved is a.
The ABI returns the value in %eax, so the code first stores 2*b there and then conditionally overwrites it with a.

Related

how to derive the types of the following data types from assembly code

I came across an exercise, as I am still trying to familiarise myself with assembly code.
I am unsure how to derive the types for a given struct, given the assembly code and the skeleton c code. Could someone teach me how this should be done?
This is the assembly code, where rcx and rdx hold the arguments i and j respectively.
randFunc:
movslq %ecx,%rcx // move i into rcx
movslq %edx, %rdx // move j into rdx
leaq (%rcx,%rcx,2), %rax //3i into rax
leaq (%rdx,%rdx,2), %rdx // 3j into rdx
salq $5, %rax // shift arith left 32? so 32*3i = 96i
leaq (%rax,%rdx,8), %rax //24j + 96i into rax
leaq matrixtotest(%rip), %rdx //store address of the matrixtotest in rdx
addq %rax, %rdx //jump to 24th row, 6th column variable
cmpb $10, 2(%rdx) //add 2 to that variable and compare to 10
jg .L5 //if greater than 10 then go to l5
movq 8(%rdx), %rax // else add 8 to the rdx number and store in rax
movzwl (%rdx), %edx //move the val in rdx (unsigned) to edx as an int
subl %edx, %eax //take (val+8) -(val) = 8? (not sure)
ret
.L5:
movl 16(%rdx),%eax //move 1 row down and return? not sure about this
ret
This is the C code:
struct mat{
typeA a;
typeB b;
typeC c;
typeD d;
};
struct mat matrixtotest[M][N];
int randFunc(int i, int j){
return __1__? __2__ : __3__;
}
How do I derive the types of the variables a,b,c,d? And what is happening in the 1) 2) 3) parts of the return statement ?
Please help me, I'm very confused about what's happening and how to derive the types of the struct from this assembly.
Any help is appreciated, thank you.
Due to the cmpb $10, 2(%rdx) you have a byte sized something at offset 2. Due to the movzwl (%rdx), %edx you have a 2 byte sized unsigned something at offset 0. Due to the movq 8(%rdx), %rax you have a 8 byte sized something at offset 8. Finally due to the movl 16(%rdx),%eax you have a 4 byte sized something at offset 16. Now sizes don't map to types directly, but one possibility would be:
struct mat{
uint16_t a;
int8_t b;
int64_t c;
int32_t d;
};
You can use unsigned short, signed char, long, int if you know their sizes.
The size of the structure is 24 bytes, with padding at the end due to alignment requirement of the 8 byte field. From the 96i you can deduce N=4 probably. M is unknown. As such 24j + 96i accesses item matrixtotest[i][j]. The rest should be clear.
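Putting that together, one possible reconstruction looks like the sketch below. It assumes a typical x86-64 ABI where int64_t is 8-byte aligned, and M = 3 is an arbitrary placeholder because the asm does not reveal it. Other type choices with the same sizes and signedness would fit the asm equally well.

```c
#include <stddef.h>
#include <stdint.h>

struct mat {
    uint16_t a;   /* offset 0:  movzwl (%rdx), %edx  (2 bytes, zero-extended) */
    int8_t   b;   /* offset 2:  cmpb $10, 2(%rdx) + jg (1 byte, signed)       */
    int64_t  c;   /* offset 8:  movq 8(%rdx), %rax   (8 bytes)                */
    int32_t  d;   /* offset 16: movl 16(%rdx), %eax  (4 bytes)                */
};

enum { M = 3, N = 4 };   /* N = 96 / 24; M is a made-up value */

struct mat matrixtotest[M][N];

int randFunc(int i, int j)
{
    const struct mat *e = &matrixtotest[i][j];
    /* jg taken -> return d; otherwise the low 32 bits of c minus a (subl) */
    return e->b > 10 ? e->d : (int)(e->c - e->a);
}
```

Checking sizeof(struct mat) and offsetof for each member against 24, 0, 2, 8 and 16 is a quick way to confirm that a candidate layout matches the offsets in the asm.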
How do I derive the types of the variables a,b,c,d?
You want to see how variables are used, which will give you a very strong indication as to their size & sign.  (These indications are not always perfect, but the best we can do with limited information, i.e. missing source code, and will suffice for your exercise.)
So, just work the code, one instruction after another to see what they do by the definitions they have in the assembler and their mapping to the instruction set, paying particular attention to the sizes, signs, and offsets specified by the instructions.
Let's start for example with the first instruction: movslq ecx, rcx — this is saying that the first parameter (which is found in ecx), is a 32-bit signed number.
Since rcx is Windows ABI first parameter, and the assembly code is asking for ecx to be sign extended into rcx, then we know that this parameter is a signed 32-bit integer.  And you proceed to the next instruction, to glean what you can from it — and so on.
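As a rough illustration of what the extension instructions say about signedness, the two helpers below (hypothetical names) mirror movslq and movzwl in C. The cast to int32_t relies on the usual two's complement behaviour of gcc and clang for out-of-range values.

```c
#include <stdint.h>

/* movslq: copy the sign bit of a 32-bit value into the upper 32 bits. */
int64_t sign_extend_32_to_64(uint32_t bits)
{
    return (int64_t)(int32_t)bits;   /* implementation-defined cast, wraps on gcc/clang */
}

/* movzwl: fill the upper 16 bits with zeros. */
uint32_t zero_extend_16_to_32(uint16_t bits)
{
    return (uint32_t)bits;
}
```

A sign-extending load points to a signed source type, and a zero-extending load points to an unsigned one.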
And what is happening in the 1) 2) 3) parts of the return statement ?
The ?: operator is a ternary operator known as a conditional.  If the condition, placeholder __1__, is true, it will choose the __2__ value and if false it will choose __3__.  This is usually (but not always) organized as an if-then-else branching pattern, where the then-part represents placeholder __2__ and the else part placeholder __3__.
That if-then-else branching pattern looks something like this in assembly/machine code:
if <condition> /* here __1__ */ is false goto elsePart;
<then-part> // here __2__
goto ifDone;
elsePart:
<else-part> // here __3__
ifDone:
So, when you get to an if-then-else construct, you can fit that into the ternary operator place holders.
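The same pattern can be written in C with goto, which makes the correspondence with the ternary operator easy to check. This is only a sketch, and the function names are made up for the illustration.

```c
/* The ternary form, as in the skeleton: __1__ ? __2__ : __3__ */
int pick_ternary(int cond, int then_val, int else_val)
{
    return cond ? then_val : else_val;
}

/* The same logic laid out the way a compiler typically branches. */
int pick_goto(int cond, int then_val, int else_val)
{
    int result;
    if (!cond) goto elsePart;   /* if <condition> is false goto elsePart */
    result = then_val;          /* then-part, placeholder __2__          */
    goto ifDone;
elsePart:
    result = else_val;          /* else-part, placeholder __3__          */
ifDone:
    return result;
}
```

Both functions return the same value for any inputs.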
That code is nicely commented, but the comments are somewhat lacking in size, sign, and offset information.  So follow along and derive that missing information from the way the instructions tell the CPU what sizes, signs, and offsets to use.
As Jester describes, if the code indexes into the array, because it is two-dimensional, it uses two indexes.  The indexing takes the given indexes and computes the address of the element.  As such, the first index finds the row, and so must skip ahead one row for each value in the index.  The second index must skip ahead one element for each value in the index.  Thus, by the formula in the comments: 24j + 96i, we can say that a row is 96 bytes long and an element (the struct) is 24 bytes long.
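Here is a small sketch of that address arithmetic, following the leaq/salq chain from the question one instruction at a time. The function names are hypothetical, and the shift is written as a multiplication so the C stays well defined.

```c
/* Byte offset as the asm computes it. */
long offset_as_computed(long i, long j)
{
    long rax = i + i * 2;   /* leaq (%rcx,%rcx,2), %rax : 3i        */
    long rdx = j + j * 2;   /* leaq (%rdx,%rdx,2), %rdx : 3j        */
    rax *= 32;              /* salq $5, %rax            : 96i       */
    rax = rax + rdx * 8;    /* leaq (%rax,%rdx,8), %rax : 96i + 24j */
    return rax;
}

/* Byte offset from the usual row-major formula for elem[i][j]. */
long offset_row_major(long i, long j, long row_len, long elem_size)
{
    return (i * row_len + j) * elem_size;
}
```

With a row length of 4 elements and an element size of 24 bytes, the two functions agree, which is how the 96 and the 24 lead to N = 4 and sizeof(struct mat) == 24.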

How do I translate an optimized x86-64 asm loop back to a C for loop?

I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
long i, result = 0;
for (i= ; ; ){
//??
}
return result;
}
I don't know what cmpq %rsi, %rdi mean.
Why isn't there another %eax for long i?
I would love some help in figuring this out. I don't know what I'm missing - I've been going through my notes, textbook, and the rest of the internet and I am stuck. It's a review question, and I've been at it for hours.
Assuming this is a function taking 2 parameters. Assuming this is using the gcc amd64 calling convention, it will pass the two parameters in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
//movl $0, %eax
long result = 0; /* rax */
//cmpq %rsi, %rdi
//jle .L2
if (x > y) {
do {
//addq %rdi, %rax
result += x;
//subq $1, %rdi
--x;
//cmp %rdi, rsi
//jl .L3
} while (x > y);
}
return result;
}
I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (@BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify its input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or y<i, i.e. i>y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
long i, result = 0;
for (i= x ; i>y ; i-- ){
result += i;
}
return result;
}
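As a sanity check, the asm can also be transliterated instruction by instruction into C with goto and compared against the for loop. This is only a sketch: foo_asm_model and foo_for are made-up names, and the register variables simply stand in for RDI, RSI and RAX.

```c
/* One C statement per asm instruction (or cmp/jcc pair). */
long foo_asm_model(long rdi, long rsi)
{
    long rax = 0;              /* movl $0, %eax                      */
    if (rdi <= rsi) goto L2;   /* cmpq %rsi, %rdi ; jle .L2          */
L3:
    rax += rdi;                /* addq %rdi, %rax                    */
    rdi -= 1;                  /* subq $1, %rdi                      */
    if (rsi < rdi) goto L3;    /* cmpq %rdi, %rsi ; jl .L3           */
L2:
    return rax;                /* rep ret                            */
}

/* The reconstructed for loop. */
long foo_for(long x, long y)
{
    long i, result = 0;
    for (i = x; i > y; i--) {
        result += i;
    }
    return result;
}
```

The two versions should agree for every input pair, including the cases where x <= y and the loop body never runs.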
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
long i, result = 0;
for (i= y ; x>i ; x-- ){
result += x;
}
return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

Understanding the difference between ++i and i++ at the Assembly Level

I know that variations of this question have been asked here multiple times, but I'm not asking what the difference between the two is. I would just like some help understanding the assembly behind both forms.
I think my question is more related to the whys than to the what of the difference.
I'm reading Prata's C Primer Plus and in the part dealing with the increment operator ++ and the difference between using i++ or ++i the author says that if the operator is used by itself, such as ego++; it doesn't matter which form we use.
If we look at the disassembly of the following code (compiled with Xcode, Apple LLVM version 9.0.0 (clang-900.0.39.2)):
int main(void)
{
int a = 1, b = 1;
a++;
++b;
return 0;
}
we can see that indeed the form used doesn't matter, since the assembly code is the same for both (both variables would print out a 2 to the screen).
Initialization of a and b:
0x100000f8d <+13>: movl $0x1, -0x8(%rbp)
0x100000f94 <+20>: movl $0x1, -0xc(%rbp)
Assembly for a++:
0x100000f9b <+27>: movl -0x8(%rbp), %ecx
0x100000f9e <+30>: addl $0x1, %ecx
0x100000fa1 <+33>: movl %ecx, -0x8(%rbp)
Assembly for ++b:
0x100000fa4 <+36>: movl -0xc(%rbp), %ecx
0x100000fa7 <+39>: addl $0x1, %ecx
0x100000faa <+42>: movl %ecx, -0xc(%rbp)
Then the author states that when the operator and its operand are part of a larger expression, as in an assignment statement for example, the choice of prefix or postfix does make a difference.
For example:
int main(void)
{
int a = 1, b = 1;
int c, d;
c = a++;
d = ++b;
return 0;
}
This would print 1 and 2 for c and d, respectively.
And:
Initialization of a and b:
0x100000f46 <+22>: movl $0x1, -0x8(%rbp)
0x100000f4d <+29>: movl $0x1, -0xc(%rbp)
Assembly for c = a++; :
0x100000f54 <+36>: movl -0x8(%rbp), %eax // eax = a = 1
0x100000f57 <+39>: movl %eax, %ecx // ecx = 1
0x100000f59 <+41>: addl $0x1, %ecx // ecx = 2
0x100000f5c <+44>: movl %ecx, -0x8(%rbp) // a = 2
0x100000f5f <+47>: movl %eax, -0x10(%rbp) // c = eax = 1
Assembly for d = ++b; :
0x100000f62 <+50>: movl -0xc(%rbp), %eax // eax = b = 1
0x100000f65 <+53>: addl $0x1, %eax // eax = 2
0x100000f68 <+56>: movl %eax, -0xc(%rbp) // b = eax = 2
0x100000f6b <+59>: movl %eax, -0x14(%rbp) // d = eax = 2
Clearly the assembly code is different for the assignments:
The form c = a++; includes the use of the registers eax and ecx. It uses ecx for performing the increment of a by 1, but uses eax for the assignment.
The form d = ++b; uses eax for both the increment of b by 1 and the assignment.
My question is:
Why is that?
What determines that c = a++; requires two registers instead of just one (ecx for example)?
In the following statements:
a++;
++b;
the values of the expressions a++ and ++b are not used. Here the compiler is actually only interested in the side effects of these operators (i.e.: incrementing the operand by one). In this context, both operators behave in the same way. So, it's no wonder that these statements result in the same assembly code.
However, in the following statements:
c = a++;
d = ++b;
the evaluation of the expressions a++ and ++b is relevant to the compiler because they have to be stored in c and d, respectively:
d = ++b;: b is incremented and the result of this increment assigned to d.
c = a++; : the value of a is first assigned to c and then a is incremented.
Therefore, these operators behave differently in this context. So, it would make sense to result in different assembly code, at least in the beginning, without more aggressive optimizations enabled.
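Spelled out as separate steps, the two assignments look roughly like the sketch below. The helper names are hypothetical, and pointers are used only so that the effect on both variables can be observed from outside.

```c
/* c = a++; the old value must survive the increment, so two values are live. */
void assign_post(int *a, int *c)
{
    int old = *a;             /* load a                          */
    int incremented = old + 1;
    *a = incremented;         /* store the new value back to a   */
    *c = old;                 /* c gets the value from before    */
}

/* d = ++b; only the incremented value is needed afterwards. */
void assign_pre(int *b, int *d)
{
    int incremented = *b + 1; /* load b and add one              */
    *b = incremented;         /* store the new value back to b   */
    *d = incremented;         /* d gets the same new value       */
}
```

Starting from a = 1 and b = 1, assign_post leaves c == 1 and a == 2, while assign_pre leaves d == 2 and b == 2.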
A good compiler would replace this whole code with c = 1; d = 2;. And if those variables aren't used in turn, the whole program is one big NOP - there should be no machine code generated at all.
But you do get machine code, so you are not enabling the optimizer correctly. Discussing the efficiency of non-optimized C code is quite pointless.
Discussing a particular compiler's failure to optimize the code might be meaningful, if a specific compiler is mentioned. Which isn't the case here.
All this code shows is that your compiler isn't doing a good job, possibly because you didn't enable optimizations, and that's it. No other conclusions can be made. In particular, no meaningful discussion about the behavior of i++ versus ++i is possible.
Your test has flaws: the compiler optimized your code by replacing your values with what could be easily predicted.
The compiler can, and will, calculate the result in advance during compilation and avoid the use of 'jmp' instructions (jumping back to the top of the while loop each time the condition is still true).
If you try this code:
int a = 0;
int i = 0;
while (i++ < 10)
{
a += i;
}
The assembly will not use a single jmp instruction.
It will directly assign the value of ½ n (n + 1), here (0.5 * 10 * 11) = 55, to the register holding the value of the 'a' variable.
You would have the following assembly output:
mov eax, 55 ; a register
mov ecx, 11 ; i register, this line only if i is still used after.
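If you want to check the constants yourself, a small sketch like the following runs the loop and compares it with the closed form. The function names and the final_i output parameter are made up for the illustration.

```c
/* Runs the loop from the example and reports the final value of i. */
int sum_by_loop(int n, int *final_i)
{
    int a = 0;
    int i = 0;
    while (i++ < n) {   /* compares the old i, then increments it */
        a += i;         /* adds the already incremented i         */
    }
    *final_i = i;       /* the failing test still increments i    */
    return a;
}

/* Closed form n(n + 1)/2 that the optimizer can fold to a constant. */
int sum_closed_form(int n)
{
    return n * (n + 1) / 2;
}
```

For a constant n, both functions produce the same sum, which is the value an optimizing compiler can load directly into the register.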
Whether you write :
int i = 0;
while (i++ < 10)
{
...
}
or
int i = 0;
while (++i < 11)
{
...
}
will also result in the same assembly output.
If you had a much more complex code you would be able to witness differences in the assembly code.
a = ++i;
would translate into :
inc rcx ; increase i by 1, RCX holds the current value of i.
mov rax, rcx ; a = i;
and a = i++; into :
mov rax, rcx ; a = i (the original value)
inc rcx ; increase i by 1
Both the expressions ++i and i++ have the effect of incrementing i. The difference is that ++i produces a result (a value stored somewhere, for example in a machine register, that can be used within other expressions) equal to the new value of i, whereas i++ produces a result equal to the original value of i.
So, assuming we start with i having a value of 2, the statement
b = ++i;
has the effect of setting both b and i equal to 3, whereas
b = i++;
has the effect of setting b equal to 2 and i equal to 3.
In the first case, there is no need to keep track of the original value of i after incrementing i whereas in the second there is. One way of doing this is for the compiler to employ an additional register for i++ compared with ++i.
This is not needed for a trivial expression like
i++;
since the compiler can immediately detect that the original value of i will not be used (i.e. is discarded).
For simple expressions like b = i++ the compiler could - in principle at least - avoid using an additional register, by simply storing the original value of i in b before incrementing i. However, in slightly more complex expressions such as
c = i++ - *p++; // p is a pointer
it can be much more difficult for the compiler to eliminate the need to store old and new values of i and p (unless, of course, the compiler looks ahead and determines how (or if) c, i, and p (and *p) are being used in subsequent code). In more complex expressions (involving multiple variables and interacting operations) the analysis needed can be significant.
It then comes down to implementation choices by developers/designers of the compiler. Practically, compiler vendors compete pretty heavily on compilation time (getting compilation times as small as possible) and, in doing so, may choose not to do all possible code transformations that remove unneeded uses of temporaries (or machine registers).
You compiled with optimization disabled! For gcc and LLVM, that means each C statement is compiled independently, so you can modify variables in memory with a debugger, and even jump to a different source line. To support this, the compiler can't optimize between C statements at all, and in fact spills / reloads everything between statements.
So the major flaw in your analysis is that you're looking at an asm implementation of that statement where the inputs and outputs are memory, not registers. This is totally unrealistic: compilers keep most "hot" values in registers inside inner loops, and don't need separate copies of a value just because it's assigned to multiple C variables.
Compilers generally (and LLVM in particular, I think) transform the input program into an SSA (Static Single Assignment) internal representation. This is how they track data flow, not according to C variables. (This is why I said "hot values", not "hot variables". A loop induction variable might be totally optimized away into a pointer-increment / compare against end_pointer in a loop over arr[i++]).
c = ++i; produces one value with 2 references to it (one for c, one for i). The result can stay in a single register. If it doesn't optimize into part of some other operation, the asm implementation could be as simple as inc %ecx, with the compiler just using ecx/rcx everywhere that c or i is read before the next modification of either. If the next modification of c can't be done non-destructively (e.g. with a copy-and-modify like lea (,%rcx,4), %edx or shrx %eax, %ecx, %edx), then a mov instruction to copy the register will be emitted.
d = b++; produces one new value, and makes d a reference to the old value of b. It's syntactic sugar for d=b; b+=1;, and compiles into SSA the same as that would. x86 has a copy-and-add instruction, called lea. The compiler doesn't care which register holds which value (except in loops, especially without unrolling, when the end of the loop has to have values in the right registers to jump to the beginning of the loop). But other than that, the compiler can do lea 1(%rbx), %edx to leave %ebx unmodified and make EDX hold the incremented value.
An additional minor flaw in your test is that with optimization disabled, the compiler is trying to compile quickly, not well, so it doesn't look for all possible peephole optimizations even within the statement that it does allow itself to optimize.
If the value of c or d is never read, then it's the same as if you had never done the assignment in the first place. (In un-optimized code, every value is implicitly read by the memory barrier between statements.)
What determines that c = a++; requires two registers instead of just one (ecx for example)?
The surrounding code, as always. +1 can be optimized into other operations, e.g. done with an LEA as part of a shift and/or add. Or built in to an addressing mode.
Or before/after negation, use the 2's complement identity that -x == ~x+1, and use NOT instead of NEG. (Although often you're adding the negated value to something, so it turns into a SUB instead of NEG + ADD, so there isn't a stand-alone NEG you can turn into a NOT.)
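The identity is easy to check in C. The sketch below uses unsigned 32-bit arithmetic so that wraparound is well defined; the function name is hypothetical.

```c
#include <stdint.h>

/* Two's complement negation written as NOT followed by +1. */
uint32_t negate_via_not(uint32_t x)
{
    return ~x + 1u;   /* same bit pattern as 0 - x */
}
```

For example, negate_via_not(5) yields the same bits as (uint32_t)-5.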
++ prefix or postfix is too simple to look at on its own; you always have to consider where the input comes from (does the incremented value have to end up back in memory right away or eventually?) and how the incremented and original values are used.
Basically, un-optimized code is un-interesting. Look at optimized code for short functions. See Matt Godbolt's talk at CppCon2017: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”, and also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler asm output.

Understanding gcc output for if (a>=3)

I thought since condition is a >= 3, we should use jl (less).
But gcc used jle (less or equal).
It makes no sense to me; why did the compiler do this?
You're getting mixed up by a transformation the compiler made on the way from the C source to the asm implementation. gcc's output implements your function this way:
a = 5;
if (a<=2) goto ret0;
return 1;
ret0:
return 0;
It's all clunky and redundant because you compiled with -O0, so it stores a to memory and then reloads it, so you could modify it with a debugger if you set a breakpoint and still have the code "work".
See also How to remove "noise" from GCC/clang assembly output?
Compilers generally prefer to reduce the magnitude of a comparison constant, so it's more likely to fit in a sign-extended 8-bit immediate instead of needing a 32-bit immediate in the machine code.
We can get some nice compact code by writing a function that takes an arg, so it won't optimize away when we enable optimizations.
int cmp(int a) {
return a>=128; // In C, a boolean converts to int as 0 or 1
}
gcc -O3 on Godbolt, targeting the x86-64 ABI (same as your code):
xorl %eax, %eax # whole RAX = 0
cmpl $127, %edi
setg %al # al = (edi>127) ? 1 : 0
ret
So it transformed a >=128 into a >127 comparison. This saves 3 bytes of machine code, because cmp $127, %edi can use the cmp $imm8, r/m32 encoding (cmp r/m32, imm8 in Intel syntax in Intel's manual), but 128 would have to use cmp $imm32, r/m32.
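To illustrate why the rewritten constant matters, here is a minimal sketch: fits_imm8 (a made-up helper) tests whether a constant fits a sign-extended 8-bit immediate, and the other two functions show that the two comparisons are interchangeable for int.

```c
#include <stdbool.h>
#include <stdint.h>

/* A sign-extended 8-bit immediate covers -128..127. */
bool fits_imm8(int32_t value)
{
    return value >= INT8_MIN && value <= INT8_MAX;
}

/* The comparison as written in the source. */
bool ge_128(int a) { return a >= 128; }

/* The comparison the compiler emitted; equivalent for every int. */
bool gt_127(int a) { return a > 127; }
```

Here 127 fits the short encoding while 128 does not, and ge_128 and gt_127 agree on every input.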
BTW, comparisons and conditions make sense in Intel syntax, but are backwards in AT&T syntax. For example, cmp edi, 127 / jg is taken if edi > 127.
But in AT&T syntax, it's cmp $127, %edi, so you have to mentally reverse the operands or think of a > instead of <
The assembly code is comparing a to two, not three. That's why it uses jle. If a is less than or equal to two it logically follows that a IS NOT greater than or equal to 3, and therefore 0 should be returned.
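A short sketch of that equivalence in C, with made-up function names: the first function is the condition as written, and the second tests the inverted condition against 2 the way the compiled code does.

```c
/* The source-level condition. */
int check_original(int a)
{
    if (a >= 3) {
        return 1;
    }
    return 0;
}

/* The shape of the compiled code: branch away when a <= 2 (jle). */
int check_as_compiled(int a)
{
    if (a <= 2) goto ret0;
    return 1;
ret0:
    return 0;
}
```

For integers, a <= 2 is exactly the negation of a >= 3, so both functions return the same result for every a.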

Which is the most efficient way in C to check that at least one of two integers is zero?

I have code that is doing a lot of these comparison operations. I was wondering which is the most efficient one to use. Is it likely that the compiler will correct it if I intentionally choose the "wrong" one?
int a, b;
// Assign a value to a and b.
// Now check whether either is zero.
// The worst?
if (a * b == 0) // ...
// The best?
if (a & b == 0) // ...
// The most obvious?
if (a == 0 || b == 0) // ...
Other ideas?
In general, if there's a fast way of doing a simple thing, you can assume the compiler will do it that fast way. And remember that the compiler is outputting machine language, not C -- the fastest method probably can't be correctly represented as a set of C constructs.
Also, the third method there is the only one that always works. The first one fails if a and b are 1<<16, and the second you already know doesn't work.
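The failure cases can be shown with a small sketch. The function names are hypothetical; the product uses uint32_t so the wraparound is well defined (with plain int it would be signed overflow, which is undefined behaviour), and the bitwise version is written with the parentheses that C's precedence rules imply for a & b == 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* The only variant that is correct for all inputs. */
bool either_zero(int a, int b)
{
    return a == 0 || b == 0;
}

/* Multiplication wraps modulo 2^32, so nonzero inputs can give 0. */
bool product_is_zero_u32(uint32_t a, uint32_t b)
{
    return a * b == 0;
}

/* How `a & b == 0` actually parses: == binds tighter than &. */
bool and_as_parsed(int a, int b)
{
    return (a & (b == 0)) != 0;
}

/* Even with the intended parentheses, this tests for disjoint bits. */
bool and_parenthesized(int a, int b)
{
    return (a & b) == 0;
}
```

For a = b = 1<<16 the product test reports zero although neither input is zero, and a = 1, b = 2 fools the parenthesized bitwise test in the same way.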
It's possible to see which variant generates fewer assembly instructions, but it's a separate matter to see which one actually executes in less time.
To help you analyze the first matter, learn to use your C compiler's command-line flags to capture its intermediate output. GCC is a common choice for a C compiler. Let's look at its unoptimized assembly code for two different programs.
#include <stdio.h>
void report_either_zero()
{
int a = 1;
int b = 0;
if (a == 0 || b == 0)
{
puts("One of them is zero.");
}
}
Save that text to a file such as zero-test.c, and run the following command:
gcc -S zero-test.c
GCC will emit a file called zero-test.s, which is the assembly code it would normally submit to the assembler as it generates object code.
Let's look at the relevant fragment of the assembly code. I'm using gcc version 4.2.1 on Mac OS X generating x86 64-bit instructions.
_report_either_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $32, %rsp
Ltmp2:
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $1, -20(%rbp) // a = 1
movl $0, -24(%rbp) // b = 0
movl -20(%rbp), %eax // Get ready to compare a.
cmpl $0, %eax // Does zero equal a?
je LBB1_2 // If so, go to label LBB1_2.
movl -24(%rbp), %eax // Otherwise, get ready to compare b.
cmpl $0, %eax // Does zero equal b?
jne LBB1_3 // If not, go to label LBB1_3.
LBB1_2:
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_3:
addq $32, %rsp
popq %rbp
ret
Leh_func_end1:
You can see where we load the integer values 1 and 0 into registers, then prepare to compare the first to zero, and then again with the second if the first is nonzero.
Now let's try a different approach with the comparison, to see how the assembly code changes. Note that this is not the same predicate; this one checks whether both numbers are zero.
#include <stdio.h>
void report_both_zero()
{
int a = 1;
int b = 0;
if (!(a | b))
{
puts("Both of them are zero.");
}
}
The assembly code is a little different:
_report_both_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
movl $1, -4(%rbp) // a = 1
movl $0, -8(%rbp) // b = 0
movl -4(%rbp), %eax // Get ready to operate on a.
movl -8(%rbp), %ecx // Get ready to operate on b too.
orl %ecx, %eax // Combine a and b via bitwise OR.
cmpl $0, %eax // Does zero equal the result?
jne LBB1_2 // If not, go to label LBB1_2.
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_2:
addq $16, %rsp
popq %rbp
ret
Leh_func_end1:
If the first number is zero, the first variant does less work (in terms of the number of assembly instructions involved) by avoiding a second register move. If the first number is not zero, the second variant does less work by avoiding a second comparison to zero.
The question now is whether "move, move, bitwise or, compare" runs faster than "move, compare, move, compare." The answer could come down to things like whether the processor learns to predict how often the first integer is zero, and whether it is zero (or nonzero) consistently.
If you ask the compiler to optimize this code, the example is too simple; the compiler decides at compile time that no comparison is necessary, and just condenses that code to an unconditional request to write the string. It's interesting to change the code to operate on parameters rather than constants, and see how the optimizer handles the situation differently.
Variant one:
#include <stdio.h>
void report_either_zero(int a, int b)
{
if (a == 0 || b == 0)
{
puts("One of them is zero.");
}
}
Variant two (again, a different predicate):
#include <stdio.h>
void report_both_zero(int a, int b)
{
if (!(a | b))
{
puts("Both of them are zero.");
}
}
Generate the optimized assembly code with this command:
gcc -O -S zero-test.c
Let us know what you find.
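Since the two variants test different predicates, it may help to see them side by side as plain functions before comparing their assembly. This is only a sketch; the function names either_zero and both_zero are made up for illustration and do not appear in the code above.

```c
#include <stdbool.h>

/* Variant one: true when at least one argument is zero. */
bool either_zero(int a, int b)
{
    return a == 0 || b == 0;
}

/* Variant two: true only when both arguments are zero, because
   a | b has no bits set only if neither operand has any bits set. */
bool both_zero(int a, int b)
{
    return !(a | b);
}
```

For a = 1 and b = 0, as in the constant-based examples, either_zero returns true while both_zero returns false, so only the first program would print its message.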
This will not likely have much (if any, given modern compiler optimizers) effect on the overall performance of your app. If you really must know, you should write some code to test the performance of each for your compiler. However, as a best guess, I'd say...
if ( !( a && b ) )
This will short-circuit if the first happens to be 0.
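One way to see the short-circuit is to count how many operands actually get evaluated. The sketch below is for illustration only: observe, evaluations, either_zero_counted and last_evaluation_count are hypothetical names, not anything from the question.

```c
#include <stdbool.h>

static int evaluations = 0;  /* hypothetical counter, illustration only */

/* Returns its argument unchanged and records that it was evaluated. */
static int observe(int value)
{
    evaluations++;
    return value;
}

/* Same predicate as !(a && b), with each operand wrapped in observe(). */
bool either_zero_counted(int a, int b)
{
    evaluations = 0;
    return !(observe(a) && observe(b));
}

/* Number of operands evaluated by the most recent call above. */
int last_evaluation_count(void)
{
    return evaluations;
}
```

When a is 0, only one operand is evaluated; when a is nonzero, both are.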
If you want to find whether or not one of two integers is zero using one comparison instruction ...
if ((a << b) == a)
If a is zero, then no amount of shifting it to the left will change its value.
If b is zero, then there is no shifting performed.
Note that the behaviour is undefined if b is negative or not less than the width of a's type, and, for a signed a, if a is negative or the shifted value does not fit in the type.
However, due to the non-intuitiveness, it would be strongly recommended to implement this as a macro (with an appropriate comment).
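A minimal sketch of such a macro follows, restricted to unsigned operands so that the shift is well defined. The names EITHER_ZERO_SHIFT, UNSIGNED_BITS and either_zero_checked are made up here, and the width calculation assumes unsigned int has no padding bits, which holds on common platforms.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical macro name. Only valid when b is smaller than the bit
   width of unsigned int; a larger shift count is undefined behaviour. */
#define EITHER_ZERO_SHIFT(a, b) ((((unsigned)(a)) << (b)) == ((unsigned)(a)))

/* Assumes unsigned int has no padding bits. */
#define UNSIGNED_BITS (sizeof(unsigned) * CHAR_BIT)

/* Wrapper that guards the shift count before using the macro. */
bool either_zero_checked(unsigned a, unsigned b)
{
    if (b >= UNSIGNED_BITS)
        return a == 0;  /* b is certainly nonzero here */
    return EITHER_ZERO_SHIFT(a, b);
}
```

With unsigned arithmetic the shifted value wraps instead of overflowing, and for a shift count between 1 and the width minus 1 it can only equal a when a is 0. The guard costs an extra comparison, which rather defeats the "one comparison" goal.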
Hope this helps.
The most efficient is certainly the most obvious, if by efficiency, you are measuring the programmer's time.
If you are measuring efficiency by the processor's time, profiling your candidate solutions is the best way to answer, at least for the target machine you profiled.
But this exercise demonstrated a pitfall of programmer optimization. The 3 candidates are not functionally equivalent for all int.
If you want a functionally equivalent alternative...
I think the last candidate and a 4th one deserve comparison.
if ((a == 0) || (b == 0))
if ((a == 0) | (b == 0))
Due to the variation of compilers, optimization and CPU branch prediction, one should profile, rather than pontificate, to determine relative performance. OTOH, a good optimizing compiler may give you the same code for both.
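Before profiling, it is cheap to confirm that the two forms really agree. Here is a rough sketch that checks them over a small range; the function names are my own, and count_mismatches expects hi to be less than INT_MAX so the loop counters do not overflow.

```c
#include <stdbool.h>

/* Logical OR: skips the second comparison when a is 0. */
bool either_zero_logical(int a, int b) { return (a == 0) || (b == 0); }

/* Bitwise OR of two 0/1 results: always evaluates both comparisons. */
bool either_zero_bitwise(int a, int b) { return (a == 0) | (b == 0); }

/* Counts pairs (a, b) with lo <= a, b <= hi where the two forms disagree.
   Requires hi < INT_MAX. */
long count_mismatches(int lo, int hi)
{
    long mismatches = 0;
    for (int a = lo; a <= hi; a++)
        for (int b = lo; b <= hi; b++)
            if (either_zero_logical(a, b) != either_zero_bitwise(a, b))
                mismatches++;
    return mismatches;
}
```

Each comparison yields exactly 0 or 1, so the bitwise OR gives the same truth value as the logical OR; the only difference is whether the second comparison is always evaluated.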
I recommend the code that is easiest to maintain.
There's no "most efficient way to do it in C", if by "efficiency" you mean the efficiency of the compiled code.
Firstly, even if we assume that the compiler translates C language operators into their "obvious" machine counterparts (i.e. C multiplication into machine multiplication etc) the efficiency of each method will differ from one hardware platform to the other. Even if we restrict our consideration to a very specific sequence of instructions on a very specific hardware platform, it still can exhibit different performance in different surrounding contexts, depending, for example, on how well the whole thing agrees with the branch prediction heuristic in the given CPU.
Secondly, modern C compilers rarely translate C operators into their "obvious" machine counterparts. Often the instructions used in machine code will have very little in common with the C code. It is possible that many "completely different" methods of performing the check at C level will actually be translated into the same sequence of machine instructions by a smart compiler. At the same time the same C code might get translated into different sequences of machine instructions when the surrounding contexts are different.
In other words, there's no meaningful answer to your question, unless you really really really localize it to a specific hardware platform, specific compiler version and specific set of compilation settings. And that will make it too localized to be useful.
That usually means that in the general case the best way to do it is to write the most readable code. Just do
if (a == 0 || b == 0)
The readability of the code will not only help the human reader to understand it, but will also increase the probability of the compiler properly interpreting your intent and generating the most optimal code.
But if you really have to squeeze the last CPU cycle out of your performance-critical code, you have to try different versions and compare their relative efficiency manually.
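A manual comparison could look roughly like the sketch below, which times a predicate over an array using the standard clock() function. All names here (zero_predicate, pred_logical, pred_bitwise, time_predicate) are hypothetical, and calling through a function pointer may prevent inlining, so the numbers only hint at what the same check would cost inside your real code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef bool (*zero_predicate)(int, int);

bool pred_logical(int a, int b) { return a == 0 || b == 0; }
bool pred_bitwise(int a, int b) { return (a == 0) | (b == 0); }

/* Applies pred to each adjacent pair in data, repeated `rounds` times.
   Stores the number of true results in *hits and returns the elapsed
   CPU time in seconds as reported by clock(). */
double time_predicate(zero_predicate pred, const int *data, size_t n,
                      int rounds, long *hits)
{
    long count = 0;
    clock_t start = clock();
    for (int r = 0; r < rounds; r++)
        for (size_t i = 0; i + 1 < n; i++)
            if (pred(data[i], data[i + 1]))
                count++;
    clock_t end = clock();
    *hits = count;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Run each candidate on the same data with the same number of rounds, check that the hit counts match, and only then compare the times. Use data that resembles your real inputs, since branch prediction depends on how regular the zeros are.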

Resources