Understanding gcc output for if (a>=3)

I thought that since the condition is a >= 3, it should use jl (less).
But gcc used jle (less or equal).
It makes no sense to me; why did the compiler do this?

You're getting mixed up by a transformation the compiler made on the way from the C source to the asm implementation. gcc's output implements your function this way:
a = 5;
if (a<=2) goto ret0;
return 1;
ret0:
return 0;
It's all clunky and redundant because you compiled with -O0: a gets stored to memory and then reloaded, so you could modify it with a debugger if you set a breakpoint and still have the code "work".
See also How to remove "noise" from GCC/clang assembly output?
Compilers generally prefer to reduce the magnitude of a comparison constant, so it's more likely to fit in a sign-extended 8-bit immediate instead of needing a 32-bit immediate in the machine code.
We can get some nice compact code by writing a function that takes an arg, so it won't optimize away when we enable optimizations.
int cmp(int a) {
return a>=128; // In C, a boolean converts to int as 0 or 1
}
gcc -O3 on Godbolt, targeting the x86-64 ABI (same as your code):
xorl %eax, %eax # whole RAX = 0
cmpl $127, %edi
setg %al # al = (edi>127) ? 1 : 0
ret
So it transformed an a >= 128 test into an a > 127 comparison. This saves 3 bytes of machine code, because cmp $127, %edi can use the cmp $imm8, r/m32 encoding (cmp r/m32, imm8 in Intel syntax in Intel's manual), but 128 would have to use the cmp $imm32, r/m32 encoding.
BTW, comparisons and conditions read naturally in Intel syntax, but are backwards in AT&T syntax. For example, cmp edi, 127 / jg is taken if edi > 127.
But in AT&T syntax it's written cmp $127, %edi, so you have to mentally reverse the operands, or read the predicate the other way around (< instead of >).

The assembly code is comparing a to 2, not 3. That's why it uses jle: if a is less than or equal to 2, it logically follows that a IS NOT greater than or equal to 3, and therefore 0 should be returned.

Related

How do I translate an optimized x86-64 asm loop back to a C for loop?

I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
long i, result = 0;
for (i= ; ; ){
//??
}
return result;
}
I don't know what cmpq %rsi, %rdi means.
Why isn't there another register, like %eax, for long i?
I would love some help in figuring this out. I don't know what I'm missing - I've been going through my notes, textbook, and the rest of the internet and I am stuck. It's a review question, and I've been at it for hours.
Assuming this is a function taking 2 parameters and using the gcc amd64 calling convention, the two parameters are passed in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
//movl $0, %eax
long result = 0; /* rax */
//cmpq %rsi, %rdi
//jle .L2
if (x > y) {
do {
//addq %rdi, %rax
result += x;
//subq $1, %rdi
--x;
//cmp %rdi, rsi
//jl .L3
} while (x > y);
}
return result;
}
I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (@BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify its input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or i<y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
long i, result = 0;
for (i= x ; i>y ; i-- ){
result += i;
}
return result;
}
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
long i, result = 0;
for (i= y ; x>i ; x-- ){
result += x;
}
return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

Assembly: CMOVB instruction in Intel x86-64 assembly

I'm a little confused about what "cmovb" does in this assembly code
leal (%rsi, %rsi), %eax // %eax <- %rsi + %rsi
cmpl %esi, %edi // compare %edi and %esi
cmovb %edi, %eax
ret
and the C code for this is:
int foo(unsigned int a, unsigned int b)
{
if(a < b)
return a;
else
return 2*b;
}
Can anyone help me understand how cmovb works here?
Like Jester commented to the question, the cmov* family of instructions are conditional moves, paired via the flags register with a previous (comparison) operation.
You can use for example the Intel documentation as a reference for the x86-64/AMD64 instruction set. The conditional move instructions are shown on page 172 of the combined volume.
cmovb, cmovnae, and cmovc all perform the same way: If the carry flag is set, they move the source operand to the destination operand. Otherwise they do nothing.
If we then look at the preceding instructions that affect flags, we'll see that the cmp instruction (the l suffix is part of AT&T syntax, and means the arguments are "longs") changes the set of flags depending on the difference between the two arguments. In particular, if the second is smaller than the first (in AT&T syntax), the carry flag is set, otherwise the carry flag is cleared; just as if a subtraction was performed without storing the result anywhere. (The cmp instruction affects other flags as well, but they are ignored by the code.)
CMOVB = Conditional MOVe if Below (Carry Flag set). It literally does what it says: if the condition is met, then move. Here the condition is a < b (an unsigned compare), and the value moved is a.
The ABI returns the value in %eax, so the code first computes 2*b there and then conditionally overwrites it with a.
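As a rough sketch (the function name umin is mine, not from the question), the same select-on-unsigned-compare pattern in C typically compiles to this kind of cmp + cmovb sequence at -O1 and above:
unsigned int umin(unsigned int a, unsigned int b)
{
    return (a < b) ? a : b;   /* gcc/clang usually emit cmp + a conditional move for this ternary */
}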

Coaxing GCC to emit REPE CMPSB

How to coax the GCC compiler to emit the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library and compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler compiles a C code in a certain way, but I'd like to coax it anyway for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>
int string_equal(const char *s) {
return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
.string "test1"
string_equal:
mov rsi, rdi
mov ecx, 6
mov edi, OFFSET FLAT:.LC0
repz cmpsb
setne al
movzx eax, al
ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
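For contrast, here's a minimal non-booleanized example (my own snippet, not from the original): returning strcmp's result directly makes gcc versions that inline this pattern materialize the full 3-way value with that seta / setb / sub / movzx sequence:
#include <string.h>

int string_cmp_3way(const char *s) {
    return strcmp(s, "test1");   /* negative / zero / positive, not just 0 or 1 */
}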
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
return 0 == memcmp(s, "test1", 6);
}
cmp DWORD PTR [rdi], 1953719668 # 0x74736574
je .L8
.L5:
mov eax, 1
xor eax, 1 # missed optimization here after the memcmp pattern; should just xor eax,eax
ret
.L8:
xor eax, eax
cmp WORD PTR [rdi+4], 49 # check last 2 bytes
jne .L5
xor eax, 1
ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work. No change in asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.

x86-64 Assembly "cmovge" to C code

While I shouldn't list out the entire 4-line sample I'm given (since this is a homework question), I'm confused about how this should be read and translated into C.
cmovge %edi, %eax
What I understand so far is that the instruction is a conditional move for when the result is >=. It's comparing the first parameter of a function %edi to the integer register %eax (which was assigned the other parameter value %esi in the previous line of assembly code). However, I don't understand its result.
My problem is interpreting the optimized code. It doesn't manipulate the stack, and I'm not sure how to write this in C (or at least the gcc switch I could even use to generate the same result when compiling).
Could someone please give a few small examples of how the cmovge instruction might translate into C code? If it doesn't make sense as its own line of code, feel free to make something up with it.
This is in x86-64 assembly through a virtualized Linux operating system (CentOS 7).
I'm probably giving you the whole solution here:
int
doit(int a, int b) {
return a >= b ? a : b;
}
With gcc -O3 -masm=intel becomes:
doit:
.LFB0:
.cfi_startproc
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
.cfi_endproc

Why is adding a superfluous mask and bitshift more optimizable?

While writing an integer-to-hex-string function, I noticed that I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold).
char *i2s(int n){
static char buf[(sizeof(int)<<1)+1]={0};
int i=0;
while(i<(sizeof(int)<<1)+1){ /* mask the ith hex, shift it to lsb */
// buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
}
for(i=0;buf[i++]=='0';)
/*find first non-zero*/;
return (char *)buf+i;
}
With the extra bit shift and mask and compiled with gcc -S -O3, the loops unroll and it reduces to:
movb $48, buf.1247
xorl %eax, %eax
movb $48, buf.1247+1
movb $48, buf.1247+2
movb $48, buf.1247+3
movb $48, buf.1247+4
movb $48, buf.1247+5
movb $48, buf.1247+6
movb $48, buf.1247+7
movb $48, buf.1247+8
.p2align 4,,7
.p2align 3
.L26:
movzbl buf.1247(%eax), %edx
addl $1, %eax
cmpb $48, %dl
je .L26
addl $buf.1247, %eax
ret
Which is what I expected for 32-bit x86 (64-bit should be similar, but with twice as many movb-like ops); however, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.
Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so right-shifting with >> pads with 1s rather than 0s if the sign bit is set.)
It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.
gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.
clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.
Obviously this means your function is buggy! And that's not the only bug.
off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1], but actually you need to round up when dividing by 4.)
clang and gcc both warn about the operator precedence of - vs. << here (binary - binds tighter than <<), but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i] = '0' + (0x0f & (n >> (4*i)));
That compiles to pretty clunky code, though. gcc probably does better with @Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.
It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).
I put a bugfixed version on godbolt, along with your original.
Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)
gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!
i2s_orig:
mov BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406, 48
cmp BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406+1, 48
mov BYTE PTR buf.1406+2, 48
mov BYTE PTR buf.1406+4, 48
mov BYTE PTR buf.1406+5, 48
mov BYTE PTR buf.1406+6, 48
mov BYTE PTR buf.1406+7, 48
mov BYTE PTR buf.1406+8, 48
mov BYTE PTR buf.1406+9, 0
jne .L7 # testing flags from the compare earlier
jne .L8
jne .L9
jne .L10
jne .L11
sete al
movzx eax, al
add eax, 8
.L3:
add eax, OFFSET FLAT:buf.1406
ret
.L7:
mov eax, 3
jmp .L3
... more of the same, setting eax to 4, or 5, etc.
Putting multiple jne instructions in a row is obviously useless.
I think it has to do with the fact that in the longer version, you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimise based on that fact.
Regarding the right-shifting, C++ can do the equivalent of both of Java's operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" will depend on whether x is signed or unsigned. If x is signed, then shift-right-arithmetic (SRA, sign-extending) is used, and if x is unsigned, then shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers.
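A small sketch of that point (my own example; right-shifting a negative signed value is implementation-defined in C and C++, though GCC on x86 uses an arithmetic shift):
int sra_shift(int x)          { return x >> 4; }                 /* SRA on gcc/x86: copies of the sign bit shift in */
unsigned int srl_shift(int x) { return (unsigned int)x >> 4; }   /* SRL, like Java's >>>: zeros shift in */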
Unrolling loops is no longer a good idea because: 1) Newer processors come with a micro-op buffer which will often speed up small loops, 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.
The algorithm doesn't have to be that complicated. Also, your algorithm has a problem: it returns '0' for multiples of 16, and for 0 itself it returns an empty string.
Below is a rewrite of the algo which is branch-free except for the loop exit check (or completely branch-free if the compiler decides to unroll it). It is faster, generates shorter code, and fixes the multiple-of-16 bug.
Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algo: they only take 1 clock cycle each, and the CPU is able to execute 3 or 4 of them in the same clock cycle.
const char* i2s_brcfree(int n)
{
static char buf[ sizeof(n)*2+1] = {0};
unsigned int nibble_shifter = n;
for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
char digit = '0' + curr_nibble;
// "promote" to hex if nibble is over 9,
// conditionally adding the difference between ('0'+nibble) and 'A'
enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
*p = digit;
}
return buf;
}
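A quick usage check (my own test snippet, assuming the function above is in the same file and int is the typical 32 bits): note that, unlike the original, this version returns the full fixed-width string with leading zeros.
#include <stdio.h>

int main(void) {
    printf("%s\n", i2s_brcfree(0x1A2B));   /* prints "00001A2B" for a 32-bit int */
    return 0;
}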
Edit: C++ leaves the result of right-shifting negative numbers implementation-defined. I only know that GCC and Visual Studio use an arithmetic (sign-extending) shift on the x86 architecture.

Resources