Why does ICC produce "inc" instead of "add" in assembly on x86?

Why does ICC produce "inc" instead of "add" in assembly on x86? - c

While fiddling with simple C code, I noticed something strange. Why does ICC produces incl %eax in assembly code generated for increment instead of addl $1, %eax? GCC behaves as expected though, using add.
Example code (-O3 used on both GCC and ICC)
int A, B, C, D, E;
void foo()
{
A = B + 1;
B = 0;
C++;
D++;
D++;
E += 2;
}
Result on ICC
L__routine_start_foo_0:
foo:
movl B(%rip), %eax #5.13
movl D(%rip), %edx #8.9
incl %eax #5.17
movl E(%rip), %ecx #10.9
addl $2, %edx #9.9
addl $2, %ecx #10.9
movl %eax, A(%rip) #5.9
movl $0, B(%rip) #6.9
incl C(%rip) #7.9
movl %edx, D(%rip) #9.9
movl %ecx, E(%rip) #10.9
ret
For example, see here.
As such, I'm wondering - is this an intended feature, a bug or some quirk resulting from some specific setting? If add is (supposedly) better due to flags update or efficiency (which is the conclusion based on the links below) - why does ICC use inc?
Related:
Relative performance of x86 inc vs. add instruction
Is ADD 1 really faster than INC ? x86
GCC doesn't make use of inc
Note:
I'm asking this question explicitly because none of the questions I found or was directed to on SO does explain this behaviour. My previous question concerning this matter got closed because, supposedly, it's trivial and has been answered. I don't find it trivial. I didn't find an answer in all of the links and answers given. It's not another "how to plug my mouse into my PC" problem. All of the questions explain why add is/could be better on new x86 processors or why GCC uses it, but none concerns ICC.
Any insight on ICC design choices would be also very welcome.
PS I don't consider "it does it because it does" a valid answer.

It is not unreasonable to assume at this point that incl was selected as it takes only one byte (0x40) instead of three (0x83 0xc0 0x01).

Related

The assembly of “b++”

In C language,what's the assemble of "b++".
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
the .s file(++i use addl instrction, i++ use other)：
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",#progbits

The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line, short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels).

The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then res=i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand mixing an opcode from once line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using res = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)

Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.

It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0 set up basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.

Converting x86 to Y86

I'm trying to figure out to convert this x86 assembly code to Y86 form:
Given the c program:
int sum(int x) {
if (x == 0 || x ==1) {
return 1;
} else {
return x + sum(x-1);
}
}
The following x86-64 assembly code is generated:
sum:
cmpl $1, %rdi
ja .L8
movl $1, %eax
ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call sum
addl %ebx, %eax
popq %rbx
ret
How can I convert this to Y86-64 assembly code that does the same thing?
Thank you!

In this case, you can convert by replacing each instruction with a short sequence of y86 instructions which does exactly the same thing.
y86 is Turing complete, but very crippled, so in general you can't always easily convert. Some single x86 instructions might need an entire loop or very long function to implement, but that's not the case for any of your instructions. Each of them can be transliterated to one or a few y86 instructions. (Some might need a scratch register; I forget if y86 has compare with immediate or only mov-immediate to register.)
Your code doesn't have any multiplies, shifts, or bsf, or floating-point, or anything else that y86 doesn't have (and would need a loop to emulate).
Look up each x86 instruction in the instruction-set reference manual (like this online version, or this older one where not having AVX/AVX2 instructions means less to wade through. See also the x86 tag wiki for links to Intel and AMD's PDF manuals.) Look at the Operation section where pseudo-code describes the exact effect of the instruction on the architectural state. That's the behaviour you want to implement using y86 instructions.
As an example, I forget if y86 has push / pop, but if not you can always manipulate rsp directly and load/store. e.g. sub $8, %rsp ; movrm %rbx, (rsp) is push (except it clobbers flags where x86's push doesn't).

Preprocessor Definitions VS Local Variables, Speed Difference

I just compiled the following C code to test out the gcc optimizer (using the -O3 flag), expecting that both functions would end up generating the same set of assembly instructions:
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
return x*x+x*y+y;
#undef x
#undef y
}
int test2(int a, int b)
{
int x = a*a*a+b;
int y = a*b*a+3*b;
return x*x+x*y+y;
}
But I was surprised to find that they generated slightly different assembly, and that the execution time for test1 (the code using the preprocessor instead of local variables) was a bit faster.
I've heard people say that the compiler can optimize better than humans can, and that you should tell it exactly what you want it to do; man I guess they weren't kidding. I thought the compiler was supposed to kind of guess at the programmer's intended use of local variables and replace their use if necessary... is that a false assumption?
When writing code for performance, are you better off using preprocessor definitions for the sake of readability rather than local variables? I know it looks ugly as hell, but apparently it actually makes a difference, unless I'm missing something.
Here's the assembly I got, using "gcc test.c -O3 -S". My gcc version is 4.8.2; it looks like the assembly output is the same for most versions of gcc, but not on 4.7 or 4.8 versions for some reason
test1
movl %edi, %eax
movl %edi, %edx
leal (%rsi,%rsi,2), %ecx
imull %edi, %eax
imull %esi, %edx
imull %edi, %eax
imull %edi, %edx
addl %esi, %eax
addl %ecx, %edx
leal (%rax,%rdx), %ecx
imull %ecx, %eax
addl %edx, %eax
ret
test2
movl %edi, %eax
leal (%rsi,%rsi,2), %edx
imull %edi, %eax
imull %edi, %eax
leal (%rax,%rsi), %ecx
movl %edi, %eax
imull %esi, %eax
imull %edi, %eax
addl %eax, %edx
leal (%rcx,%rdx), %eax
imull %ecx, %eax
addl %edx, %eax
ret

Trying your code at godbolt I get identical assembly for both functions with GCC, even with -O setting. Only by omitting -O flag I get different results. And this really is expected because the code is trivial to optimize.
Here is generated assembly using gcc 4.4.7 with -O flag. As you can see they are identical.
test1(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
test2(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret

The answer is twofold:
Your statement about identical results is a misconception
I cannot reproduce your results "test1 faster than test2".
Preprocessor misconception
The results should not be identical. The preprocessor acts on (transforms) the source before it is actually compiled by the compiler with whatever options.
You can inspect the result of the preprocessor by running gcc -E main.c for example, assuming you are using a GNU compiler and your sources above are stored in a file main.c. The relevant parts become:
int test1(int a, int b)
{
return (a*a*a+b)*(a*a*a+b)+(a*a*a+b)*(a*b*a+3*b)+(a*b*a+3*b);
}
int test2(int a, int b)
{
int x = a*a*a+b;
int y = a*b*a+3*b;
return x*x+x*y+y;
}
Obviously, the first version uses roughly two times more mathematical operations than the second one. Then the compiler and its optimiser come into play …
(NB: Ideally you could analyse the number of CPU cycles generated by the assembler code. Use e.g. gcc -S main.c and look at main.s; you probably know that. Version 2 should "win" in that case.)
Runtime testing and optimising
In order to compare our results, you should post your test code. When testing you need to average out short term fluctuations and time granularity limits of your CPU. Hence you are likely to run in loops over the same code.
int i=100000000;
while (--i>0) {
int r;
r = test1(3, 4);
}
Without optimiser, test1 runs clearly about 20% slower than test2.
However, the optimiser will analyse also the calling code and can optimise away the multiple call with identical arguments or calls with unused variables (r in this case).
Therefore you must fool the compiler to effectively make the calls, alike
int r = 0;
while (--i>0) {
r += test1(3, i);
}
When I tried that, I get identical runtimes with a percent level precision. I.e. sometimes time1 is faster, sometimes time2 is faster, when I repeat the comparison several times.
You should look into the optimiser documentation to understand which optimising options you need to outsmart in your tests.
And I confirm what #Ville Krumlinde states: I get identical code for the assembly output, even with -O level optimisation (gcc 4.4.7 on my desktop). The code only contains 9 operations in assembler, which makes me believe that the optimiser "knows" enough about algebraic optimisation to simplify your formulas.
So you may just be taken by a fake optimiser effect of your test frame after all.

Which operator is faster (> or >=), (< or <=)? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Is < cheaper (faster) than <=, and similarly, is > cheaper (faster) than >=?
Disclaimer: I know I could measure but that will be on my machine only and I am not sure if the answer could be "implementation specific" or something like that.

TL;DR
There appears to be little-to-no difference between the four operators, as they all perform in about the same time for me (may be different on different systems!). So, when in doubt, just use the operator that makes the most sense for the situation (especially when messing with C++).
So, without further ado, here is the long explanation:
Assuming integer comparison:
As far as assembly generated, the results are platform dependent. On my computer (Apple LLVM Compiler 4.0, x86_64), the results (generated assembly is as follows):
a < b (uses 'setl'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setl %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a <= b (uses 'setle'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a > b (uses 'setg'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setg %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a >= b (uses 'setge'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
Which isn't really telling me much. So, we skip to a benchmark:
And ladies & gentlemen, the results are in, I created the following test program (I am aware that 'clock' isn't the best way to calculate results like this, but it'll have to do for now).
#include <time.h>
#include <stdio.h>
#define ITERS 100000000
int v = 0;
void testL()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i < v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testLE()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++)
{
v = i <= v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testG()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i > v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testGE()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i >= v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
int main()
{
testL();
testLE();
testG();
testGE();
}
Which, on my machine (compiled with -O0), gives me this (5 separate runs):
testL: 337848
testLE: 338237
testG: 337888
testGE: 337787
testL: 337768
testLE: 338110
testG: 337406
testGE: 337926
testL: 338958
testLE: 338948
testG: 337705
testGE: 337829
testL: 339805
testLE: 339634
testG: 337413
testGE: 337900
testL: 340490
testLE: 339030
testG: 337298
testGE: 337593
I would argue that the differences between these operators are minor at best, and don't hold much weight in a modern computing world.

it varies, first start at examining different instruction sets and how how the compilers use those instruction sets. Take the openrisc 32 for example, which is clearly mips inspired but does conditionals differently. For the or32 there are compare and set flag instructions, compare these two registers if less than or equal unsigned then set the flag, compare these two registers if equal set the flag. Then there are two conditional branch instructions branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less, than, less than or equal, greater than, etc are all going to use the same number of instructions, same execution time for a conditional branch and same execution time for not doing the conditional branch.
Now it is definitely going to be true for most architectures that performing the branch takes longer than not performing the branch because of having to flush and re-fill the pipe. Some do branch prediction, etc to help with that problem.
Now some architectures the size of the instruction may vary, compare gpr0 and gpr1 vs compare gpr0 and the immediate number 1234, may require a larger instruction, you will see this a lot with x86 for example. so although both cases may be a branch if less than how you encode the less depending on what registers happen to hold what values can make a performance difference (sure x86 does a lot of pipelining, lots of caching, etc to make up for these issues). Another similar example is mips and or32, where r0 is always a zero, it is not really a general purpose register, if you write to it it doesnt change, it is hardwired to a zero, so a compare if equal to 0 MIGHT cost you more than a compare if equal to some other number if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen, worst case is having to evict a register to the stack or memory, to free up the register to put the immediate in there so that the compare can happen.
Some architectures have conditional execution like arm, for the full arm (not thumb) instructions you can on a per instruction basis execute, so if you had code
if(i==7) j=5; else j=9;
the pseudo code for arm would be
cmp i,#7
moveq j,#5
movne j,#7
there is no actual branch, so no pipeline issues you flywheel right on through, very fast.
One architecture to another if that is an interesting comparison some as mentioned, mips, or32, you have to specifically perform some sort of instruction for the comparision, others like x86, msp430 and the vast majority each alu operation changes the flags, arm and the like change flags if you tell it to change flags otherwise dont as shown above. so a
while(--len)
{
//do something
}
loop the subtract of 1 also sets the flags, if the stuff in the loop was simple enough you could make the whole thing conditional, so you save on separate compare and branch instructions and you save in the pipeline penalty. Mips solves this a little by compare and branch are one instruction, and they execute one instruction after the branch to save a little in the pipe.
The general answer is that you will not see a difference, the number of instructions, execuition time, etc are the same for the various conditionals. special cases like small immediates vs big immediates, etc may have an effect for corner cases, or the compiler may simply choose to do it all differently depending on what comparison you do. If you try to re-write your algorithm to have it give the same answer but use a less than instead of a greater than and equal, you could be changing the code enough to get a different instruction stream. Likewise if you perform too simple of a performance test, the compiler can/will optimize out the comparison complete and just generate the results, which could vary depending on your test code causing different execution. The key to all of this is disassemble the things you want to compare and see how the instructions differ. That will tell you if you should expect to see any execution differences.

stack layout for process

why is the output 44 instead of 12, &x,eip,ebp,&j should be the stack layout for the code
given below i guess, if so then it must have 12 as output, but i am getting it 44?
so help me understand the concept,altough the base pointer is relocated for every instant of execution, the relative must remain unchanged and should not it be 12?
int main() {
int x = 9;
// printf("%p is the address of x\n",&x);
fun(&x );
printf("%x is the address of x\n", (&x));
x = 3;
printf("%d x is \n",x);
return 0;
}
int fun(unsigned int *ptr) {
int j;
printf("the difference is %u\n",((unsigned int)ptr -(unsigned int) &j));
printf("the address of j is %x\n",&j);
return 0;
}

You're assuming that the compiler has packed everything (instead of putting things on specific alignment boundaries), and you're also assuming that the compiler hasn't inlined the function, or made any other optimisations or transformations.
In summary, you cannot make any assumptions about this sort of thing, or rely on any particular behaviour.

No, it should not be 12. It should not be anything. The ISO standard has very little to say about how things are arranged on the stack. An implementation has a great deal of leeway in moving things around and inserting padding for efficiency.
If you pass it through your compiler with an option to generate he assembler code (such as with gcc -S), it will become evident as to what's happening here.
fun:
pushl %ebp
movl %esp, %ebp
subl $40, %esp ;; this is quite large (see below).
movl 8(%ebp), %edx
leal -12(%ebp), %eax
movl %edx, %ecx
subl %eax, %ecx
movl %ecx, %eax
movl %eax, 4(%esp)
movl $.LC2, (%esp)
call printf
leal -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC3, (%esp)
call printf
movl $0, %eax
leave
ret
It appears that gcc is pre-generating the next stack frame (the one going down to printf) as part of the prolog for the fun function. That's in this case. Your compiler may be doing something totally different. But the bottom line is: the implementation can do what it likes as long as it doesn't violate the standard.
That's the code from optimisation level 0, by the way, and it gives a difference of 48. When I use gcc's insane optimisation level 3, I get a difference of 4. Again, perfectly acceptable, gcc usually does some pretty impressive optimisations at that level.