Why does gcc add this movss instruction only with _mm_set_ss? - c

Consider these two functions using SSE:
#include <xmmintrin.h>
int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
Both are exactly the same in behaviour for any input. But the assembler output is different:
ftrunc1:
pushl %ebp
movl %esp, %ebp
cvttss2si 8(%ebp), %eax
leave
ret
ftrunc2:
pushl %ebp
movl %esp, %ebp
movss 8(%ebp), %xmm0
cvttss2si %xmm0, %eax
leave
ret
That is, ftrunc2 uses one extra movss instruction!
Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?
Compiler used was GCC 4.5.2 with -O3 -msse.

_mm_set_ss maps directly to an assembly instruction (movss). But _mm_set1_ps does not.
From what I've seen on GCC, MSVC, and ICC:
SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is" - a black box. So the compiler will only perform optimizations that apply to the entire instruction itself, and it will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.
The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.
When you put it all together, the second example leaves the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)
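For comparison, here's a minimal sketch (hypothetical function name): a plain C cast also truncates toward zero, just like cvttss2si, so you can let the compiler pick the instruction itself, and with SSE enabled GCC typically emits a single cvttss2si with no extra movss:
int ftrunc3(float f) {
    return (int)f;   /* C float->int conversion truncates toward zero, like cvttss2si */
}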

You're running into a quirk of the peephole optimizer. For some reason, in the first case it figures out that it can fold the mov into the cvttss2si, and in the second case it fails. The question is, does it matter? The extra move instruction is almost free -- it takes up an extra 4 bytes in the instruction stream and an extra decode slot, but both sequences require the same number of execution slots and the same number of load/store slots (which is what usually matters). The only potential sticking point is the 4 extra bytes of ifetch -- but since ftrunc1 uses 10 bytes and ftrunc2 uses 14, both will fit in a single cache line, so you won't see any difference.
For minimizing that overhead, I'd be far more concerned about the unneeded %ebp cruft (are you compiling with -fno-omit-frame-pointer? -- I thought -O3 included -fomit-frame-pointer by default). You'll do even better by inlining this function, which will likely completely change what the peephole optimizer sees, and so may make it work better in either case (or even reverse the cases where it works better) -- there's no way to tell without compiling larger programs and looking at the assembly code.
Bottom line, there's unlikely to be any measurable speed difference between the two...

Related

"movl" instruction in assembly

I am learning assembly and reading "Computer Systems: A Programmer's Perspective". In Practice Problem 3.3, it says movl %eax,%rdx will generate an error. The answer key says movl %eax,%dx -- "Destination operand incorrect size". I am not sure if this is a typo or not, but my question is: is movl %eax,%rdx a legal instruction? I think it would move the 32 bits in %eax with zero extension to %rdx, which would not be generated as movzql since
"an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros" (from the book).
I tried to write some C code to generate it, but I always get movslq %eax, %rdx (GCC 4.8.5 -Og). I am completely confused.
The GNU assembler doesn't accept movl %eax,%rdx. It also doesn't make sense for the encoding, since mov must have a single operand size (using a prefix byte if needed), not two different sized operands.
The effect you want is achieved by movl %eax, %edx since writes to a 32-bit register always zero-extend into the corresponding 64-bit register. See Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.
movzlq %eax, %rdx might make logical sense, but it's not supported since it would be redundant.
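For example, a minimal sketch (hypothetical function name) of getting that zero-extension from C: assigning an unsigned int to a wider unsigned type typically compiles (with optimization) to nothing more than a plain mov %edi, %eax, because the 32-bit write already zeroes the upper half of %rax.
unsigned long zext(unsigned int x) {
    return x;   /* writing a 32-bit register implicitly zero-extends into the full 64-bit register */
}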

Could you use C inline assembly to align instructions? (without Compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of a given code but we must not use compiler optimizations to achieve it.
One of the ideas I had from reading the bibliography is to align the beginning of a basic block to the cache line size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some tricks like _mm_malloc to achieve data alignment, but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment1), but does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop. (The target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm and especially never use GNU C Basic asm("foo"); inside a function. But in debug mode (the -O0 default, aka optimizations disabled), each statement (including asm();) compiles to a separate block of asm in source order, so your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst, with optimization enabled, the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary, and bloated code size.)
void foo(register int *p)
{
    // always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
    asm(" .p2align 5 # from inline asm");
    for (register int *endp = p + 102400; p < endp; p++) {
        *p += 123;
    }
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so it can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load / store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell for example, maybe 1.25 cycles per iteration, or maybe a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in term of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
    *p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
    *p += 123;
    p++;
} while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if (n != 0) {
    register int *endp = p + n;
    asm (".p2align 4");
    do {
        ...
    } while(p != endp);
}
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
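For what that could look like, here's a minimal sketch (hypothetical function name, and again: don't actually do this) of the *p += 123 loop written as GNU C extended asm with input/output constraints, keeping the do{}while structure so the alignment directive lands right at the top of the loop. It uses AT&T syntax, so it won't assemble together with -masm=intel.
void add123_asm(int *p, int *endp)
{
    if (p >= endp)
        return;                    /* the asm below is a do{}while loop */
    asm volatile(
        ".p2align 5        \n\t"
        "0:                \n\t"
        "addl $123, (%[p]) \n\t"   /* memory-destination add */
        "addq $4, %[p]     \n\t"   /* p++ */
        "cmpq %[end], %[p] \n\t"
        "jb 0b             \n\t"   /* loop while p < endp */
        : [p] "+r" (p)
        : [end] "r" (endp)
        : "memory", "cc");
}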
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing. (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops, so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops), is when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction-length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops). And Sandybridge-family added a uop cache to cache large loops that include multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/ especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only; otherwise it's meaningless, see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See
C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not yet in C.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
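A tiny self-contained version of that (hypothetical function name), suitable for compiling at -O0 and timing:
void spin(void) {
    for (register int i = 1000000; --i; )
        ;   /* empty body: only the decrement-and-branch stays in the loop */
}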
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
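A minimal sketch (hypothetical function name) of what that looks like:
#include <stdint.h>

void add_one(int *arr, intptr_t n) {
    for (register intptr_t i = 0; i < n; i++)
        arr[i] += 1;   /* pointer-width index: no sign-extension needed for the addressing mode */
}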
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

GCC inserts a PXOR when casting int to float [duplicate]

I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:
.text
test:
#xorps %xmm0, %xmm0
cvtsi2ss %edi, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
retq
.global test
This function follows GCC/Clang's x86-64 calling convention for the function declaration extern "C" float test(int);. Note the commented-out xorps instruction: uncommenting it dramatically improves the performance of the function. Testing on my machine with an i7-8700K, Google Benchmark shows the function without the xorps instruction takes 8.54 ns (CPU), while the function with the xorps instruction takes 1.48 ns. I've tested this on multiple computers with various OSes, processors, processor generations, and different processor manufacturers (Intel and AMD), and they all exhibit a similar performance difference.
Repeating the addss instruction makes the slowdown more pronounced (to a point), and the slowdown still occurs using other instructions here (e.g. mulss), or even a mix of instructions, so long as they all depend on the value in %xmm0 in some way. It's worth pointing out that the xorps only helps if it runs on each function call: sampling the performance with a loop (as Google Benchmark does) with the xorps call outside the loop still shows the slower performance.
Since this is a case where exclusively adding instructions improves performance, this appears to be caused by something really low-level in the CPU. Since it occurs across a wide variety of CPU's, it seems like this must be intentional. However, I couldn't find any documentation that explains why this happens. Does anybody have an explanation for what's going on here? The issue seems to be dependent on complicated factors, as the slowdown I saw in my original code only occurred on a specific optimization level (-O2, sometimes -O1, but not -Os), without inlining, and using a specific compiler (Clang, but not GCC).
cvtsi2ss %edi, %xmm0 merges the float into the low element of XMM0 so it has a false dependency on the old value. (Across repeated calls to the same function, creating one long loop-carried dependency chain.)
xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).
Your CPU is a Skylake derivative so those are the numbers; earlier Intel have 3 cycle latency, 1 cycle throughput using a dedicated FP-add execution unit instead of running it on the FMA units. https://agner.org/optimize/. Probably function call/ret overhead prevents you from seeing the full 8x expected speedup from the latency * bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should get that speedup if you remove xorps dep-breaking from a loop within a single function.
GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.
Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function. (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when looping inside one function, instead of via multiple calls to the same function.
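For example, a hedged sketch (hypothetical function name) of the "looping inside one function" case: here the compiler, not the caller, owns the registers across iterations, so it can insert its own dep-breaking xor-zeroing before the cvtsi2ss instead of leaving a loop-carried false dependency to chance.
float sum_converted(const int *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += (float)a[i];   /* scalar int->float conversion (cvtsi2ss) each iteration */
    return s;
}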
Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for more about how OoO exec can overlap short to medium length independent dep chains. Also related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) is about unrolling a dot-product with multiple accumulators to hide FMA latency.
https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.
You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).
As I mentioned in Why does the latency of the sqrtsd instruction change based on the input? Intel processors
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency.) AVX finally lets us avoid this with vsqrtsd %src,%src, %dst, at least for register sources if not memory, and likewise vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)
If cvtsi2ss/sd had zeroed the upper elements of registers we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the int->fp scalar conversion is 2 uops and the vector strategy is 1+1. But not for double, where the int->fp packed conversion costs a shuffle + FP uop.)
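A minimal sketch of that movd + packed-conversion strategy with intrinsics (hypothetical function name): _mm_cvtsi32_si128 is the zero-extending movd, and _mm_cvtepi32_ps is the packed cvtdq2ps, so no element of the destination merges with stale data.
#include <emmintrin.h>   /* SSE2 */

float int_to_float_nodep(int x) {
    return _mm_cvtss_f32(_mm_cvtepi32_ps(_mm_cvtsi32_si128(x)));
}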
This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend to the full 64-bit register instead of leaving it unmodified (aka merging). Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register? (writing 8 and 16-bit registers do cause false dependencies on AMD CPUs, and Intel since Haswell).

What does an lea instruction after a ret instruction mean?

I found x86 lea instructions in an executable file made using clang and gcc.
The lea instructions are after the ret instruction as shown below.
0x???????? <func>
...
pop %ebx
pop %ebp
ret
lea 0x0(%esi,%eiz,1),%esi
lea 0x0(%edi,%eiz,1),%edi
0x???????? <next_func>
...
What are these lea instructions used for? There is no jmp instruction to the lea instructions.
My environment is Ubuntu 12.04 32-bit and gcc 4.6.3.
It's probably not anything--it's just padding to let the next function start at an address that's probably a multiple of at least 8 (and quite possibly 16).
Depending on the rest of the code, it's possible that it's actually a table. Some implementations of a switch statement, for example, use a constant table that's often stored in the code segment (even though, strictly speaking, it's more like data than code).
The first is a lot more likely though. As an aside, such space is often filled with 0xCC instead. That's the single-byte debug-break instruction (int3), so if some undefined behavior results in attempting to execute that code, it immediately stops execution and breaks to the debugger (if available).

Why does gcc use movl instead of push to pass function args?

Consider this code:
#include <stdio.h>
void a(int a, int b, int c)
{
    char buffer1[5];
    char buffer2[10];
}

int main()
{
    a(1,2,3);
}
After that:
gcc -S a.c
That command shows our source code as assembly. Now we can see that in the main function, we never use a "push" instruction to push the arguments of the a function onto the stack; it uses "movl" instead:
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
call a
leave
Why does this happen?
What's the difference between them?
Here is what the gcc manual has to say about it:
-mpush-args
-mno-push-args
Use PUSH operations to store outgoing parameters. This method is shorter and usually
equally fast as method using SUB/MOV operations and is enabled by default.
In some cases disabling it may improve performance because of improved scheduling
and reduced dependencies.
-maccumulate-outgoing-args
If enabled, the maximum amount of space required for outgoing arguments will be
computed in the function prologue. This is faster on most modern CPUs because of
reduced dependencies, improved scheduling and reduced stack usage when preferred
stack boundary is not equal to 2. The drawback is a notable increase in code size.
This switch implies -mno-push-args.
Apparently -maccumulate-outgoing-args is enabled by default, overriding -mpush-args. Explicitly compiling with -mno-accumulate-outgoing-args does revert to the PUSH method, here.
2019 update: modern CPUs have had efficient push/pop since about Pentium M.
-mno-accumulate-outgoing-args (and using push) eventually became the default for -mtune=generic in Jan 2014.
That code is just directly putting the constants (1, 2, 3) at offset positions from the (updated) stack pointer (esp). The compiler is choosing to do the "push" manually with the same result.
"push" both sets the data and updates the stack pointer. In this case, the compiler is reducing that to only one update of the stack pointer (vs. three). An interesting experiment would be to try changing function "a" to take only one argument, and see if the instruction pattern changes.
gcc does all sorts of optimizations, including selecting instructions based upon the execution speed of the particular CPU being optimized for. You will notice that things like x *= n are often replaced by a mix of SHL, ADD and/or SUB, especially when n is a constant; MUL is only used when the average runtime (and cache/etc. footprint) of the SHL-ADD-SUB combination would exceed that of MUL, or when n is not a constant (and thus doing it with shl/add/sub would require a loop, which would be costlier).
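For instance (a hypothetical illustration), multiplying by the constant 10:
int times10(int x) {
    return x * 10;   /* on x86-64, gcc -O2 typically emits lea (%rdi,%rdi,4),%eax ; addl %eax,%eax instead of imul */
}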
In the case of function arguments: MOVs can be parallelized by hardware, while PUSHes cannot, because the second PUSH has to wait for the first PUSH to finish updating the esp register.
Is this on OS X by any chance? I read somewhere that it requires the stack pointer to be aligned at 16-byte boundaries. That could possibly explain this kind of code generation.
I found the article: http://blogs.embarcadero.com/eboling/2009/05/20/5607
