Why does GCC use xor to clear a register? [duplicate] - c

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax

TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.
Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.
GP-integer examples:
xor eax, eax ; RAX = 0. Including AL=0 etc.
xor r10d, r10d ; R10 = 0. Still prefer 32-bit operand-size.
xor edx, edx ; RDX = 0
; small code-size alternative: cdq ; zero RDX if EAX is already zero
; SUB-OPTIMAL
xor rax,rax ; waste of a REX prefix, and extra slow on Silvermont
xor r10,r10 ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov eax, 0 ; doesn't touch FLAGS, but not faster and takes more bytes
and eax, 0 ; false dependency. (Microbenchmark experiments might want this)
sub eax, eax ; same as xor on most but not all CPUs; bad on Silvermont for example.
xor cl, cl ; false dep on some CPUs, not a zeroing idiom. Use xor ecx,ecx
mov cl, 0 ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified
Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).
xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).
On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).
Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.
XMM/YMM/ZMM examples
# Good:
xorps xmm0, xmm0 ; smallest code size (for non-AVX)
pxor xmm0, xmm0 ; costs an extra byte, runs on any port on Nehalem.
xorps xmm15, xmm15 ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX. Code-size is the only penalty.
# Good with AVX:
vpxor xmm0, xmm0, xmm0 ; zeros X/Y/ZMM0
vpxor xmm15, xmm0, xmm0 ; zeros X/Y/ZMM15, still only 2-byte VEX prefix
#sub-optimal AVX
vpxor xmm15, xmm15, xmm15 ; 3-byte VEX prefix because of high source reg
vpxor ymm0, ymm0, ymm0 ; decodes to 2 uops on AMD before Zen2
# Good with AVX512
vpxor xmm15, xmm0, xmm0 ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
vpxord xmm30, xmm30, xmm30 ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD. May be worth using only high regs to avoid needing vzeroupper in short functions.
# Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
vpxord zmm30, zmm30, zmm30 ; Without AVX512VL you have to use a 512-bit instruction.
# sub-optimal with AVX512 (even without AVX512VL)
vpxord zmm0, zmm0, zmm0 ; EVEX prefix (4 bytes), and a 512-bit uop. Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.
See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers. SSE/AVX vpcmpeqd is dep-breaking on many CPUs (although it still needs a uop to write the 1s), but AVX512 vpternlogd for ZMM regs isn't even dep-breaking. Inside a loop, consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.
But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.
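For example, here's a minimal sketch (GAS Intel syntax, assuming AVX2; the function name, loop payload, and register choices are hypothetical) of hoisting an all-ones constant out of a loop and copying it, instead of re-running vpcmpeqd every iteration, while xor-zeroing freely inside the loop:
.intel_syntax noprefix
ones_loop_sketch:              # hypothetical: RDI = iteration count
vpcmpeqd ymm1, ymm1, ymm1      # materialize all-ones once, before the loop
.Lloop:
vmovdqa ymm0, ymm1             # copy the constant (often mov-eliminated) instead of another ALU uop
vpxor xmm2, xmm2, xmm2         # xor-zeroing inside the loop is fine on most CPUs
dec rdi
jnz .Lloop
vzeroupper
ret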
What's special about zeroing idioms like xor on various uarches
Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.
xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):
smaller code-size than mov reg,0. (All CPUs)
avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)
Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
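For example, the standard encodings (checkable by assembling this and running objdump -d; the label is only for illustration):
.intel_syntax noprefix
zero_size_demo:
xor eax, eax    # 31 c0           -> 2 bytes, recognized zeroing idiom
mov eax, 0      # b8 00 00 00 00  -> 5 bytes, the imm32 wastes space
xor rax, rax    # 48 31 c0        -> 3 bytes, the REX.W prefix buys nothing
ret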
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.
See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).
On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.
Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
The processor recognizes the XOR of a register with itself as setting
it to zero. A special tag in the register remembers that the high part
of the register is zero so that EAX = AL. This tag is remembered even
in a loop:
; Example 7.9. Partial register problem avoided in loop
xor eax, eax
mov ecx, 100
LL:
mov al, [esi]
mov [edi], eax ; No extra uop
inc esi
add edi, 4
dec ecx
jnz LL
(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as
you don't get an interrupt, misprediction, or other serializing event.
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.
Ideally, you should use xor / set flags / setcc / read full register:
...
call some_func
xor ecx,ecx ; zero *before* the test
test eax,eax
setnz cl ; cl = (some_func() != 0)
add ebx, ecx ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.
Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
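A minimal sketch of the setcc / movzx alternative (GAS Intel syntax; the label and register choices are hypothetical), for comparison with the xor / test / setcc sequence above:
.intel_syntax noprefix
flags_already_set:        # assume FLAGS were just set by something we couldn't xor-zero ahead of
setnz cl                  # writes only CL: false dep on the old RCX on AMD/P4/Silvermont
movzx ecx, cl             # zero-extend: eliminated on IvB, one ALU uop on Haswell and later
add ebx, ecx              # read the full register with no partial-reg penalty
ret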
Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor and many disadvantages.
It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.
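For instance, a hypothetical latency-test loop (GAS Intel syntax; the imul payload and register choices are made up) where and keeps the loop-carried dependency that xor-zeroing would break:
.intel_syntax noprefix
latency_chain_demo:       # hypothetical: RDI = iteration count
.Lchain:
and ecx, 0                # NOT dep-breaking: still depends on the old ECX
add ecx, 5                # known value at the start of each iteration's work
imul ecx, ecx             # payload whose latency is being measured
dec rdi
jnz .Lchain
ret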
See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.
See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.
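A sketch of that mov + xor combination; only worth it on those early P6 CPUs (PPro through PIII and early PM):
.intel_syntax noprefix
ppro_zero_demo:
mov eax, 0                # breaks the dependency on the old EAX (mov never reads its destination)
xor eax, eax              # sets the internal "upper bits zero" tag so EAX=AX=AL, avoiding partial-reg stalls
ret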
TL:DR:
If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.
mov-zero ahead of setcc is better for latency than movzx reg32, reg8 after (except on Intel when you can pick different registers), but worse code size.

Related

Could you use C inline assembly to align instructions? (without Compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of a given code but we must not use compiler optimizations to achieve it.
One of the ideas I had reading the bibliography is to align the beginning of a basic block to the cache-line size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some trick like _mm_malloc to achieve data alignment but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment1), but does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop. (The target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm, and especially never use GNU C Basic asm("foo"); inside a function. But in debug mode (the -O0 default, aka optimizations disabled), each statement (including asm();) compiles to a separate block of asm in source order, so your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst, with optimization enabled, the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary. And bloated code size.)
void foo(register int *p)
{
// always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
asm(" .p2align 5 # from inline asm");
for (register int *endp = p + 102400; p<endp ; p++) {
*p += 123;
}
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load / store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell for example, maybe 1.25 cycles per iteration, or maybe a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in term of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
*p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
*p += 123;
p++;
}while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if(n!=0) {
register int *endp = p + n;
asm (".p2align 4");
do {
...
}while(p!=endp);
}
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops) is that when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of the loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction-length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops). And Sandybridge-family added a uop cache to cache large loops that include multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/ especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only, otherwise it's meaningless: see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See
C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not in C as yet.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

CSAPP: Why use subq followed by movq when we have pushq already [duplicate]

I believe push/pop instructions will result in more compact code, and maybe will even run slightly faster. This requires disabling stack frames as well, though.
To check this, I will need to either rewrite a large enough program in assembly by hand (to compare them), or to install and study a few other compilers (to see if they have an option for this, and to compare the results).
Here is the forum topic about this and similar problems.
In short, I want to understand which code is better. Code like this:
sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c
or code like this:
push eax
push ecx
push edx
...
add esp, c
What compiler can produce the second kind of code? They usually produce some variation of the first one.
You're right, push is a minor missed-optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly some performance, to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.
But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!
Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, that gums-up a super-scalar core. (Hans Passant)
This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.
Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)
It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.
On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)
It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store / load bandwidth to L1d.
int extfunc(int *,int *);
void foo() {
int a=1, b=2;
extfunc(&a, &b);
}
compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many different missed-optimizations and silly code-gen (see my comments in the C source pointing out some of them):
# compiled for the x86-64 System V calling convention:
# integer args in rdi, rsi (,rdx, rcx, r8, r9)
push rax # clang / ICC ALREADY use push instead of sub rsp,8
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1 # 6 bytes: opcode + modrm + imm32
mov rsi, rsp # special case for lea rsi, [rsp + 0]
mov dword ptr [rsi], 2
call extfunc(int*, int*)
pop rax # and POP instead of add rsp,8
ret
And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention which reserves shadow space instead of having a red-zone).
clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.
But anyway, more optimal would be:
push 2 # only 2 bytes
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1
mov rsi, rsp # special case for lea rsi, [rsp + 0]
call extfunc(int*, int*)
# ... later accesses would use [rsp] and [rsp+] if needed, not pop
pop rax # alternative to add rsp,8
ret
The push is an 8-byte store, and we overlap half of it. This is not a problem, CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything without caring about whether or not there's overlap.
Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.
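To make the overlap concrete, here's where the bytes land (a sketch of the same idea, addressing [rsp+4] directly instead of through RDI; little-endian, push imm8 sign-extended to a qword):
push 2                          # one 8-byte store: dword [rsp] = 2, dword [rsp+4] = 0
mov dword ptr [rsp + 4], 1      # overwrite just the zeroed upper half
# result: b = 2 at [rsp] and a = 1 at [rsp+4], contiguous, allocated with a single push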
We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)
I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.
I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)
If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.
We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)
An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset 8 + 16*1 = 24 bytes, so we could do
bar3:
push 3
push 2 # don't interleave mov in here; extra stack-sync uops
push 1
mov rdi, rsp
lea rsi, [rsp+8]
lea rdx, [rdi+16] # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
call extfunc3(int*,int*,int*)
add rsp, 24
ret
This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.
push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)
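Spelled out as bytes (standard encodings; verify with objdump -d if curious):
push 1                          # 6a 01                    -> 2 bytes (push imm8)
mov dword ptr [rsp + 8], 1      # c7 44 24 08 01 00 00 00  -> 8 bytes (opcode+modrm+SIB+disp8+imm32)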
One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)
However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted, we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.
A tiny array of long, with non-constant contents
int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
long arr[] = {a,b,c};
ext_longarr(arr);
}
gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:
longarr_arg(long, long, long): # #longarr_arg(long, long, long)
sub rsp, 24
mov rax, rsp # this is clang being silly
mov qword ptr [rax], rdi # it could have used [rsp] for the first store at least,
mov qword ptr [rax + 8], rsi # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
mov qword ptr [rax + 16], rdx
mov rdi, rax
call ext_longarr(long*)
add rsp, 24
ret
But it could have stored an array of the args like this:
longarr_arg_handtuned:
push rdx
push rsi
push rdi # leave stack 16B-aligned
mov rdi, rsp
call ext_longarr(long*)
add rsp, 24
ret
With more args, we start to get more noticeable benefits especially in code-size when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.
Reasons for not building stack frames gradually
(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.
But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.
This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.
Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.
If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red-zone, with a different offset)
Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.
"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.
Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack layout choices.
Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.
This requires disabling stack frames as well though.
It doesn't, actually. Simple stack frame initialisation can use either enter or push ebp \ mov ebp, esp \ sub esp, x (or instead of the sub, a lea esp, [ebp - x] can be used). Instead of or additionally to these, values can be pushed onto the stack to initialise the variables, or just pushing any random register to move the stack pointer without initialising to any certain value.
Here's an example (for 16-bit 8086 real/V 86 Mode) from one of my projects: https://bitbucket.org/ecm/symsnip/src/ce8591f72993fa6040296f168c15f3ad42193c14/binsrch.asm#lines-1465
save_slice_farpointer:
[...]
.main:
[...]
lframe near
lpar word, segment
lpar word, offset
lpar word, index
lenter
lvar word, orig_cx
push cx
mov cx, SYMMAIN_index_size
lvar word, index_size
push cx
lvar dword, start_pointer
push word [sym_storage.main.start + 2]
push word [sym_storage.main.start]
The lenter macro sets up (in this case) only push bp \ mov bp, sp and then lvar sets up numeric defs for offsets (from bp) to variables in the stack frame. Instead of subtracting from sp, I initialise the variables by pushing into their respective stack slots (which also reserves the stack space needed).

What is the lea instruction doing in this piece of code? [duplicate]

I was trying to understand how address-computation instructions work, especially the leaq command. Then I got confused when I saw examples using leaq to do arithmetic computation. For example, the following C code,
long m12(long x) {
return x*12;
}
In assembly,
leaq (%rdi, %rdi, 2), %rax
salq $2, %rax
If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2), which should be 2*%rdi+%rdi, evaluates to into %rax. What confuses me is that since the value x is stored in %rdi, which is just a memory address, why does multiplying %rdi by 3 and then left-shifting this memory address by 2 equal x times 12? Isn't it the case that when we multiply %rdi by 3, we jump to another memory address which does not hold the value x?
lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.
See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
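For example, one possible pair (Intel syntax; the function name is made up), using 37 = 9*4 + 1:
mul37:                    # input in EDI, result in EAX
lea eax, [rdi + rdi*8]    # eax = x*9
lea eax, [rdi + rax*4]    # eax = x + (x*9)*4 = x*37
ret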
In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling for the object size of arr since this is C, not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program to put instructions in the right order to get useful results.
Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)
The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.
(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)
Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on original 8086 reuses the CPU's effective-address decoding and calculation hardware.
Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.
lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instruction with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual and https://uops.info/)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.
Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).
The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.
So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.
x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)
It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:
;; Intel syntax.
lea eax, [rdi + rsi*4 - 8] ; 3 cycle latency on Intel SnB-family
; 2-component LEA is only 1c latency
;;; without LEA:
mov eax, esi ; maybe 0 cycle latency, otherwise 1
shl eax, 2 ; 1 cycle latency
add eax, edi ; 1 cycle latency
sub eax, 8 ; 1 cycle latency
On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, uops that would otherwise take up space in the reorder buffer all the way until retirement.
lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:
non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS can be handy after a test, before a cmovcc (see the sketch after this list). Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.
7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.
In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.
Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.
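A sketch of the flags-preservation point from the list above (Intel syntax; the surrounding logic is made up):
select_demo:              # hypothetical: return (edi != 0) ? edi + 4 : esi
test edi, edi             # set flags
lea eax, [rdi + 4]        # math in the gap without clobbering the flags from test
cmovz eax, esi            # cmov still sees the result of the test
ret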
See also the x86 tag wiki for assembly guides / manuals, and performance info.
Operand-size vs. address-size for x86-64 lea
See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.
x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)
Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
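The three encodings side by side (Intel syntax; standard encodings, with byte counts in the comments):
lea ecx, [rdx + rbp*1]    # 8d 0c 2a     -> 3 bytes: 32-bit operand size, 64-bit address size
lea rcx, [rdx + rbp*1]    # 48 8d 0c 2a  -> 4 bytes: REX.W, only needed for 64-bit math
lea ecx, [edx + ebp*1]    # 67 8d 0c 2a  -> 4 bytes: useless address-size prefix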
This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.
leaq doesn't have to operate on memory addresses, and it computes an address, it doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number, plus 1, 2, 4 or 8 times another number (or the same number in this case). It's frequently "abused"† for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.
Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).
So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which it presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).
†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.
LEA is for calculating the address. It doesn't dereference the memory address
It should be much more readable in Intel syntax
m12(long):
lea rax, [rdi+rdi*2]
sal rax, 2
ret
So the first line is equivalent to rax = rdi*3
Then the left shift is to multiply rax by 4, which results in rdi*3*4 = rdi*12

Push/pop around a call on x86_64 (SysV ABI) [duplicate]

In the assembly of the C++ source below, why is RAX pushed to the stack?
RAX, as I understand it from the ABI, could contain anything from the calling function. But we save it here, and then later move the stack pointer back by 8 bytes. So the RAX on the stack is, I think, only relevant for the std::__throw_bad_function_call() operation ... ?
The code:
#include <functional>
void f(std::function<void()> a)
{
a();
}
Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:
f(std::function<void ()>): # #f(std::function<void ()>)
push rax
cmp qword ptr [rdi + 16], 0
je .LBB0_1
add rsp, 8
jmp qword ptr [rdi + 24] # TAILCALL
.LBB0_1:
call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()> wrapper for comparison:
void g(void(*a)())
{
a();
}
The output is trivial:
g(void (*)()): # #g(void (*)())
jmp rdi # TAILCALL
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.
call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.
(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
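A minimal sketch of the pattern (hypothetical function names, Intel syntax):
my_func: ; on entry, RSP % 16 == 8 because the caller's call pushed a return address
push rax ; dummy push: RSP % 16 == 0 again
call other_func ; the ABI's alignment requirement is satisfied at this call
pop rcx ; undo the dummy push; the value popped doesn't matter
ret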
The reason push rax is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where the je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is
passed on stack) byte boundary. In other words, the value (%rsp + 8) is always
a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f, the return address was placed on the stack, misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call(), the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack, and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function, not CALL it. When function a does a RET it will return directly back to the function that called f.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.
As #CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
In other cases, clang typically fixes up the stack before returning with a pop rcx.
Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
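For reference, a rough code-size comparison (a sketch; byte counts for 64-bit mode, not taken from the compiler output above):
push rax ; 1 byte: 50
sub rsp, 8 ; 4 bytes: 48 83 EC 08
add rsp, 8 ; 4 bytes: 48 83 C4 08
pop rcx ; 1 byte: 59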
This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
# example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg (a copy of x)
xor edx, edx ; zero RDX: the dividend for unsigned div is RDX:RAX
div rbx ; very high latency; remainder (x % y) left in RDX
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops lets push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output of push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instructions. The ROB would fill up even without the push, because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call, are even older and can't retire either until rax is ready. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)
When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)
A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8 that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPU like Nehalem has a 36-entry RS, vs. 54 in Sandybridge or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
There are cases where push rax / pop rcx can be more dangerous:
The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
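A sketch of what such a false dependency in the caller could look like (hypothetical instruction sequence, Intel syntax):
call our_func ; our_func ends with pop rcx to undo its dummy push
test eax, eax
setz cl ; on CPUs that don't rename low8 separately, writing CL merges into the old RCX,
; so this can't complete until the pop rcx result has been store-forwarded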
It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall of 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)
pop rcx at the end "cleans" rcx of any partial-register stuff. Since cl is used for shift counts, and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
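For example (a hypothetical callee), a variable-count shift forces the count into CL, and a compiler might write only the byte register:
mov cl, sil ; variable shift count has to be in CL (without BMI2 shlx)
shl edi, cl ; partial write to CL leaves the rest of RCX unchanged ("dirty" on P6-family)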
push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.

Using LEA on values that aren't addresses / pointers?

I was trying to understand how the address-computation instruction works, especially with the leaq command. Then I got confused when I saw examples using leaq to do arithmetic computation. For example, the following C code,
long m12(long x) {
return x*12;
}
In assembly,
leaq (%rdi, %rdi, 2), %rax
salq $2, %rax
If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2), which should be 2*%rdi+%rdi, evaluates to into %rax. What confuses me is that since the value x is stored in %rdi, which is just a memory address, why does multiplying %rdi by 3 and then left-shifting this memory address by 2 equal x times 12? Isn't it that when we multiply %rdi by 3, we jump to another memory address which does not hold the value x?
lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.
See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
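For instance, the trick in that linked Q&A boils down to something like this (a sketch in AT&T syntax; 37 = 9*4 + 1):
leal (%rdi,%rdi,8), %eax # eax = rdi*9
leal (%rdi,%rax,4), %eax # eax = rdi + eax*4 = rdi*37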
In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling by the object size of arr, since this is C, not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program to put instructions in the right order to get useful results.
Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)
The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.
(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)
Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on the original 8086 reuses the CPU's effective-address decoding and calculation hardware.
Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.
lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instruction with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual and https://uops.info/)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.
Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).
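To make those classes concrete (Intel syntax; which form is the slow one depends on the microarchitecture, as described above):
lea eax, [rdi + rsi] ; 2 components, no scaled index: the fast case everywhere
lea eax, [rdi + rsi*4] ; scaled index: lower throughput on Ice Lake, slower on Zen and Gracemont
lea eax, [rdi + rsi*4 + 8] ; 3 components (two + operations): the slow case on SnB-family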
The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.
So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.
x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)
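A sketch of the difference (my_var is a hypothetical static variable; AT&T syntax, and the 32-bit sequence is only roughly what gcc emits for a locally-defined symbol):
# x86-64 PIC: one RIP-relative LEA
leaq my_var(%rip), %rdi
# 32-bit PIC: read EIP via a call to a thunk, then add the GOT offset
call __x86.get_pc_thunk.cx
addl $_GLOBAL_OFFSET_TABLE_, %ecx
leal my_var@GOTOFF(%ecx), %edi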
It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:
;; Intel syntax.
lea eax, [rdi + rsi*4 - 8] ; 3 cycle latency on Intel SnB-family
; 2-component LEA is only 1c latency
;;; without LEA:
mov eax, esi ; maybe 0 cycle latency, otherwise 1
shl eax, 2 ; 1 cycle latency
add eax, edi ; 1 cycle latency
sub eax, 8 ; 1 cycle latency
On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, uops that would otherwise take up space in the reorder buffer all the way until retirement.
lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:
non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS can be handy after a test before a cmovcc. Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.
7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.
In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.
Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.
See also the x86 tag wiki for assembly guides / manuals, and performance info.
Operand-size vs. address-size for x86-64 lea
See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.
x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)
Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
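For reference, the encoding sizes of the three forms above (a sketch; byte counts assume 64-bit mode):
lea (%rdx,%rbp), %ecx # 3 bytes: no prefixes needed (64-bit address size, 32-bit operand size)
lea (%rdx,%rbp), %rcx # 4 bytes: REX.W prefix for 64-bit operand size
lea (%edx,%ebp), %ecx # 4 bytes: 0x67 address-size prefix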
This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.
leaq doesn't have to operate on memory addresses: it computes an address but doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number to 1, 2, 4 or 8 times another number (or the same number, in this case). It's frequently "abused"† for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.
Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).
So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which the compiler presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).
†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.
LEA is for calculating the address; it doesn't dereference the memory address.
It should be much more readable in Intel syntax:
m12(long):
lea rax, [rdi+rdi*2]
sal rax, 2
ret
So the first line is equivalent to rax = rdi*3.
Then the left shift multiplies rax by 4, which results in rdi*3*4 = rdi*12.
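As a quick sanity check with a concrete value, take rdi = 5: the lea computes 5 + 5*2 = 15 (that's 5*3), and sal rax, 2 then gives 15*4 = 60, which is indeed 5*12.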

Resources