GCC extended assembly: pin a local variable to any register except r12 (C)

Basically, I am looking for a way to pin a temporary to any register except r12.
I know I can "hint" the compiler to pin it to a single register with:
// Toy example. Obviously an unbalanced `pop` in
// extended assembly will cause serious problems.
register long tmp asm("rdi"); // or just clobber rdi and use it directly.
asm volatile("pop %[tmp]\n" // using pop hence don't want r12
: [tmp] "=&r" (tmp)
:
:);
and this will generally work to avoid r12, but it might mess up the compiler's register allocation elsewhere.
Is it possible to do this without forcing the compiler to use a single register?

Note that register asm doesn't truly "pin" a variable to a register, it only ensures that uses of that variable as an operand in inline asm will use that register. In principle the variable may be stored elsewhere in between. See https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc/Local-Register-Variables.html#Local-Register-Variables. But it sounds like all you really need is to ensure that your pop instruction doesn't use r12 as its operand, possibly because of Why is POP slow when using register R12?. I'm not aware of any way to do precisely this, but here are some options that may help.
The registers rax, rbx, rcx, rdx, rsi, and rdi each have their own constraint letter: a, b, c, d, S, and D respectively (the other registers don't). So you can get about halfway there by doing:
long tmp;
asm volatile("pop %[tmp]\n"
: [tmp] "=&abcdSD" (tmp)
:
:);
This way the compiler has the option to choose any of those six registers, which should give the register allocator a lot more flexibility.
Another option is to declare that your asm clobbers r12, which will prevent the compiler from allocating operands there:
long tmp;
asm volatile("pop %[tmp]\n"
: [tmp] "=&r" (tmp)
:
: "r12");
The tradeoff is that it will also not use r12 to cache local variables across the asm, since it assumes that it may be modified. Hopefully it will be smart enough to just avoid using r12 in that part of the code at all, but if it can't, it may emit extra register moves or spill to the stack around your asm. Still, it's less brutal than -ffixed-r12 which would prevent the compiler from using r12 anywhere in the entire source file.
Future readers should note that in general it is unsafe to modify the stack pointer inside inline asm on x86-64. The compiler assumes that rsp isn't changed by inline asm, and it may access stack variables via effective addresses with constant offsets relative to rsp, at any time. Moreover, x86-64 uses a red zone, so even a push/pop pair is unsafe, because there may be important data stored below rsp. (And an unexpected pop may mean that other important data is no longer in the red zone and thus subject to overwriting by signal handlers.) So, you shouldn't do this unless you're willing to carefully read the generated assembly after every recompilation to make sure the compiler hasn't decided to do any of these things. (And before you ask, you cannot fix this by declaring a clobber of rsp; that's not supported.)
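If the value you need is actually reachable through a pointer, rather than existing only as an implicit stack slot, a safer pattern is to use a memory input operand so that rsp is never touched. A minimal sketch (load_not_pop is a hypothetical name, assuming your value lives somewhere addressable):
// Sketch: read the value through an "m" operand instead of popping,
// so the stack pointer is never modified inside the asm.
long load_not_pop(long *slot)
{
    long tmp;
    asm volatile("mov %[src], %[tmp]"
                 : [tmp] "=r" (tmp)
                 : [src] "m" (*slot));
    return tmp;
}
With an "m" operand the compiler picks the addressing mode itself, and any general-purpose register can hold the result (r12 included — the pop-specific penalty doesn't apply to an ordinary mov load).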

Related

Could you use C inline assembly to align instructions? (without compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of given code, but we must not use compiler optimizations to achieve it.
One of the ideas I had while reading the bibliography is to align the beginning of a basic block to the cache line size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some tricks like _mm_malloc to achieve data alignment, but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment; see footnote 1), but does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop (the target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm and especially never use GNU C Basic asm("foo"); inside a function. But in debug mode (the -O0 default, aka optimizations disabled), each statement (including asm();) compiles to a separate block of asm in source order. So your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst, with optimization enabled, the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary, and bloated code size.)
void foo(register int *p)
{
// always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
asm(" .p2align 5 # from inline asm");
for (register int *endp = p + 102400; p<endp ; p++) {
*p += 123;
}
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so it can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load / store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell, for example, maybe 1.25 cycles per iteration, or maybe a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so only 2 uops of the first uop cache line associated with this block are wasted, so we're still relatively good in terms of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
*p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
*p += 123;
p++;
}while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if(n!=0) {
register int *endp = p + n;
asm (".p2align 4");
do {
...
}while(p!=endp);
}
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
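For illustration only, given that caveat, here's a sketch of the loop above as a single extended-asm statement (AT&T syntax, GCC's default; add123 is a hypothetical wrapper):
// Sketch only: the whole do{}while() loop in one asm statement,
// with the alignment directive inside. Assumes p < endp on entry.
void add123(int *p, int *endp)
{
    asm volatile(
        ".p2align 5\n"
        "0:\n\t"
        "addl $123, (%[p])\n\t"   // *p += 123 (memory-destination add)
        "add  $4, %[p]\n\t"       // p++
        "cmp  %[end], %[p]\n\t"
        "jb   0b"                 // }while(p < endp)
        : [p] "+r" (p)
        : [end] "r" (endp)
        : "memory", "cc");
}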
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing. (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops, so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops), is when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days, before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop of a couple dozen uops. And Sandybridge-family added a uop cache to cache large loops, including multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/, especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only; otherwise it's meaningless, see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See
C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolescent keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not in C as yet.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
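A minimal sketch to see that yourself (spin_reg and spin_mem are hypothetical names): build both with gcc -O0 and time them.
// Only the register version keeps i out of memory at -O0,
// avoiding the store/reload latency chain on the loop counter.
void spin_reg(void) { for (register int i = 1000000000; --i; ) {} }
void spin_mem(void) { for (int i = 1000000000; --i; ) {} }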
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
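For example, a sketch (hypothetical function) of a pointer-width index, which lets the compiler use i directly in the addressing mode without re-widening it each iteration:
#include <stdint.h>

long sum_first_n(const int *arr, uintptr_t n)
{
    long sum = 0;
    for (register uintptr_t i = 0; i < n; i++)
        sum += arr[i];   // no sign-extension needed: i is already 64-bit
    return sum;
}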
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

Why do compilers insist on using a callee-saved register here?

Consider this C code:
void foo(void);
long bar(long x) {
foo();
return x;
}
When I compile it on GCC 9.3 with either -O3 or -Os, I get this:
bar:
push r12
mov r12, rdi
call foo
mov rax, r12
pop r12
ret
The output from clang is identical except for choosing rbx instead of r12 as the callee-saved register.
However, I want/expect to see assembly that looks more like this:
bar:
push rdi
call foo
pop rax
ret
Since you have to push something to the stack anyway, it seems shorter, simpler, and probably faster to just push your value there, instead of pushing some arbitrary callee-saved register's value there and then storing your value in that register. Ditto for the inverse after call foo when you're putting things back.
Is my assembly wrong? Is it somehow less efficient than messing with an extra register? If the answer to both of those are "no", then why don't either GCC or clang do it this way?
Godbolt link.
Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:
long foo(long);
long bar(long x) {
return foo(x * x) - x;
}
I get this:
bar:
push rbx
mov rbx, rdi
imul rdi, rdi
call foo
sub rax, rbx
pop rbx
ret
I'd rather have this:
bar:
push rdi
imul rdi, rdi
call foo
pop rdi
sub rax, rdi
ret
This time, it's only one instruction off vs. two, but the core concept is the same.
Godbolt link.
TL:DR:
Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
Inlining to create large functions is a better solution most of the time
There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.
Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.
I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/
Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.
If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.
Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)
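For example, the multiply-by-10 alternative mentioned above looks like this:
# 2x LEA: two 1-cycle uops, so 2 cycles of latency total,
# vs. imul eax, edi, 10: 3 cycles of latency but only 1 uop
lea eax, [rdi + rdi*4]   # eax = x*5
lea eax, [rax + rax]     # eax = x*10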
If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.
Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)
So yes push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.
It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either, of using push to reserve space and initialize a variable. What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?. And doing that for more than one local var would lead to larger .eh_frame stack unwind metadata because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)
IDK if it would be worth teaching compilers to look for this optimization
It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway. (Or optimizing for throughput if you know that latency from x to return value isn't important. But compilers don't know that and usually optimize for latency).
If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.
You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, in the general case most functions make multiple function calls. This optimization idea could apply if you really don't use x between any of the calls, just before the first and after the last; otherwise you have the problem of maintaining 16-byte stack alignment for each call if you're doing one pop after a call, before another call.
Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: a caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
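A sketch (hypothetical names) of what that implicit memory barrier means in practice:
int shared_counter;              // globally accessible data
void opaque(void);               // body not visible to the compiler

int demo(void)
{
    int local = 42;              // address never escapes: can stay in a register
    opaque();                    // must assume shared_counter may have changed,
                                 // so it gets reloaded from memory below
    return local + shared_counter;
}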
Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)
I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.
i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.
And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.
Why do compilers insist on using a callee-saved register here?
Because most compilers generate nearly the same code for a given function: they follow the calling conventions defined by the ABI your compiler targets.
You could define your own different calling conventions (e.g. passing even more function arguments in processor registers, or on the contrary "packing" by bitwise operations two short arguments in a single processor register, etc...), and implement your compiler following them. You probably would need to recode some of the C standard library (e.g. patch lower parts of GNU libc then recompile it, if on Linux).
IIRC, some calling conventions are different on Windows and on FreeBSD and on Linux for the same CPU.
Notice that with a recent GCC (e.g. GCC 10 in start of 2021) you could compile and link with gcc -O3 -flto -fwhole-program and in some cases get some inline expansion. You can also build GCC from its source code as a cross-compiler, and since GCC is free software, you can improve it to follow your private new calling conventions. Be sure to document your calling conventions first.
If performance matters to you a lot, you can consider writing your own GCC plugin doing even more optimizations. Your compiler plugin could even implement other calling conventions (e.g. using asmjit).
Consider also improving TinyCC or Clang or NWCC to fit your needs.
My opinion is that in many cases it is not worth spending months of your efforts to improve performance by just a few nanoseconds. But your employer/manager/client could disagree. Consider also compiling (or refactoring) significant parts of your software to silicon, e.g. through VHDL, or using specialized hardware, e.g. GPGPU with OpenCL or CUDA.

Does the -O0 compiler flag have the same effect as the volatile keyword in C?

When you use the -O0 compiler flag in C, you tell the compiler to avoid any kind of optimization. When you define a variable as volatile, you tell the compiler to avoid optimizing that variable. Can we use the two approaches interchangeably? And if so what are the pros and cons? Below are some pros and cons that I can think of. Are there any more?
Pros:
Using the -O0 flag is helpful if we have a big code base inside which the variables that should have been declared as volatile, are not. If the code is showing buggy behavior, instead of going in the code and finding which variables need to be declared as volatile, we can just use the -O0 flag to eliminate the possibility that optimization is causing the problem.
Cons:
The -O0 flag will affect the entire code while the volatile keyword only affects a specific variable. If we're working on a small microcontroller for example, this could be a problem since using -O0 may produce a big executable.
The short answer is: the volatile keyword does not mean "do not optimize". It is something completely different. It informs the compiler that the variable may be changed by something that is not visible to the compiler in the normal program flow. For example:
It can be changed by the hardware - usually registers mapped into the memory address space
It can be changed by a function that is never called explicitly - for example, an interrupt routine
It can be changed by another process or by hardware - for example, shared memory in multiprocessor/multicore systems
The volatile variable has to be read from its storage location every time it is used, and saved every time it was changed.
Here you have an example:
int foo(volatile int z)
{
return z + z + z + z;
}
int foo1(int z)
{
return z + z + z + z;
}
and the resulting code (with the -O0 optimization option):
foo(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-4]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
pop rbp
ret
foo1(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
sal eax, 2
pop rbp
ret
The difference is obvious, I think. The volatile variable is read 4 times; the non-volatile one is read once, then multiplied by 4.
You can play yourself here: https://godbolt.org/g/RiTU4g
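For a more typical real-world use of volatile, here is a sketch of polling a memory-mapped status register (the address is made up); without volatile, the compiler would be allowed to hoist the load out of the loop and spin forever on a stale value:
#include <stdint.h>

#define STATUS_REG (*(volatile uint32_t *)0x40001000u)  /* hypothetical MMIO address */

void wait_until_ready(void)
{
    /* volatile forces a fresh read from the hardware each iteration */
    while ((STATUS_REG & 1u) == 0) {
    }
}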
In most cases, if the program does not run correctly when you turn on compiler optimization, you have some hidden UB in your code. You should debug as long as needed to discover all of it. A correctly written program must run correctly at any optimization level.
Bear in mind that volatile does not mean or guarantee coherency or atomicity.
Compiler flag -O0 is in no way a replacement for proper use of volatile, because the code that does not work when it is properly optimized by the compiler is inherently broken. You do not want a broken code giving you an appearance of "working" until someone forgets to throw the -O0 switch.
It is unusual even for large code bases to have a need for many volatile variables, in terms of the total percentage of variables in the code. Fixing a large code base with missing volatile is likely to require finding a few strategic places where multiple variables need to be volatile, and fixing just these few, rather than taking a "shotgun approach" and disabling all optimizations.
Using the -O0 flag is helpful if we have a big code base inside which the variables that should have been declared as volatile, are not
You could use -O0 to debug and fix the problems in such cases.
If the code is showing buggy behavior, instead of going in the code and finding which variables need to be declared as volatile, we can just use the -O0 flag to eliminate the possibility that optimization is causing the problem.
That's a wrong conclusion. There's no guarantee that O0 "fixes" the problem due to some variable(s) missing volatile qualifier. The problem still exists in your code and needs to be fixed.
You seem to have misunderstood volatile. It's not something that controls compiler optimisation per se. -O0, on the other hand, typically disables most optimisations (though the compiler can still optimize).
In conclusion: no, they are totally different things serving different purposes. As such, there's no question of using one over the other or using them interchangeably.
There's no reason to disable compiler optimisations. You need to fix the problem in your code i.e, add volatile qualifiers to variable(s) that require it.
The existing answers already cover volatile pretty well, but I believe the root cause of this question has nothing to do with volatile.
If your code works with -O0 but doesn't with optimizations enabled, you may have a wide variety of bugs in your code, or it is also possible that the compiler is buggy. This being tagged "microcontroller", I wouldn't rule out compiler bugs.
It's possible that you have a buffer overrun or underrun, for example, and the optimizer simply arranges your code in a slightly different way which exposes the bug. Try running your code through a static code analyzer (such as cppcheck or llvm's static code analysis). Whether that's a feasible option depends on how microcontroller-specific your code is, though.
Finally, depending on the compiler, -O0 might still generate code that keeps some value in a register for a while unless volatile is used, so I wouldn't call -O0 a replacement for volatile in any case. (That's compiler specific naturally).

Locking register usage for a certain section of code [closed]

Let's consider a situation where we are writing in C code. When the compiler encounters a function call, my understanding is that it does the following:
Push all registers onto the stack
Jump to new function, do stuff in there
Pop old context off the stack back into the registers.
Now, some processors have 1 working register, some 32, some more than that. I'm mostly concerned with the larger numbers of registers. If my processor has 32 registers, the compiler will need to emit 32 push and pop instructions, just as base overhead for a function call. It would be nice if I could trade some compilation flexibility[1] in the function for fewer push and pop instructions. That is to say, I would like a way to tell the compiler: "For function foo(), only use 4 registers." This would imply that the compiler would only need to push/pop 4 registers before jumping to foo().
I realize this is pretty silly to worry about on a modern PC, but I am thinking more of a low-speed embedded system where you might be servicing an interrupt very quickly, or calling a simple function over and over. I also realize this could very quickly become an architecture-dependent feature. Processors that use a "Source Source -> Dest" instruction set (like ARM), as opposed to an accumulator (like the Freescale/NXP HC08), might have some lower limit on the number of registers we allow functions to use.
I do know the compiler uses tricks like inlining small functions to increase speed, and I realize I could inform most compilers to not generate the push/pop code and just hand code it myself in assembly, but my question focuses on instructing the compiler to do this from "C-Land".
My question is, are there compilers that allow this? Is this even necessary with optimizing compilers (do they already do this)?
[1] Compilation flexibility: By reducing the number of registers available to the compiler to use in a function body, you are restricting its flexibility, and it might need to utilize the stack more since it can't just use another register.
When it comes to compilers, registers and function calls you can generally think of the registers falling into one of three categories: "hands off", volatile and non-volatile.
The "hands off" category are those that the compiler will not generally be futzing around with unless you explicitly tell it to (such as with inline assembly). These may include debugging registers and other special purpose registers. The list will vary from platform to platform.
The volatile (or scratch / call-clobbered / caller-saved) set of registers are those that a function can futz around with without the need for saving. That is, the caller understands that the contents of those registers might not be the same after the function call. Thus, if the caller has any data in those registers that it wants to keep, it must save that data before making the call and then restore it after. On a 32-bit x86 platform, these volatile registers (sometimes called scratch registers) are usually EAX, ECX and EDX.
The non-volatile (or call-preserved or callee-saved) set of registers are those that a function must save before using them and restore to their original values before returning. They only need to be saved/restored by the called function if it uses them. On a 32-bit x86 platform, these are usually the remaining general purpose registers: EBX, ESI, EDI, ESP, EBP.
Hope this helps.
(I meant to just add a small example, but quickly got carried away. I would add my own answer if this question wasn't closed, but I'm going to leave this long section here because I think it's interesting. Condense it or edit it out entirely if you don't want it in your answer -- Peter)
For a more concrete example, the SysV x86-64 ABI is well-designed (with args passed in registers, and a good balance of call-preserved vs. scratch/arg regs). There are some other links in the x86 tag wiki explaining what ABIs / calling conventions are all about.
Consider a simple example with function calls that can't be inlined (because the definition isn't available):
int foo(int);
int bar(int a) {
return 5 * foo(a+2) + foo(a);
}
It compiles (on Godbolt with gcc 5.3 for x86-64 with -O3) to the following:
## gcc output
# AMD64 SysV ABI: first arg in e/rdi, return value in e/rax
# the call-preserved regs used are: rbp and rbx
# the scratch regs used are: rdx. (arg-passing / return regs are not call-preserved)
push rbp # save a call-preserved reg
mov ebp, edi # stash `a` in a call-preserved reg
push rbx # save another call-preserved reg
lea edi, [rdi+2] # edi=a+2 as an arg for foo. `add edi, 2` would also work, but they're both 3 bytes and little perf difference
sub rsp, 8 # align the stack to a 16B boundary (the two pushes are 8B each, and call pushes an 8B return address, so another 8B is needed)
call foo # eax=foo(a+2)
mov edi, ebp # edi=a as an arg for foo
mov ebx, eax # stash foo(a+2) in ebx
call foo # eax=foo(a)
lea edx, [rbx+rbx*4] # edx = 5*foo(a+2), using the call-preserved register
add rsp, 8 # undo the stack offset
add eax, edx # the add between the two function-call results
pop rbx # restore the call-preserved regs we saved earlier
pop rbp
ret # return value in eax
As usual, compilers could do better: instead of stashing foo(a+2) in ebx to survive the 2nd call to foo, it could have stashed 5*foo(a+2) with a single instruction (lea ebx, [rax+rax*4]). Also, only one call-preserved register is needed, since we don't need a after the 2nd call. This removes a push/pop pair, and also the sub rsp,8 / add rsp,8 pair. (gcc bug report already filed for this missed optimization)
## Hand-optimized implementation (still ABI-compliant):
push rbx # save a call-preserved reg; also aligns the stack
lea ebx, [rdi+2] # stash ebx=a+2
call foo # eax=foo(a)
mov edi, ebx # edi=a+2 as an arg for foo
mov ebx, eax # stash foo(a) in ebx, replacing `a+2` which we don't need anymore
call foo # eax=foo(a+2)
lea eax, [rax+rax*4] #eax=5*foo(a+2)
add eax, ebx # eax=5*foo(a+2) + foo(a)
pop rbx # restore the call-preserved regs we saved earlier
ret # return value in eax
Note that the call to foo(a) happens before foo(a+2) in this version. It saved an instruction at the start (since we can pass on our arg unchanged to the first call to foo), but removed a potential saving later (since the multiply-by-5 now has to happen after the second call, and can't be combined with moving into the call-preserved register).
I could get rid of an extra mov if it was 5*foo(a) + foo(a+2). With the expression as I wrote it, I can't combine arithmetic with data movement (using lea) in every case. Or I'd need to both save a and do a separate add edi,2 before the first call.
Push all registers onto the stack
No. In the vast majority of function calls in optimized code, only a small fraction of all registers are pushed on the stack.
I'm mostly concerned with the larger number of registers.
Do you have any experimental evidence to support this concern? Is this a performance bottleneck?
I could trade some compilation flexibility[1] in the function for less
push and pop instructions.
Modern compilers use sophisticated inter-procedural register allocation. By limiting the number of registers, you will most likely degrade performance.
I realize this is pretty silly to worry about on a modern PC, but I am
thinking more for a low speed embedded system where you might be
servicing an interrupt very quickly, or calling a simple function over
and over.
This is very vague. You have to show the "simple" function, all call sites and specify the compiler and the target embedded system. You need to measure performance (compared to hand-written assembly code) to determine whether this is a problem in the first place.

understanding the keywords eax and mov

I am trying to understand the registers in asm, but every website I look at just assumes I know something about registers and I just cannot get a grip on it. I know about a book's worth of C++, and as far as I know mov var1,var2 would be the same thing as var1 = var2, correct?
But with the eax register I am completely lost. Any help is appreciated.
Consider registers as per-processor global variables. There's "eax", "ebx", and a bunch of others. Furthermore, you can only perform certain operations via registers - for example there's no instruction to read from one memory location and write it to another (except when the locations are denoted by certain registers - see movsb instruction, etc).
So the registers are generally used only for temporary storage of values that are needed for some operation, but they usually are not used as global variables in the conventional sense.
You are right that "mov var1, var2" is essentially an assignment - but you cannot use two memory-based variables as operands; that's not supported. You could instead do:
mov eax, var1
mov var2, eax
... which has the same effect, using the eax register as a temporary.
eax refers to a processor register (essentially a variable)
mov is an instruction to copy data from one place to another (register to register, register to memory, and so on). So essentially you are correct (in a handwavey sense).
Do you have an example assembly block you want to discuss?
Think of eax as a storage location where a value can be kept, much like in C++ where int, long, ... and other types specify the size of a variable's storage. The difference is that eax is not in memory: it is a 32-bit register inside the processor itself. The e part of eax means extended (it's the 32-bit extension of the older 16-bit ax register). This register is used implicitly by the multiplication and division instructions and is conventionally called the accumulator register.
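For instance, the multiply instruction uses eax implicitly (a minimal sketch, Intel syntax):
mov eax, 6    # put one factor in the accumulator
mov ecx, 7
mul ecx       # unsigned multiply: edx:eax = eax * ecx = 42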
