I know the kernel uses the likely and unlikely macros liberally. The macros are documented under Built-in Function: long __builtin_expect (long exp, long c), but the docs don't really discuss the details.
How exactly does a compiler handle likely(x) and __builtin_expect((x),1)?
Is it handled by the code generator or the optimizer?
Does it depend upon optimization levels?
What's an example of the code generated?
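(For reference, the macros themselves are thin wrappers around the builtin. Below is a sketch of the usual kernel-style definitions plus a typical use; process() is just an illustrative function, not taken from any particular kernel version.)

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Typical use: mark the rare path unlikely so the compiler lays out the
   common path as straight-line, fall-through code. */
int process(const char *buf, int len)
{
    if (unlikely(buf == 0))
        return -1;      /* rare error path */
    return len;         /* common case */
}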
I just tested a simple example on gcc.
For x86 this seems to be handled by the optimizer and to depend on the optimization level, although I guess a correct answer here would be "it depends on the compiler".
The code generated is CPU dependent. Some CPUs (sparc64 comes immediately to mind, but I'm sure there are others) have flags on conditional branch instructions that tell the CPU how to predict the branch, so the compiler generates "predict true/predict false" instructions depending on its built-in rules and on hints from the code (like __builtin_expect).
Intel documents their behavior here: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts . In short the behavior on Intel CPUs is that if the CPU has no previous information about a branch it will predict forward branches as unlikely to be taken, while backwards branches are likely to be taken (think about loops vs. error handling).
This is some example code:
int bar(int);
int
foo(int x)
{
if (__builtin_expect(x>10, PREDICTION))
return bar(10);
return 42;
}
Compiled with (I'm using -fomit-frame-pointer to make the output more readable, but I still cleaned it up below):
$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=0 -o 00.s foo.c
$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=1 -o 01.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=0 -o 20.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=1 -o 21.s foo.c
There's no difference between 00.s and 01.s, so that shows that this is dependent on optimization (for gcc at least).
Here's the (cleaned up) generated code for 20.s:
foo:
cmpl $10, %edi
jg .L2
movl $42, %eax
ret
.L2:
movl $10, %edi
jmp bar
And here is 21.s:
foo:
cmpl $10, %edi
jle .L6
movl $10, %edi
jmp bar
.L6:
movl $42, %eax
ret
As expected the compiler rearranged the code so that the branch we don't expect to take is done in a forward branch.
Related
I'm trying to read the values of the assembly registers rdi, rsi, rdx, rcx, and r8, but I'm getting the wrong values. I don't know whether what I'm doing actually reads those registers or instead tells the compiler that it may write to them, and if that's the case, how could I achieve what I'm trying to do (put the values of assembly registers into C variables)?
When this code compiles (with gcc -S test.c)
#include <stdio.h>
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
it outputs the following assembly code (before the function call):
movl $1, %edi
movl $2, %esi
movl $3, %edx
movl $4, %ecx
movl $5, %r8d
callq _beautiful_function
When I compile and execute it, it outputs this:
0
0
4294967296
140732705630496
140732705630520
(some undefined values)
What did I do wrong? And how could I do this properly?
Your code didn't work because the GCC manual section Specifying Registers for Local Variables explicitly tells you not to do what you did:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm (see Extended Asm).
Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc:
Passing parameters to or from Basic asm
Passing parameters to or from Extended asm without using input or output operands.
Passing parameters to or from routines written in assembler (or other languages) using non-standard calling conventions.
To put the value of registers in variables, you can use Extended asm, like this:
long rdi, rsi, rdx, rcx;
register long r8 asm("r8");
asm("" : "=D"(rdi), "=S"(rsi), "=d"(rdx), "=c"(rcx), "=r"(r8));
But note that even this might not do what you want: the compiler is within its rights to copy the function's parameters elsewhere and reuse the registers for something different before your Extended asm runs, or even to not pass the parameters at all if you never read them through the normal C variables. (And indeed, even what I posted doesn't work when optimizations are enabled.) You should strongly consider just writing your whole function in assembly instead of inline assembly inside of a C function if you want to do what you're doing.
Even if you had a valid way of doing this (which this isn't), it probably only makes sense at the top of a function which isn't inlined. So you'd probably need __attribute__((noinline, noclone)). (noclone is a GCC attribute that clang will warn about not recognizing; it means not to make an alternate version of the function with fewer actual args, to be called in the case where some of them are known constants that can get propagated into the clone.)
register ... asm local vars aren't guaranteed to do anything except when used as operands to Extended Asm statements. GCC does sometimes still read the named register if you leave it uninitialized, but clang doesn't. (And it looks like you're on a Mac, where the gcc command is actually clang, because so many build scripts use gcc instead of cc.)
So even without optimization, the stand-alone non-inlined version of your beautiful_function is just reading uninitialized stack space when it reads your rdi C variable in const long save_rdi = rdi;. (GCC does happen to do what you wanted here, even at -Os, which optimizes but chooses not to inline your function. See clang and GCC (targeting Linux) on Godbolt, with asm + program output.)
Using an asm statement to make register asm do something
(This does what you say you want (reading registers), but because of other optimizations, still doesn't produce 1 2 3 4 5 with clang when the caller can see the definition. Only with actual GCC. There might be a clang option to disable some relevant IPA / IPO optimization, but I didn't find one.)
You can use an asm volatile() statement with an empty template string to tell the compiler that the values in those registers are now the values of those C variables. (The register ... asm declarations force it to pick the right register for the right variable)
#include <stdlib.h>
#include <stdio.h>
__attribute__((noinline,noclone))
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
// "activate" the register-asm locals:
// associate register values with C vars here, at this point
asm volatile("nop # asm statement here" // can be empty, nop is just because Godbolt filters asm comments
: "=r"(rdi), "=r"(rsi), "=r"(rdx), "=r"(rcx), "=r"(r8) );
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
This makes asm in your beautiful_function that does capture the incoming values of your registers. (It doesn't inline, and the compiler happens not to have emitted any instructions before the asm statement that step on any of those registers. The latter is not guaranteed in general.)
On Godbolt with clang -O3 and gcc -O3
gcc -O3 does actually work, printing what you expect. clang still prints garbage, because the caller sees that the args are unused, and decides not to set those registers. (If you'd hidden the definition from the caller, e.g. in another file without LTO, that wouldn't happen.)
(With GCC, the noinline,noclone attributes are enough to disable this inter-procedural optimization, but not with clang. Not even compiling with -fPIC, which normally makes symbol interposition possible, stops clang from assuming it knows the definition. I guess the idea is that symbol interposition to provide an alternate definition of beautiful_function that does use its args would violate the one-definition rule in C. So if clang can see a definition for a function, it assumes that's how the function works, even if it isn't allowed to actually inline it.)
With clang:
main:
pushq %rax # align the stack
# arg-passing optimized away
callq beautiful_function@PLT
# indirect through the PLT because I compiled for Linux with -fPIC,
# and the function isn't "static"
xorl %eax, %eax
popq %rcx
retq
But the actual definition for beautiful_function does exactly what you want:
# clang -O3
beautiful_function:
pushq %r14
pushq %rbx
nop # asm statement here
movq %rdi, %r9 # copying all 5 register outputs to different regs
movq %rsi, %r10
movq %rdx, %r11
movq %rcx, %rbx
movq %r8, %r14
leaq .L.str(%rip), %rdi
xorl %eax, %eax
movq %r9, %rsi # then copying them to printf args
movq %r10, %rdx
movq %r11, %rcx
movq %rbx, %r8
movq %r14, %r9
popq %rbx
popq %r14
jmp printf@PLT # TAILCALL
GCC wastes fewer instructions, just for example starting with movq %r8, %r9 to move your r8 C var as the 6th arg to printf. Then movq %rcx, %r8 to set up the 5th arg, overwriting one of the output registers before it's read all of them. Something clang was over-cautious about. However, clang does still push/pop %r12 around the asm statement; I don't understand why. It ends by tailcalling printf, so it wasn't for alignment.
Related:
How to specify a specific register to assign the result of a C expression in inline assembly in GCC? - the opposite problem: materialize a C variable value in a specific register at a certain point.
Reading a register value into a C variable - the previous canonical Q&A, which uses the now-unsupported register ... asm("regname") method like you were trying to, or a register-asm global variable, which hurts the efficiency of all your code by permanently reserving that register (see the sketch just below).
I forgot I'd answered that Q&A, making basically the same points as this. And some other points, e.g. that this doesn't work on registers like the stack pointer.
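For completeness, a global register variable looks like the sketch below (a GCC feature; my_reserved_slot is a hypothetical name). It's shown only to illustrate the efficiency cost mentioned above, not as a recommendation:

/* Every translation unit compiled with this declaration permanently
   reserves %r15: the compiler never allocates it for anything else,
   which is why it slows down all the surrounding code. */
register unsigned long my_reserved_slot asm("r15");

unsigned long read_slot(void)
{
    return my_reserved_slot;   /* reads %r15 directly, no memory access */
}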
In this golfing answer I saw a trick where the return value comes from the second parameter, which is never passed in.
int f(i, j)
{
j = i;
}
int main()
{
return f(3);
}
From gcc's assembly output it looks like when the code copies j = i it stores the result in eax which happens to be the return value.
f:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $3, %edi
movl $0, %eax
call f
popq %rbp
ret
So, did this happen just by luck? Is this documented by gcc anywhere? It only works with -O0, but it works with a bunch of values of i that I tried, with -m32, and with a bunch of different versions of GCC.
gcc -O0 likes to evaluate expressions in the return-value register, if a register is needed at all. (GCC -O0 generally just likes to have values in the retval register, but this goes beyond picking that as the first temporary.)
I've tested a bit, and it really looks like GCC -O0 does this on purpose across multiple ISAs, sometimes even using an extra mov instruction or equivalent. IIRC I made an expression more complicated so the result of evaluation ended up in another register, but it still copied it back to the retval register.
Things like x++ that can (on x86) compile to a memory-destination inc or add won't leave the value in a register, but assignments typically will. So it's not quite as if GCC were treating function bodies like GNU C statement-expressions.
This is not documented, guaranteed, or standardized by anything. It's an implementation detail, not something intended for you to take advantage of like this.
"Returning" a value this way means you're programming in "GCC -O0", not C. The wording of the code-golf rules says that programs have to work on at least one implementation. But my reading of that is that they should work for the right reasons, not because of some side-effect implementation detail. They break on clang not because clang doesn't support some language feature, just because they're not even written in C.
Breaking with optimization enabled is also not cool; some level of UB is generally acceptable in code golf, like integer wraparound or pointer-casting type punning being things that one might reasonably wish were well-defined. But this is pure abuse of an implementation detail of one compiler, not a language feature.
I argued this point in comments under the relevant answer on the Codegolf.SE C golfing tips Q&A (which incorrectly claims it works beyond GCC). That answer has 4 downvotes (and deserves more IMO), but 16 upvotes, so some members of the community disagree that this is terrible and silly.
Fun fact: in ISO C++ (but not C), letting execution fall off the end of a non-void function is Undefined Behaviour, even if the caller doesn't use the result. This is true even in GNU C++; outside of -O0, GCC and clang will sometimes emit code like ud2 (illegal instruction) for a path of execution that reaches the end of a function without a return. So GCC doesn't in general define the behaviour here (which implementations are allowed to do for things that ISO C and C++ leave undefined, e.g. gcc -fwrapv defines signed overflow as 2's complement wraparound.)
But in ISO C, it's legal to fall off the end of a non-void function: it only becomes UB if the caller uses the return value. Without -Wall GCC may not even warn. Checking return value of a function without return statement
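To make that C rule concrete (hypothetical function names; a sketch, not taken from the linked Q&A):

/* In C, falling off the end of a non-void function is allowed; it only
   becomes undefined behaviour if the caller uses the missing return value. */
int maybe_value(int x)
{
    if (x > 0)
        return x;
    /* falls off the end when x <= 0 */
}

int main(void)
{
    maybe_value(-1);                /* fine: the value is never used */
    /* int y = maybe_value(-1); */  /* this would be UB: the value is used */
    return 0;
}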
With optimization disabled, function inlining won't happen so the UB isn't really compile-time visible. (Unless you use __attribute__((always_inline))).
Passing a 2nd arg merely gives you something to assign to. It's not important that it's a function arg. But i=i; optimizes away even with -O0 so you do need a separate variable. Also just i; optimizes away.
Fun fact: a recursive f(i){ f(i); } function body does bounce i through EAX before copying it to the first arg-passing register. So GCC just really loves EAX.
movl -4(%rbp), %eax
movl %eax, %edi
movl $0, %eax # without a full prototype, pass # of FP args in AL
call f
i++; doesn't load into EAX; it just uses a memory-destination add without loading into a register. Worth trying with gcc -O0 for ARM.
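If you want to poke at this yourself, here are a few variants to feed to gcc -O0 -S (a sketch using the same old-style parameter lists as the question, so expect the same implicit-int warnings; the EAX behaviour in the comments is the observed implementation detail discussed above, not anything guaranteed):

int assign(i, j)
{
    j = i;      /* at -O0 the assigned value ends up in EAX */
}

int self_assign(i)
{
    i = i;      /* optimized away even at -O0 */
}

int increment(i)
{
    i++;        /* memory-destination add; EAX is left alone */
}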
From Optimizing compiler on Wikipedia,
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources.
and GCC has a lot of optimization options.
I'd like to study the generated assembly (the one -S gives) after each optimization GCC performs when compiling with different flags like -O1, -O2, -O3, etc.
How can I do this?
Edit: My input will be C code.
The intermediate representation can be saved to files using the -fdump-tree-all switch.
There are more fine-grained -fdump switches available.
See the GCC manual for details.
To be able to read these representations, take a look into GCC internals manual.
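For example, with a small file like the one below (purely illustrative), gcc -O2 -c -fdump-tree-all foo.c writes a series of per-pass dump files next to the source; the names have the form foo.c.*t.<passname>, with numbering that varies between GCC versions.

/* foo.c - a tiny input whose tree dumps are easy to follow */
int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}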
gcc -S (Capital S)
gives asm output, but the assembler can change things so I prefer to just make an object
gcc -c -o myfile.o myfile.c
then disassemble
objdump -D myfile.o
Understand that this is not linked, so external branch destinations and other external addresses will have a placeholder instead of a real address. If you want to see the optimizations, compile with no optimization (-O0), then with -O1, -O2, and -O3, and see what, if anything, changes. There are other optimization flags you can play with as well. To see the difference you have to compile with and without the flags and compare the output yourself.
A plain diff won't be much use; you will quickly see why (register allocation changes ripple through the whole listing).
Whilst it is possible to take a small piece of code, compile it with -S and with a variety of options, the difficulty is understanding what actually changed. It only takes a small change to make code quite different - one variable going into a register means that register is no longer available for something, causing knock-on effects to all of the remaining code in the function.
I was comparing the same code from two nearly identical functions earlier today (for a question on C++), and there was ONE difference in the source code: which variable was used for the termination condition of one for-loop. That single change led to a large number of assembler lines changing, because the compiler decided to arrange the registers differently, using a different register for one of the main variables, and then everything else changed as a consequence.
I've seen cases where adding a small change to a function moves it from being inlined to not being inlined, which in turn makes big changes to ALL of the code in the program that calls that code.
So, yes, by all means, compile very simple code with different optimisations, and use -S to inspect the compiler generated code. Then compare the different variants to see what effect it has. But unless you are used to reading assembler code, and understand what you are actually looking for, it can often be hard to see the forest for the trees.
It is also worth considering that optimisation steps often work in conjunction - one step allows another step to do its work (inline leads to branch merging, register usage, and so on).
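A small sketch of that interplay (hypothetical functions): inlining exposes a constant, which lets constant folding remove the branch entirely.

static int square(int x)
{
    return x * x;
}

int caller(void)
{
    /* Once square() is inlined at -O2, the condition becomes 9 > 5,
       folds to a constant, and the branch disappears from the output. */
    if (square(3) > 5)
        return 1;
    return 0;
}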
Compile with the -S switch to get the assembly code. This should work for any optimization level.
For instance, to get the assembly code generated in O2 mode, try:
g++/gcc -S -O2 input.cpp
A corresponding input.s will be generated, containing the assembly code. Repeat this for any optimization level you want.
gcc and clang perform optimizations on their intermediate representations (IR), which can be printed after each optimization pass.
For gcc the switch is -fdump-tree-all (see http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html).
With clang it is -mllvm -print-after-all.
Clang and gcc offer many more options for analyzing the optimizations. It is easy to turn individual optimizations on or off from the command line (http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html, http://llvm.org/docs/Passes.html).
With clang/LLVM you can also list the optimization passes that were performed using the command-line option -mllvm -debug-pass=Structure.
If you would like to study compiler optimization and are agnostic to the compiler, then take a look at the Clang/LLVM projects. Clang is a C compiler that can output LLVM IR and LLVM commands can apply specific optimization passes individually.
Output LLVM IR:
clang test.c -S -emit-llvm -o test.ll
Perform optimization pass:
opt test.ll -<optimization_pass> -S -o test_opt.ll
Compile to assembly:
llc test.ll -o test.s
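For a concrete test.c to push through that pipeline, something as small as this is enough to watch individual passes (mem2reg, the loop passes, and so on) rewrite the IR; the function is just an illustrative example:

/* test.c - small enough that each pass's effect on the IR is easy to read */
int sum_squares(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i * i;
    return total;
}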
Solution 1:
gcc -O1 -S test.c (the capital O and capital S)
Solution 2:
An online compiler explorer such as Godbolt can also help you. You can use -O0, -O1, or whatever compiler options are suitable to get what you want.
Example from that site (tested with both solutions):
void maxArray(double* x, double* y) {
for (int i = 0; i < 65536; i++) {
if (y[i] > x[i]) x[i] = y[i];
}
}
Compiler Option -O0:
result:
maxArray(double*, double*):
pushq %rbp
movq %rsp, %rbp
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L5:
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1
ucomisd %xmm1, %xmm0
jbe .L3
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rcx
movq -32(%rbp), %rax
addq %rcx, %rax
movq (%rax), %rax
movq %rax, (%rdx)
.L3:
addl $1, -4(%rbp)
.L2:
cmpl $65535, -4(%rbp)
jle .L5
popq %rbp
ret
Compiler Option -O1:
result:
maxArray(double*, double*):
movl $0, %eax
.L5:
movsd (%rsi,%rax), %xmm0
ucomisd (%rdi,%rax), %xmm0
jbe .L2
movsd %xmm0, (%rdi,%rax)
.L2:
addq $8, %rax
cmpq $524288, %rax
jne .L5
rep; ret
I want to increment a TLS variable in assembly, but it gives a segmentation fault in the assembly code. I don't want to let the compiler change any other register or memory. Is there a way to do this without using gcc's input and output syntax?
__thread unsigned val;
int main() {
val = 0;
asm("incl %gs:val");
return 0;
}
If you really really need to be able to do this for some reason, you should access a thread-local variable from assembly language by preloading its address in C, like this:
__thread unsigned val;
void incval(void)
{
unsigned *vp = &val;
asm ("incl\t%0" : "+m" (*vp));
}
This is because the code sequence required to access a thread-local variable is different for just about every OS and CPU combination supported by GCC, and also varies if you're compiling for a shared library rather than an executable (i.e. with -fPIC). The above construct allows the compiler to emit the correct code sequence for you. In cases where it is possible to access the thread-local variable without any extra instructions, the address generation will be folded into the assembly operation. By way of illustration, here is how gcc 4.7 for x86/Linux compiles the above in several different possible modes (I've stripped out a bunch of assembler directives in all cases, for clarity)...
# -S -O2 -m32 -fomit-frame-pointer
incval:
incl %gs:val@ntpoff
ret
# -S -O2 -m64
incval:
incl %fs:val@tpoff
ret
# -S -O2 -m32 -fomit-frame-pointer -fpic
incval:
pushl %ebx
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
leal val@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT
incl (%eax)
popl %ebx
ret
# -S -O2 -m64 -fpic
incval:
.byte 0x66
leaq val@tlsgd(%rip), %rdi
.value 0x6666
rex64
call __tls_get_addr@PLT
incl (%rax)
ret
Do realize that all four examples would be different if I'd compiled for x86/OSX, and different yet again for x86/Windows.
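To see the thread-local behaviour end to end, here is a hedged test driver built around the incval above (it assumes x86 Linux; compile with something like gcc -O2 -pthread). Each thread gets its own copy of val, so each one prints 1:

#include <pthread.h>
#include <stdio.h>

__thread unsigned val;

void incval(void)
{
    unsigned *vp = &val;
    asm ("incl\t%0" : "+m" (*vp));
}

static void *worker(void *arg)
{
    (void)arg;
    incval();
    printf("thread-local val = %u\n", val);   /* prints 1 in each thread */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}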
I have the following piece of code that I wrote in C. It's fairly simple: it just right-shifts x on every iteration of the for loop.
int main() {
int x = 1;
for (int i = 0; i > -2; i++) {
x >> 2;
}
}
Now the strange thing is that when I compile it without any optimizations or with first-level optimization (-O), it runs just fine (I am timing the executable, and it's about 1.4 s with -O and 5.4 s without any optimizations).
But when I add the -O2 or -O3 switch and time the resulting executable, it doesn't stop (I have tested for up to 60 s).
Any ideas on what might be causing this?
The optimized build turns this into an infinite loop because you are relying on signed integer overflow. Signed integer overflow is undefined behavior in C and should not be relied upon: not only can it confuse developers, it can also be optimized away by the compiler.
Assembly (no optimizations): gcc -std=c99 -S -O0 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $1, -4(%rbp)
movl $0, -8(%rbp)
jmp L2
L3:
incl -8(%rbp)
L2:
cmpl $-2, -8(%rbp)
jg L3
movl $0, %eax
leave
ret
Assembly (optimized level 3): gcc -std=c99 -S -O3 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
L2:
jmp L2 #<- infinite loop
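For comparison, here is a sketch of a loop with the same trip count that doesn't rely on overflow at all: under wrapping semantics the original loop body runs INT_MAX + 1 times before i wraps past INT_MAX and the condition fails, so an unsigned counter with that bound is equivalent and terminates at every optimization level. (Alternatively, compiling the original code with gcc -fwrapv defines signed overflow as wraparound, which also makes it terminate.)

#include <limits.h>

int main(void)
{
    int x = 1;
    /* Same number of iterations as the original loop would have under
       wrapping semantics, but counted in unsigned arithmetic, so there is
       no undefined behaviour for the optimizer to exploit. */
    for (unsigned u = 0; u <= (unsigned)INT_MAX; u++)
        x >>= 2;            /* actually update x this time */
    return x;               /* return the result so the work is observable */
}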
You will get the definitive answer by looking at the binary that's produced (using objdump or something).
But as others have noted, this is probably because you're relying on undefined behaviour. Since signed overflow is undefined, the compiler is free to assume that i only ever increases, so the condition i > -2 always holds; it can eliminate the comparison entirely and convert this into an infinite loop.
Your code also has no observable side effects, so the compiler is free to optimise the entire program away to nothing if it likes.
Additional information about why integer overflows are undefined can be found here:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Search for the paragraph "Signed integer overflow".