This is more or less a request for clarification on Casting a function pointer to another type, which has this example code:
struct my_struct;
void my_callback_function(struct my_struct* arg);
void do_stuff(void (*cb)(void*));

static void my_callback_helper(void* pv)
{
    my_callback_function(pv);
}

int main()
{
    do_stuff(&my_callback_helper);
}
The answer says a "good" compiler should be able to optimize out the my_callback_helper() function, but I found no compiler at https://gcc.godbolt.org that does it: the helper function is always generated, even when it's just a jump to my_callback_function() (-O3):
my_callback_helper:
        jmp     my_callback_function
main:
        subq    $8, %rsp
        movl    $my_callback_helper, %edi
        call    do_stuff
        xorl    %eax, %eax
        addq    $8, %rsp
        ret
So my question is: Is there anything in the standard that prevents compilers from eliminating the helper?
There's nothing in the standard that directly prevents this optimization. But in practice it's not always possible for compilers, because they don't have the "full picture".

You have taken the address of my_callback_helper, so the compiler can't easily optimize it out: it doesn't know what do_stuff does with that pointer. In a separate module where do_stuff is defined, the compiler doesn't know that it could simply use/call my_callback_function in place of its argument (my_callback_helper). To optimize out my_callback_helper completely, the compiler would have to know what do_stuff does as well. But do_stuff is an external function whose definition isn't available to the compiler. So this sort of optimization may happen if you provide a definition for do_stuff and all its uses.
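To illustrate (a hypothetical variant, not the code from the question): if do_stuff is defined with internal linkage in the same translation unit, the compiler has the full picture and may fold the indirect call into a direct one, at which point the helper can disappear. Link-time optimization (e.g. gcc -flto) can give it a similar whole-program view across files.

/* Sketch: everything visible in one translation unit. */
struct my_struct { int dummy; };          /* given a body so the sketch links */

void my_callback_function(struct my_struct* arg)
{
    (void)arg;                            /* stand-in body for illustration */
}

static void my_callback_helper(void* pv)
{
    my_callback_function(pv);
}

static void do_stuff(void (*cb)(void*))  /* definition now visible */
{
    cb(0);   /* after inlining, this may become a direct call to
                my_callback_function, letting the helper be dropped */
}

int main()
{
    do_stuff(&my_callback_helper);
}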
Related
I'm trying to understand the implications of the System V AMD64 ABI for returning by value from a function.
For the following data type
struct Vec3{
    double x, y, z;
};
the type Vec3 is of class MEMORY and thus the following is specified by the ABI concerning "Returning of Values":
If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in %rdi as if it were the first argument to the function. In effect, this address becomes a “hidden” first argument. This storage must not overlap any data visible to the callee through other names than this argument.

On return %rax will contain the address that has been passed in by the caller in %rdi.
With this in mind, the following (silly) function:
struct Vec3 create(void);

struct Vec3 use(){
    return create();
}
could be compiled as:
use:
        jmp     create
In my opinion, tail-call optimization can be performed here, since the ABI assures us that create will place the value passed in %rdi into the %rax register.
However, none of the compilers (gcc, clang, icc) seem to perform this optimization (here on godbolt). The resulting assembly saves %rdi on the stack only to be able to move its value into %rax afterwards, for example gcc:
use:
        pushq   %r12
        movq    %rdi, %r12
        call    create
        movq    %r12, %rax
        popq    %r12
        ret
Neither for this minimal, silly function nor for more complicated ones from real life is tail-call optimization performed, which leads me to believe that I must be missing something that prohibits it.
Needless to say, for types of class SSE (e.g. only 2 doubles instead of 3), tail-call optimization is performed (at least by gcc and clang, live on godbolt):
struct Vec2{
    double x, y;
};

struct Vec2 create(void);

struct Vec2 use(){
    return create();
}
results in
use:
        jmp     create
Looks like a missed optimization bug that you should report, if there isn't already a duplicate open for gcc and clang.
(It's not rare for both gcc and clang to have the same missed optimization in cases like this; do not assume that something is illegal just because compilers don't do it. The only useful data is when compilers do perform an optimization: it's either a compiler bug or at least some compiler devs decided it was safe according to their interpretation of whatever standards.)
We can see GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is the missed optimization that's blocking tailcall optimization.
The ABI requires a function with a MEMORY-type return value to return the "hidden" pointer in RAX.¹
GCC/clang do already realize they can elide actual copying by passing along their own return-value space, instead of allocating fresh space. But to do tailcall optimization, they'd have to realize that they can leave their callee's RAX value in RAX, instead of saving their incoming RDI in a call-preserved register.
If the ABI didn't require returning the hidden pointer in RAX, I expect gcc/clang would have had no problem with passing along the incoming RDI as part of an optimized tailcall.
Generally compilers like to shorten dependency chains; that's probably what's going on here. The compiler doesn't know that the latency from rdi arg to rax result of create() is probably just one mov instruction. Ironically, this could be a pessimization if the callee saves/restores some call-preserved registers (like r12), introducing a store/reload of the return-address pointer. (But that mostly only matters if anything even uses it. I did get some clang code to do so, see below.)
Footnote 1: Returning the pointer sounds like a good idea, but almost invariably the caller already knows where it put the arg in its own stack frame and will just use an addressing mode like 8(%rsp) instead of actually using RAX. At least in compiler-generated code, the RAX return value will typically go unused. (And if necessary, the caller can always save it somewhere themselves.)
As discussed in What prevents the usage of a function argument as hidden pointer? there are serious obstacles to using anything other than space in the caller's stack frame to receive a retval.
Having the pointer in a register just saves an LEA in the caller if the caller wants to store the address somewhere (when it's a static or stack address).
However, this case is close to one where it would be useful. If we're passing along our own retval space to a child function, we might want to modify that space after the call. Then it is useful for easy access to that space, e.g. to modify a return value before we return.
#define T struct Vec3
T use2(){
    T tmp = create();
    tmp.y = 0.0;
    return tmp;
}
Efficient handwritten asm:
use2:
        callq   create
        movq    $0, 8(%rax)
        retq
Actual clang asm at least still uses return-value optimization, whereas GCC 9.1 makes a copy. (Godbolt)
# clang -O3
use2:                                   # @use2
        pushq   %rbx
        movq    %rdi, %rbx
        callq   create
        movq    $0, 8(%rbx)
        movq    %rbx, %rax
        popq    %rbx
        retq
This ABI rule perhaps exists specifically for this case, or maybe the ABI designers were picturing that the retval space might be newly-allocated dynamic storage (which the caller would have to save a pointer to if the ABI didn't provide it in RAX). I didn't try that case.
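As an aside, if you want the sibling call today, making the hidden pointer explicit sidesteps the issue entirely. A sketch only, with a hypothetical create_into out-parameter variant (not anything from the question or the ABI):

struct Vec3 { double x, y, z; };
void create_into(struct Vec3* out);

void use_into(struct Vec3* out)
{
    create_into(out);   /* 'out' simply stays in RDI and there's no RAX
                           bookkeeping, so compilers can emit a plain jmp
                           at -O2 */
}

Of course this changes the API; it's a workaround for the missed optimization, not a fix.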
The System V AMD64 ABI returns data from a function in registers RAX and RDX, or XMM0 and XMM1. Looking at Godbolt, the optimization seems to be based on size: the compiler will only return up to 2 doubles or 4 floats in registers.
Compilers miss optimizations all the time. The C language does not guarantee tail-call optimization, unlike Scheme. GCC and Clang have said that they have no plans to try to guarantee tail-call optimization. It sounds like the OP could ask the compiler developers, or open a bug with said compilers.
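That said, when a guaranteed tail call is the goal, recent Clang (13+) provides a musttail statement attribute that either forces the tail call or rejects the code. A minimal sketch with scalar types (my example, unrelated to the MEMORY-class case above):

int g(int x) { return x + 1; }   /* stand-in callee for illustration */

int h(int x)
{
    __attribute__((musttail)) return g(x);   /* guaranteed tail call in
                                                Clang, or a compile error */
}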
In this golfing answer I saw a trick where the return value is taken from the second parameter, which is not passed in.
int f(i, j)
{
    j = i;
}

int main()
{
    return f(3);
}
From gcc's assembly output, it looks like when the code copies j = i it stores the result in eax, which happens to be the return-value register.
f:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -4(%rbp)
        movl    %esi, -8(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, -8(%rbp)
        nop
        popq    %rbp
        ret
main:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    $3, %edi
        movl    $0, %eax
        call    f
        popq    %rbp
        ret
So, did this happen just by luck? Is this documented by gcc? It only works with -O0, but it works with a bunch of values of i I tried, with -m32, and with a bunch of different versions of GCC.
gcc -O0 likes to evaluate expressions in the return-value register, if a register is needed at all. (GCC -O0 generally just likes to have values in the retval register, but this goes beyond picking that as the first temporary.)
I've tested a bit, and it really looks like GCC -O0 does this on purpose across multiple ISAs, sometimes even using an extra mov instruction or equivalent. IIRC I made an expression more complicated so the result of evaluation ended up in another register, but it still copied it back to the retval register.
Things like x++ that can (on x86) compile to a memory-destination inc or add won't leave the value in a register, but assignments typically will. So it's not quite like GCC is treating function bodies as GNU C statement-expressions.
This is not documented, guaranteed, or standardized by anything. It's an implementation detail, not something intended for you to take advantage of like this.
"Returning" a value this way means you're programming in "GCC -O0", not C. The wording of the code-golf rules says that programs have to work on at least one implementation. But my reading of that is that they should work for the right reasons, not because of some side-effect implementation detail. They break on clang not because clang doesn't support some language feature, just because they're not even written in C.
Breaking with optimization enabled is also not cool; some level of UB is generally acceptable in code golf, like integer wraparound or pointer-casting type punning being things that one might reasonably wish were well-defined. But this is pure abuse of an implementation detail of one compiler, not a language feature.
I argued this point in comments under the relevant answer on the Codegolf.SE C golfing tips Q&A (which incorrectly claims it works beyond GCC). That answer has 4 downvotes (and deserves more, IMO), but 16 upvotes. So some members of the community disagree that this is terrible and silly.
Fun fact: in ISO C++ (but not C), having execution fall off the end of a non-void function is Undefined Behaviour, even if the caller doesn't use the result. This is true even in GNU C++; outside of -O0, GCC and clang will sometimes emit code like ud2 (an illegal instruction) for a path of execution that reaches the end of a function without a return. So GCC doesn't in general define the behaviour here (which implementations are allowed to do for things that ISO C and C++ leave undefined; e.g. gcc -fwrapv defines signed overflow as 2's-complement wraparound).
But in ISO C, it's legal to fall off the end of a non-void function: it only becomes UB if the caller uses the return value. Without -Wall, GCC may not even warn. See Checking return value of a function without return statement.
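A minimal sketch of that C rule (my own example, not from the linked question):

int maybe_value(int x)
{
    if (x > 0)
        return x;
    /* falling off the end here is legal in C */
}

int main(void)
{
    maybe_value(-1);                 /* fine: the return value is never used */
    /* int y = maybe_value(-1); */   /* that use would be undefined behaviour */
    return 0;
}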
With optimization disabled, function inlining won't happen so the UB isn't really compile-time visible. (Unless you use __attribute__((always_inline))).
Passing a 2nd arg merely gives you something to assign to. It's not important that it's a function arg. But i=i; optimizes away even with -O0 so you do need a separate variable. Also just i; optimizes away.
Fun fact: a recursive f(i){ f(i); } function body does bounce i through EAX before copying it to the first arg-passing register. So GCC just really loves EAX.
        movl    -4(%rbp), %eax
        movl    %eax, %edi
        movl    $0, %eax        # without a full prototype, pass # of FP args in AL
        call    f
i++; doesn't load into EAX; it just uses a memory-destination add without loading into a register. Worth trying with gcc -O0 for ARM.
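Putting the variants side by side, here's a hypothetical test file for gcc -O0 on Godbolt. Every "return value" below is an accident of -O0 code generation, not behavior to rely on:

int assign(int i, int j) { j = i; }   /* result typically left in EAX */
int self_assign(int i)   { i = i; }   /* optimized away even at -O0 */
int plain_expr(int i)    { i; }       /* also optimized away */
int mem_dest(int i)      { i++; }     /* memory-destination add; no EAX load */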
I read that a compiler may not perform inlining when there is no "return" statement in the function body, and also when the return type is something other than void. If it's true that inlining cannot happen for functions that return something other than void, why would a "return" statement be needed to make a function inline? Assume simple code as below:
Here the function declared as inline does not have a "return" statement in its body. Does inlining happen here? Is there any way to know whether the inline request has been accepted and acted on by the compiler?
#include <stdio.h>

inline void call()
{
    printf("*****In call*****\n");
}

int main()
{
    call();
}
This is obviously a compiler-specific question, but since you are using gcc, here is what gcc produces:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    $.LC0, %edi
        call    puts
        xorl    %eax, %eax
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
where .LC0 is your hardcoded string (complete assembly dump). As you can see there is no call to call here, so yes, gcc does inline this call with -O2.
I read that a compiler may not perform inlining when there is no "return" statement in the function body
This is not true at all. The compiler can certainly inline void functions; whether it actually inlines any given function is up to it, even if you specify the inline keyword.
Just see here: https://goo.gl/xEg6AK
This is the generated assembly:
.LC0:
        .string "*****In call*****"
main:
        subq    $8, %rsp
        movl    $.LC0, %edi
        call    puts
        movl    $0, %eax
        addq    $8, %rsp
        ret
GCC does inline your code when compiled with -O. It also replaces the printf call with a simple puts.
The GCC compiler provides a number of extensions to the standard C language that allow you to add 'attributes' to functions (as well as to types and variables, which are not relevant here). One of these is __attribute__((always_inline)), which overrides the compiler's inlining algorithm.
#include <stdio.h>

inline void __attribute__((always_inline)) call()
{
    printf("*****In call*****\n");
}

int main()
{
    call();
}
In this case, the instruction(s) to make the call to the printf library routine will be inlined at the point the call function appears in the calling code.
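For contrast, GCC also has the opposite attribute. A small sketch (my own example, not from the question) that forces an out-of-line call, which makes the difference easy to spot in the generated assembly:

#include <stdio.h>

static void __attribute__((noinline)) call(void)
{
    printf("*****In call*****\n");
}

int main(void)
{
    call();   /* stays a real call instruction even at -O2 */
    return 0;
}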
Essentially, if I have code:
int main(void){
    foo(1,3);
}
Where foo is:
void foo(int x, int y){
    if(x==0) return;
    else if (x==1){
        if(y==0) printf("hello, world");
        else if (y==2) printf("goodbye.");
        else if (y==3) printf("no.");
        else return;
    }
    else return;
}
Will the conditionals (assuming they apply) be evaluated at run time, or will the 'printf' statements in this case simply be compiled into the executable, with the compiler essentially evaluating the conditionals itself?
Will the conditionals (assuming they apply) be evaluated at run time, or will the 'printf' statements in this case simply be compiled into the executable, with the compiler essentially evaluating the conditionals itself?
The compiler is free to emit whatever code it wants, as long as the observable behavior remains the same (the "as-if" rule). Most compilers have configurable optimization levels which control how aggressively they transform the source code. In the case of gcc, the relevant flag is -Ox.
The only way to see what code is emitted is to inspect it yourself. In the case of gcc you can use the -S flag, which outputs the generated assembler.
For your program, gcc -O0 -S opt.c (no optimizations) yields the following:
main:
.LFB1:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    $3, %esi
        movl    $1, %edi
        call    foo              # <---
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
Whereas gcc -O1 -S opt.c and higher optimization levels result in:
.LC2:
        .string "no."
(...)
main:
.LFB12:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    $.LC2, %edi
        movl    $0, %eax
        call    printf           # <----
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
The compiler does not interpret the code in function foo(); it generates the code for the ifs and printf()s into the function's body.
There are several reasons it doesn't do that. One of them is the linkage of the function: it is not declared static, and that means it may be used from other .c files; the compiler cannot just guess what the values of its arguments will be at any given call.
And calling it with different arguments to output different things is the reason you wrote the function in the first place.
Depending on the compiler and the optimization switches you use when you invoke it, it can inline the call to foo(1,3). Inlining means the compiler replaces the call to the function with the code of the function's body. In this case it can optimize the inlined code, because it knows the values of the arguments and can tell which printf() runs; it removes the ifs and the other printf()s as dead code, and instead of a call to foo(1,3) it generates the code for printf("no.");. But this can happen only because the arguments of your function call are constants (i.e. they are known at compile time).
However, even in this case, the code for the function is still generated.
If the call foo(1,3); is the only call to the function and the compiler is able to inline it, the function's code will be removed by the linker (it is never referenced) when it generates the final executable.
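For example (a sketch, assuming -O2 and a single translation unit): giving foo internal linkage removes the linkage obstacle, so the compiler itself may drop the out-of-line body once the only call has been inlined:

#include <stdio.h>

static void foo(int x, int y)   /* static: not usable from other .c files */
{
    if (x == 1 && y == 3)
        printf("no.");
}

int main(void)
{
    foo(1, 3);   /* typically folds to a direct printf/puts of "no." */
    return 0;
}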
Check the command line switches of your compiler for optimization flags. Also check how you can instruct it to generate an assembly file (with comments) to see what code it generates (you can see there if it inlines the call to foo(1,3) or not).
Disassembling the call to foo gives something like:
011714AE  push        3
011714B0  push        1
011714B2  call        foo (11711D6h)
which means the arguments are pushed onto the stack first and then read back (relative to esp) inside the callee, so the conditionals are evaluated at run time.
Consider the two slightly different versions of the same code:
struct s
{
    int dummy[1];
};

volatile struct s s;

int main(void)
{
    s;
    return 0;
}
and
struct s
{
    int dummy[16];
};

volatile struct s s;

int main(void)
{
    s;
    return 0;
}
Here's what I'm getting with gcc 4.6.2 for them:
_main:
        pushl   %ebp
        movl    %esp, %ebp
        andl    $-16, %esp
        call    ___main
        movl    _s, %eax
        xorl    %eax, %eax
        leave
        ret
        .comm   _s, 4, 2
and
_main:
        pushl   %ebp
        movl    %esp, %ebp
        andl    $-16, %esp
        call    ___main
        xorl    %eax, %eax
        leave
        ret
        .comm   _s, 64, 5
Please note the absence of access to s in the second case.
Is it a compiler bug, or am I just dealing with the following statement of the C standard, with the gcc developers simply choosing such a weird flavor of implementation-definedness while still playing by the rules?:
What constitutes an access to an object that has volatile-qualified type is implementation-defined.
What would be the reason for this difference? I'd naturally expect the whole structure to be accessed (or not accessed, I'm not sure), irrespective of its size and of what's inside it.
P.S. What does your compiler (non-gcc or newer gcc) do in this case? (please answer this last question in a comment if that's the only part you're going to address, as this isn't the main question being asked, but more of a curiosity question).
There is a difference between C and C++ on this question, which explains what's going on.
clang-3.4
When compiling these snippets as C++, the emitted assembly didn't reference s in either case. In fact, a warning was issued for both:
volatile.c:8:2: warning: expression result unused; assign into a variable to force a volatile load [-Wunused-volatile-lvalue]
s;
These warnings were not issued when compiling in C99 mode. As mentioned in this blog post and this GCC wiki entry from the question comments, using s in this context causes an lvalue-to-rvalue conversion in C, but not in C++. This is confirmed by examining the Clang AST for C, as there is an ImplicitCastExpr from LvalueToRValue, which does not exist in the AST generated from C++. (The AST is not affected by the size of the struct).
A quick grep of the Clang source reveals this in the emission of aggregate expressions:
case CK_LValueToRValue:
    // If we're loading from a volatile type, force the destination
    // into existence.
    if (E->getSubExpr()->getType().isVolatileQualified()) {
        EnsureDest(E->getType());
        return Visit(E->getSubExpr());
    }
EnsureDest forces the emission of a stack slot, sized and typed for the expression. As the optimizers are not allowed to remove volatile accesses, they remain as a scalar load/store and a memcpy respectively in both the IR and output asm. This is the behavior I would expect, given the above.
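Following the warning's advice, assigning the volatile object into a temporary forces the load in C. A sketch of the portable spelling (my example, not from the question):

struct s
{
    int dummy[16];
};

volatile struct s s;

int main(void)
{
    struct s tmp = s;   /* whole-object copy: the volatile access is emitted */
    (void)tmp;
    return 0;
}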
gcc-4.8.2
Here I observe the same behavior as in the question. However, when I change the expression from s; to s.dummy;, the access does not appear in either version. I'm not as familiar with the internals of gcc as I am with LLVM, so I can't speculate on why this happens. But based on the above observations, I would say this is a compiler bug, due to the inconsistency.