Suppose I have the following C code:
int i = 5;
int j = 10;
int result = i + j;
If I'm looping over this many times, would it be faster to use int result = 5 + 10? I often create temporary variables to make my code more readable, for example, if the two variables were obtained from some array using some long expression to calculate the indices. Is this bad performance-wise in C? What about other languages?
A modern optimizing compiler should optimize those variables away. For example, if we build the following example on godbolt with gcc, using the -std=c99 -O3 flags (see it live):
#include <stdio.h>

void func()
{
    int i = 5;
    int j = 10;
    int result = i + j;
    printf("%d\n", result);
}
it will result in the following assembly:
movl $15, %esi
for the calculation of i + j; this is a form of constant propagation.
Note: I added the printf so that we have a side effect; otherwise func would have been optimized away to:
func:
rep ret
These optimizations are allowed under the as-if rule, which only requires the compiler to emulate the observable behavior of a program. This is covered in the draft C99 standard section 5.1.2.3 Program execution which says:
In the abstract machine, all expressions are evaluated as specified by
the semantics. An actual implementation need not evaluate part of an
expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
Also see: Optimizing C++ Code : Constant-Folding
This is an easy optimization for an optimizing compiler: it will delete all the variables and replace result with 15.
Constant folding in SSA form is pretty much the most basic optimization there is.
The example you gave is easy for a compiler to optimize. Using local variables to cache values pulled out of global structures and arrays can actually speed up execution of your code. If for instance you are fetching something from a complex structure inside a for loop where the compiler can't optimize and you know the value isn't changing, the local variables can save quite a bit of time.
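As a hedged sketch of that caching pattern (the struct, global_cfg, and opaque() here are invented for illustration):

#include <stddef.h>

struct config { int scale; /* ... many more fields ... */ };
extern struct config *global_cfg;  /* may be modified elsewhere */
extern int opaque(int v);          /* defined in another translation unit */

long sum_scaled(const int *a, size_t n)
{
    long total = 0;
    int scale = global_cfg->scale;  /* cached once: we know opaque() won't change it */
    for (size_t i = 0; i < n; i++)
        total += opaque(a[i]) * scale;  /* without the local, the compiler would
                                           have to re-load global_cfg->scale on
                                           every iteration, since opaque() could
                                           have modified it */
    return total;
}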
You can use GCC (other compilers too) to generate the intermediate assembly code and see what the compiler is actually doing.
There is discussion of how to turn on the assembly listings here: Using GCC to produce readable assembly?
It can be instructive to examine the generated code and see what a compiler is actually doing.
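For example, with a hypothetical source file func.c (-S, -O2, and -fverbose-asm are standard GCC options; -fverbose-asm annotates the output with the source-level variables each instruction relates to):

gcc -S -O2 -fverbose-asm -o func.s func.c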
While all sorts of trivial differences to the code can perturb the compiler's behavior in ways that mildly improve or worsen performance, in principle it should not make any performance difference whether you use temporary variables like this, as long as the meaning of the program is not changed. A good compiler should generate the same, or comparable, code either way, unless you're intentionally building with optimization off in order to get machine code that's as close as possible to the source (e.g. for debugging purposes).
You're suffering the same problem I do when I'm trying to learn what a compiler does: you make a trivial program to demonstrate the problem, examine the assembly output of the compiler, and realize that the compiler has optimized away everything you tried to get it to do. You may find even a rather complex operation in main() reduced to essentially:
push "%i"
push 42
call printf
ret
Your original question is not "what happens with int i = 5; int j = 10...?" but "do temporary variables generally incur a run-time penalty?"
The answer is probably not. But you'd have to look at the assembly output for your particular, non-trivial code. If your CPU has a lot of registers, like an ARM, then i and j are very likely to be in registers, just the same as if those registers were storing the return value of a function directly. For example:
int i = func1();
int j = func2();
int result = i + j;
will almost certainly compile to exactly the same machine code as:
int result = func1() + func2();
I suggest you use temporary variables if they make the code easier to understand and maintain, and if you're really trying to tighten a loop, you'll be looking into the assembly output anyway to figure out how to finesse as much performance out as possible. But don't sacrifice readability and maintainability for a few nanoseconds, if that's not necessary.
Say you have (for reasons that are not important here) the following code:
int k = 0;
... /* no change to k can happen here */
if (k) {
    do_something();
}
Using the -O2 flag, GCC will not generate any code for it, recognizing that the if test is always false.
I'm wondering if this is a pretty common behaviour across compilers or it is something I should not rely on.
Does anybody know?
Dead code elimination in this case is trivial for any modern optimizing compiler. I would definitely rely on it, given that optimizations are turned on and you are absolutely sure that the compiler can prove the value is zero at the moment of the check.
However, you should be aware that sometimes your code has more potential side effects than you think.
The first source of problems is calls to non-inlined functions. Whenever you call a function which is not inlined (e.g. because its definition is located in another translation unit), the compiler assumes that all global variables and the whole contents of the heap may change inside that call. Local variables are the lucky exception, because the compiler knows that it is illegal to modify them indirectly... unless you save the address of a local variable somewhere. For instance, in this case the dead code won't be eliminated:
#include <cstdio>

int function_with_unpredictable_side_effects(const int &x);

void doit() {
    int k = 0;
    function_with_unpredictable_side_effects(k);  // receives a reference to k;
                                                  // since k is not a const object,
                                                  // the callee could legally cast
                                                  // away const and write to it
    if (k)
        printf("Never reached\n");
}
So the compiler has to do some work, and may fail even for local variables. By the way, I believe the problem being solved in this case is called escape analysis.
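For comparison, here is a hedged sketch of the global-variable case from the first paragraph (the names g and external_call are invented): the branch survives, because the callee might legally modify g.

#include <stdio.h>

int g;                     /* global: visible to other translation units */
void external_call(void);  /* not inlined; might write to g */

void doit_global(void)
{
    g = 0;
    external_call();       /* the compiler must assume g may have changed */
    if (g)                 /* not provably false, so the branch is kept */
        printf("Possibly reached\n");
}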
The second source of problems is pointer aliasing: the compiler has to take into account that all sorts of pointers and references in your code may point to the same object, so changing something via one pointer may change the contents seen through another. Here is an example:
#include <stdio.h>

struct MyArray {
    int num;
    int arr[100];
};

void doit(int idx) {
    MyArray x;
    x.num = 0;
    x.arr[idx] = 7;  // with idx == -1, this would overwrite x.num
    if (x.num)
        printf("Never reached\n");
}
The Visual C++ compiler does not eliminate the dead code, because it assumes that you may access x.num as x.arr[-1]. That may sound like an awful thing to do, but this compiler has been used in game development for years, and such hacks are not uncommon there, so the compiler stays on the safe side. GCC, on the other hand, removes the dead code; maybe that is related to its exploitation of the strict aliasing rule.
P.S. The const keyword is never used by the optimizer; it is only present in the C/C++ language for the programmer's convenience.
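To illustrate why const gives the optimizer so little (a minimal sketch, function name invented): a pointer-to-const only promises that *p won't be modified through p, not that the pointed-to object is immutable.

int f(const int *p, int *q)
{
    int a = *p;
    *q = 1;         /* p and q may alias, and the pointee of p is not
                       necessarily a const object, so this may change *p */
    return a + *p;  /* *p must be re-loaded despite the const */
}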
There is no behaviour you can rely on to be common across compilers, but there is a way to explore how different compilers act on a specific piece of code.
Compiler Explorer will help you answer any question about code generation, though of course you must be familiar with assembly language.
I don't understand exactly the following:
When using debugging and optimization together, the internal rearrangements carried out by the optimizer can make it difficult to see what is going on when examining an optimized program in the debugger. For example, the ordering of statements may be changed.
What I understand is: when I build a program with the -g option, the executable will contain a symbol table with variable and function names, references to them, and their line numbers.
And when I build with an optimization option, the ordering of instructions may be changed, depending on the optimization.
What I don't understand is why debugging is more difficult.
I would like to see an example, and an easy to understand explanation.
An example of what might happen:
int calc(int a, int b)
{
    return a << b + 7;  /* note: parses as a << (b + 7) */
}

int main()
{
    int x = 5;
    int y = 7;
    int val = calc(x, y);
    return val;
}
Optimized, this might be the same as:
int main()
{
    return 81920;  /* 5 << (7 + 7) */
}
A contrived example, but trying to debug that kind of optimization in actual code isn’t simple. Some debuggers may show all lines of code marked when stepping through, some might skip them all, some may be confused. And the developer at least is.
A simple example:
int a = 4;
int b = a;
int c = b;
printf("%d", c);
can be optimized as:
printf("%d", 4);
In fact, in optimized builds the compiler might well do exactly this (in machine code, of course).
When debugging, the debugger lets us inspect the memory associated with a, b, and c, but once the top version is optimized into the bottom one, a, b, and c no longer exist in RAM, which makes it much harder to inspect the program's state and figure out what is going on.
When you compile with an optimization flag, you are assured that the observable behavior of the program will match the code you wrote, but the generated code itself will differ from a straightforward translation of your source.
As you pointed out, statements may be reordered and some calls performed differently. Other optimizations include loop unrolling, branch prediction hints, and simplification of function calls. These optimizations also vary with the architecture you are targeting.
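As a hedged illustration of the loop unrolling mentioned above (the function is invented): the optimizer may rewrite the loop so that no instruction corresponds line-by-line to the source, which is exactly what makes single-stepping confusing.

int sum4(const int *a)
{
    int s = 0;
    for (int i = 0; i < 4; i++)  /* the compiler may unroll this entirely, */
        s += a[i];               /* effectively computing                  */
    return s;                    /* s = a[0] + a[1] + a[2] + a[3];         */
}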
For all these reasons (and others), your code may become very difficult to debug, since what the compiler actually does is opaque to you, meaning that the code you are debugging may not look like the code you wrote.
Disclaimer: The following is a purely academic question; I keep this code at least 100 m away from any production system. The problem posed here is something that cannot be measured in any “real life” case.
Consider the following code (godbolt link):
#include <stdlib.h>
typedef int (*func_t)(int *ptr); // functions must conform to this interface
extern int uses_the_ptr(int *ptr);
extern int doesnt_use_the_ptr(int *ptr);
int foo() {
    // actual selection is complex, there are multiple functions,
    // but I know `func` will point to a function that doesn't use the argument
    func_t func = doesnt_use_the_ptr;

    // pick exactly one of the following three declarations:
    int *unused_ptr_arg = NULL; // I pay a zeroing (e.g. `xor reg reg`) in every compiler
    int *unused_ptr_arg; // UB, gcc zeroes (thanks for saving me from myself, gcc), clang doesn't
    int *unused_ptr_arg __attribute__((__unused__)); // neither zeroing nor UB, this is what I want

    return (*func)(unused_ptr_arg);
}
The compiler has no reasonable way to know that unused_ptr_arg is unneeded (and so the zeroing is wasted time), but I do, so I want to inform the compiler that unused_ptr_arg may have any value, such as whatever happens to be in the register that would be used for passing it to func.
Is there a way to do this? I know I’m way outside the standard, so I’ll be fine with compiler-specific extensions (especially for gcc & clang).
Using GCC/Clang `asm` Construct
In GCC and Clang, and other compilers that support GCC’s extended assembly syntax, you can do this:
int *unused_ptr_arg;
__asm__("" : "=x" (unused_ptr_arg));
return (*func)(unused_ptr_arg);
That __asm__ construct says “Here is some assembly code to insert into the program at this point. It writes a result to unused_ptr_arg in whatever location you choose for it.” (The x constraint means the compiler may choose memory, a processor register, or anything else the machine supports.) But the actual assembly code is empty (""). So no assembly code is generated, but the compiler believes that unused_ptr_arg has been initialized. In Clang 6.0.0 and GCC 7.3 (latest versions currently at Compiler Explorer) for x86-64, this generates a jmp with no xor.
Using Standard C
Consider this:
int *unused_ptr_arg;
(void) &unused_ptr_arg;
return (*func)(unused_ptr_arg);
The purpose of (void) &unused_ptr_arg; is to take the address of unused_ptr_arg, even though the address is not used. This disables the rule in C 2011 [N1570] 6.3.2.1 2 that says behavior is undefined if a program uses the value of an uninitialized object of automatic storage duration that could have been declared with register. Because its address is taken, it could not have been declared with register, and therefore using the value is no longer undefined behavior according to this rule.
In consequence, the object has an indeterminate value. Then there is an issue of whether pointers may have a trap representation. If pointers do not have trap representations in the C implementation being used, then no trap will occur due to merely referring to the value, as when passing it as an argument.
The result with Clang 6.0.0 at Compiler Explorer is a jmp instruction with no setting of the parameter register, even if -Wall -Werror is added to the compiler options. In contrast, if the (void) line is removed, a compiler error results.
int *unused_ptr_arg = NULL;
This is what you should be doing. You don't pay anything for it: zeroing a pointer is a no-op. OK, technically it's not, but practically it is. You will never see the time of this operation in your program. I don't mean that it's so small you won't notice it; I mean that it's so small that many other factors and operations, orders of magnitude longer, will swallow it.
This is not actually possible across all architectures, for a very good reason.
A call to a function may need to spill its arguments to the stack, and on IA-64, spilling uninitialized registers to the stack can crash, because the previous contents of the register may be a speculative load from an address that wasn't mapped.
To avoid paying for the zeroing on each call of int foo(), simply make unused_ptr_arg static:
int foo() {
    func_t func = doesnt_use_the_ptr;
    static int *unused_ptr_arg;  /* zero-initialized once, at program startup */
    return (*func)(unused_ptr_arg);
}
Let's say I have the following function to initialize a data structure:
void init_data(struct data *d) {
    d->x = 5;
    d->y = 10;
}
However, with the following code:
struct data init_data(void) {
    struct data d = { 5, 10 };
    return d;
}
Wouldn't this be optimized away due to copy elision and be just as performant as the former version?
I tried to do some tests on godbolt to see if the assembly was the same, but when using any optimization flags everything was always entirely optimized away, with nothing left but something like movabsq $42949672965, %rax (that constant is 0xA00000005, i.e. the two fields {5, 10} packed into one 64-bit register), and I am not sure if the same would happen in real code.
The first version I provided seems to be very common in C libraries, and I do not understand why as they should be both just as fast with RVO, with the latter requiring less code.
The main reason the first is so common is historic. The second way of initializing structures from literals was not standard (well, it was, but only for static initializers, never for automatic variables), and it was never allowed in assignments (though I've not checked the status of the most recent standards). Even in ancient C, a simple assignment such as:
struct A a, b;
...
a = b; /* this was not allowed a long time ago */
was not accepted at all.
So, in order to be able to compile code on every platform, you had to write it the old way, since modern compilers normally accept legacy code, while the opposite (old compilers accepting new code) is not possible.
And this also applies to returning structures or passing them by value. Apart from normally being a huge waste of resources (it is common to see the whole structure copied onto the stack, or copied back to the proper place once the function returns), old compilers didn't accept these constructs either, so to be portable you had to avoid them.
Finally, a comment: don't use your compiler to check whether both constructs generate the same code. It probably does, but you would come away with the wrong assumption that this is universal, and you would run into errors. A different implementation can (and is allowed to) translate them differently and produce different code.
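For the record, modern C has long since closed the gaps noted above; a small sketch with a made-up struct (automatic initializers are C89, assignment from a compound literal is C99):

struct pair { int x, y; };

void modern(void)
{
    struct pair p = { 5, 10 };   /* automatic initializer: legal since C89 */
    p = (struct pair){ 7, 14 };  /* compound literal assignment: C99 */
}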
Consider the following:
volatile uint32_t i;
How do I know if gcc did or did not treat i as volatile? It would be declared as such because no nearby code is going to modify it, and modification of it is likely due to some interrupt.
I am not the world's worst assembly programmer, but I play one on TV. Can someone help me to understand how it would differ?
If you take the following stupid code:
#include <stdio.h>
#include <inttypes.h>
volatile uint32_t i;
int main(void)
{
    if (i == 64738)
        return 0;
    else
        return 1;
}
Compile it to object format and disassemble it via objdump, then do the same after removing 'volatile': there is no difference (according to diff). Is the volatile declaration just too close to where it's checked or modified, or should I just always use some atomic type when declaring something volatile? Do some optimization flags influence this?
Note: my stupid sample does not fully match my question; I realize this. I'm only trying to find out whether gcc did or did not treat the variable as volatile, so I'm studying small dumps to try to find the difference.
Many compilers in some situations don't treat volatile the way they should. See this paper if you deal with volatile much, to avoid nasty surprises: Volatiles Are Miscompiled, and What to Do about It. It also contains a pretty good description of volatile, backed by quotations from the standard.
To be 100% sure, and for such a simple example, check out the assembly output.
Try setting the variable outside a loop and reading it inside the loop. In the non-volatile case, the compiler might (or might not) shove it into a register or turn it into a compile-time constant before the loop, since it "knows" it's not going to change, whereas if it's volatile it must read it from memory every time through the loop.
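A minimal sketch of that experiment (the names are invented): build it once as written and once without the volatile qualifier, and compare the assembly.

#include <stdint.h>

volatile uint32_t flag;  /* try removing `volatile` and recompiling */

uint32_t spin_sum(void)
{
    uint32_t total = 0;
    for (int n = 0; n < 1000; n++)
        total += flag;   /* volatile: one load per iteration is mandatory;
                            non-volatile: the load may be hoisted out and the
                            loop reduced to total = flag * 1000 */
    return total;
}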
Basically, when you declare something as volatile, you're telling the compiler not to make certain optimizations. If it decided not to make those optimizations anyway, you can't tell whether that was because the variable was declared volatile, because it needed the registers for something else, or because it didn't notice it could turn the value into a compile-time constant.
As far as I know, volatile is a directive to the optimizer. For example, if your code looked like this:
int foo() {
    int x = 0;
    while (x);
    return 42;
}
The "while" loop would be optimized out of the binary.
But if you define x as volatile (i.e. volatile int x = 0;), the compiler will leave the loop alone.
Your little sample is inadequate to show anything. The difference between a volatile variable and one that isn't is that each load or store of a volatile variable in the source has to generate precisely one load or store in the executable, whereas the compiler is free to optimize away loads or stores of non-volatile variables. If you're getting one load of i in your sample, that's what I'd expect for both volatile and non-volatile.
To show a difference, you're going to have to have redundant loads and/or stores. Try something like
int i = 5;
int j = i + 2;
i = 5;
i = 5;
printf("%d %d\n", i, j);
changing i between non-volatile and volatile. You may have to enable some level of optimization to see the difference.
The code there has three stores and two loads of i, which can be optimized away to one store and probably one load if i is not volatile. If i is declared volatile, all stores and loads should show up in the object code in order, no matter what the optimization. If they don't, you've got a compiler bug.
It should always treat it as volatile.
The reason the code is the same is that volatile just instructs the compiler to load the variable from memory each time it is accessed. Even with optimization on, the compiler still needs to load i from memory once in the code you've written, because it can't infer the value of i at compile time. If you accessed it repeatedly, you'd see a difference.
Any modern compiler has multiple stages. One fairly easy yet interesting question is whether the declaration of the variable itself was parsed correctly. This is easy because the C++ name mangling should differ depending on the volatile-ness. Hence, if you compile twice, once with volatile defined away, the symbol tables should differ slightly.
Read the standard before you misquote or downvote. Here's a quote from n2798:
7.1.6.1 The cv-qualifiers
7 Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C.
The keyword volatile acts as a hint, much like the register keyword. However, volatile asks the compiler to keep all its optimizations at bay: it won't keep a copy of the variable in a register (to speed up access) but will fetch it from memory every time you request it.
Since there is so much confusion, some more: the C99 standard does in fact say that a volatile-qualified object must be looked up every time it is read, and so on, as others have noted. But another section says that what constitutes a volatile access is implementation-defined. So a compiler that knows the hardware inside out may know, for example, that an automatic volatile-qualified variable whose address is never taken cannot sit in a sensitive region of memory, and might ignore the hint and optimize it away.
This keyword finds use in setjmp/longjmp style error handling. The only thing you have to bear in mind is that you supply the volatile keyword when you think the variable may change behind the compiler's back. That is, you could take an ordinary object and manage with a few casts.
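A minimal sketch of the setjmp/longjmp case (names invented): a local that is modified between setjmp() and the corresponding longjmp() must be volatile, or its value after the jump back is indeterminate (C99 7.13.2.1).

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void fail(void) { longjmp(env, 1); }

int main(void)
{
    volatile int attempts = 0;  /* without volatile, indeterminate after longjmp */
    if (setjmp(env) == 0) {
        attempts = attempts + 1;  /* modified between setjmp and longjmp */
        fail();
    }
    printf("attempts = %d\n", attempts);  /* reliably 1 only with volatile */
    return 0;
}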
Another thing to keep in mind is that the definition of what constitutes a volatile access is left by the standard to the implementation.
If you really want different assembly, compile with optimization enabled.