Let's say I have pseudocode like this:
main() {
BOOL b = get_bool_from_environment(); //get it from a file, network, registry, whatever
while(true) {
do_stuff(b);
}
}
do_stuff(BOOL b) {
if(b)
path_a();
else
path_b();
}
Now, since we know that the external environment can influence get_bool_from_environment() to potentially produce either a true or false result, then we know that the code for both the true and false branches of if(b) must be included in the binary. We can't simply omit path_a(); or path_b(); from the code.
BUT -- we only set BOOL b the one time, and we always reuse the same value after program initialization.
If I were to make this valid C code and then compile it using gcc -O0, the if(b) would be repeatedly evaluated on the processor each time that do_stuff(b) is invoked, which inserts what are, in my opinion, needless instructions into the pipeline for a branch that is basically static after initialization.
If I were to assume that I actually had a compiler that was as stupid as gcc -O0, I would re-write this code to include a function pointer, and two separate functions, do_stuff_a() and do_stuff_b(), which don't perform the if(b) test, but simply go ahead and perform one of the two paths. Then, in main(), I would assign the function pointer based on the value of b, and call that function in the loop. This eliminates the branch, though it admittedly adds a memory access for the function pointer dereference (due to architecture implementation I don't think I really need to worry about that).
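A minimal sketch of the function-pointer rewrite described above. The counters calls_a and calls_b are hypothetical stand-ins for the real work of path_a() and path_b(), and the loop is bounded (unlike the while(true) in the pseudocode) so that the sketch terminates:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for the real work of path_a()/path_b(). */
static unsigned long calls_a, calls_b;

static void do_stuff_a(void) { calls_a++; }
static void do_stuff_b(void) { calls_b++; }

/* Pick the branch once, then run the loop with no if(b) inside it. */
void run_loop(bool b, unsigned long iterations)
{
    void (*do_stuff)(void) = b ? do_stuff_a : do_stuff_b;
    assert(do_stuff != 0);
    for (unsigned long i = 0; i < iterations; i++)
        do_stuff();   /* indirect call, no per-iteration test of b */
}
```

Calling run_loop(true, 5) bumps only calls_a. The test of b is paid once, and each iteration pays for an indirect call instead.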
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()? If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I realize that compilers can't generate dynamic code at runtime, and the only types of systems that could do that in principle would be bytecode virtual machines or interpreters (e.g. Java, .NET, Ruby, etc.) -- so the question remains whether or not it is possible to do this statically and generate code that contains both the path_a(); branch and the path_b() branch, but avoid evaluating the conditional test if(b) for every call of do_stuff(b);.
If you tell your compiler to optimise, you have a good chance that the if(b) is evaluated only once.
Slightly modifying the given example, using the standard _Bool instead of BOOL, and adding the missing return types and declarations,
_Bool get_bool_from_environment(void);
void path_a(void);
void path_b(void);
void do_stuff(_Bool b) {
if(b)
path_a();
else
path_b();
}
int main(void) {
_Bool b = get_bool_from_environment(); //get it from a file, network, registry, whatever
while(1) {
do_stuff(b);
}
}
the (relevant part of the) produced assembly by clang -O3 [clang-3.0] is
callq get_bool_from_environment
cmpb $1, %al
jne .LBB1_2
.align 16, 0x90
.LBB1_1: # %do_stuff.exit.backedge.us
# =>This Inner Loop Header: Depth=1
callq path_a
jmp .LBB1_1
.align 16, 0x90
.LBB1_2: # %do_stuff.exit.backedge
# =>This Inner Loop Header: Depth=1
callq path_b
jmp .LBB1_2
b is tested only once, and main jumps into an infinite loop of either path_a or path_b depending on the value of b. If path_a and path_b are small enough, they would be inlined (I strongly expect). With -O and -O2, the code produced by clang would evaluate b in each iteration of the loop.
gcc (4.6.2) behaves similarly with -O3:
call get_bool_from_environment
testb %al, %al
jne .L8
.p2align 4,,10
.p2align 3
.L9:
call path_b
.p2align 4,,6
jmp .L9
.L8:
.p2align 4,,8
call path_a
.p2align 4,,8
call path_a
.p2align 4,,5
jmp .L8
Oddly, it unrolled the loop for path_a but not for path_b. With -O2 or -O, however, it would call do_stuff in the infinite loop.
Hence to
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()?
the answer is a definitive Yes, it is possible for compilers to recognize this and take advantage of that fact. Good compilers do when asked to optimise hard.
If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I don't know the name of the optimisation, but two implementations doing that are gcc and clang (at least, recent enough releases).
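For reference, the transformation visible in both listings is commonly called loop unswitching (GCC exposes it as -funswitch-loops, which -O3 turns on). Below is a hand-written sketch of what the compiler effectively does. The counters hits_a and hits_b and the helper check_equivalent are hypothetical, and the loops are bounded so the sketch terminates:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real path_a()/path_b(). */
static unsigned long hits_a, hits_b;
static void path_a(void) { hits_a++; }
static void path_b(void) { hits_b++; }

/* Original shape: the loop-invariant condition is tested every iteration. */
void loop_switched(bool b, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (b) path_a(); else path_b();
    }
}

/* Unswitched shape: the test is hoisted and the loop is duplicated. */
void loop_unswitched(bool b, unsigned n)
{
    if (b) {
        for (unsigned i = 0; i < n; i++) path_a();
    } else {
        for (unsigned i = 0; i < n; i++) path_b();
    }
}

/* Sanity check: both shapes make exactly the same calls. */
void check_equivalent(bool b, unsigned n)
{
    unsigned long a0 = hits_a, b0 = hits_b;
    loop_switched(b, n);
    unsigned long da = hits_a - a0, db = hits_b - b0;
    a0 = hits_a;
    b0 = hits_b;
    loop_unswitched(b, n);
    assert(hits_a - a0 == da && hits_b - b0 == db);
}
```

The price is code size: the loop body exists twice, which is one plausible reason the lower optimisation levels leave the test inside the loop.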
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
A bit of context here: I told a friend that I wrote a program not using dynamic memory allocation ("malloc") and allocate all memory static (by modeling all my state variables in a struct, which I then declared global). He then told me that if I'm using automatic variables, I still make use of dynamic memory. I argued that my automatic variables are not state variables but control variables, so my program is still to be considered static. We then discussed that there has to be a way to make a statement about the absolute worst-case behaviour about my program, so I came up with the above question.
Bonus question: If the assumptions above hold, I could simply declare all automatic variables static and would end up with a "truly" static program?
Even if array sizes are constant, a C implementation could allocate arrays and even structures dynamically. I'm not aware of any that do (is anyone?), and it would appear quite unhelpful, but the C Standard makes no such guarantees.
There is also (almost certainly) some further overhead in the stack frame (the data added to the stack on call and released on return).
You would need to declare all your functions as taking no parameters and returning void to ensure no program variables in the stack. Finally the 'return address' of where execution of a function is to continue after return is pushed onto the stack (at least logically).
So, having moved all parameters, automatic variables and return values into your 'state' struct, there will still be something going onto the stack - probably.
I say probably because I'm aware of a (non-standard) embedded C compiler that forbids recursion and can determine the maximum size of the stack by examining the call tree of the whole program and identifying the call chain that reaches the peak size of the stack.
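The call-tree analysis can be sketched in a few lines. Everything here is hypothetical: the func_info table, the per-function frame sizes, and the limit of four callees are made up for illustration, and a real tool would take the frame sizes from the compiler:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* One node of a hypothetical, recursion-free call graph. */
struct func_info {
    size_t frame_bytes;          /* assumed stack cost of one call */
    int callees[MAX_CALLEES];    /* indices into the table, -1 ends the list */
};

/* Worst-case stack use starting at function f: its own frame plus the
 * deepest chain among its callees. This terminates only because the
 * call graph is assumed to be acyclic. */
size_t worst_case_stack(const struct func_info *table, int f)
{
    assert(table != 0 && f >= 0);
    size_t deepest = 0;
    for (int i = 0; i < MAX_CALLEES && table[f].callees[i] >= 0; i++) {
        size_t d = worst_case_stack(table, table[f].callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return table[f].frame_bytes + deepest;
}
```

Note that the result is the maximum over call chains, not the sum over all functions: two functions that are never active at the same time can share the same stack bytes.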
You could achieve this with a monstrous pile of goto statements (some of them conditional, where a function is logically called from two places) or by duplicating code.
It's often important in embedded code on devices with tiny memory to avoid any dynamic memory allocation and know that any 'stack-space' will never overflow.
I'm happy this is a theoretical discussion. What you suggest is a mad way to write code and would throw away most of the (ultimately limited) services C provides as infrastructure for procedural coding (pretty much the call stack).
Footnote: See the comment below about the 8-bit PIC architecture.
Bonus question: If the assumptions above hold, I could simply declare
all automatic variables static and would end up with a "truly" static
program?
No. This would change the function of the program. static variables are initialized only once.
Compare these two functions:
int canReturn0Or1(void)
{
static unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
int willAlwaysReturn0(void)
{
unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
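Calling each function twice makes the difference visible. The sketch below repeats the two definitions in condensed form so that it compiles on its own; call_twice is a hypothetical helper added only for the demonstration:

```c
#include <assert.h>

int canReturn0Or1(void)
{
    static unsigned a = 0;   /* initialised once, keeps its value between calls */
    a++;
    return a > 1;
}

int willAlwaysReturn0(void)
{
    unsigned a = 0;          /* initialised again on every call */
    a++;
    return a > 1;
}

/* Calls fn twice and packs the two results as first * 10 + second. */
int call_twice(int (*fn)(void))
{
    assert(fn != 0);
    int first = fn();
    int second = fn();
    return first * 10 + second;
}
```

On a fresh start, call_twice(canReturn0Or1) yields 1 (results 0, then 1), while call_twice(willAlwaysReturn0) yields 0 (results 0, then 0).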
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
No, because of function pointers. Read n1570 (the C11 draft standard).
Consider the following code, where rand(3) is some pseudo random number generator (it could also be some input from a sensor) :
#include <stdlib.h> /* for rand() and NULL */
typedef int foosig(int);
int foo(int x) {
foosig* fptr = (x>rand())?&foo:NULL;
if (fptr)
return (*fptr)(x);
else
return x+rand();
}
An optimizing compiler (such as some recent GCC suitably invoked with enough optimizations) would make a tail-recursive call for (*fptr)(x). Some other compiler won't.
Depending on how you compile that code, it would use a bounded stack or could produce a stack overflow. With some ABIs and calling conventions, both the argument and the result could go through a processor register and not consume any stack space.
Experiment with a recent GCC (e.g. on Linux/x86-64, some GCC 10 in 2020) invoked as gcc -O2 -fverbose-asm -S foo.c then look inside foo.s. Change the -O2 to a -O0.
Observe that the naive recursive factorial function could be compiled into some iterative machine code with a good enough C compiler and optimizer. In practice GCC 10 on Linux compiling the below code:
int fact(int n)
{
if (n<1) return 1;
else return n*fact(n-1);
}
as gcc -O3 -fverbose-asm tmp/fact.c -S -o tmp/fact.s produces the following assembler code:
.type fact, @function
fact:
.LFB0:
.cfi_startproc
endbr64
# tmp/fact.c:3: if (n<1) return 1;
movl $1, %eax #, <retval>
testl %edi, %edi # n
jle .L1 #,
.p2align 4,,10
.p2align 3
.L2:
imull %edi, %eax # n, <retval>
subl $1, %edi #, n
jne .L2 #,
.L1:
# tmp/fact.c:5: }
ret
.cfi_endproc
.LFE0:
.size fact, .-fact
.ident "GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0"
You can see that the generated code contains no call instruction: the recursion has become a loop, so the call stack does not grow.
If you have serious and documented arguments against GCC, please submit a bug report.
BTW, you could write your own GCC plugin which would choose to randomly apply or not such an optimization. I believe it stays conforming to the C standard.
The above optimization is essential for many compilers generating C code, such as Chicken/Scheme or Bigloo.
A related theorem is Rice's theorem. See also this draft report funded by the CHARIOT project.
See also the Compcert project.
How would you define a pointer to a XMM register in asm()?
As with accessing array elements in a loop, how can you access registers in asm using a counter?
I tried to do it in the following code:
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++)
asm volatile
(
"movaps (%1),%%xmm%0"
:
:"r"(i),"r"(f+4*i)
:"%xmm%0"
);
But the compiler gives me this error:
unknown register name '%xmm%0' in 'asm'
This sounds like a horrible idea compared to using assembler macros or actually manual unrolling. Your code would totally break if gcc decided not to fully unroll the loop, because it can only work with compile-time constant indexing.
Also, there's no way to tell the compiler which register you're putting the result in, so this is basically useless. I'm only answering as a silly exercise in using GNU C inline-asm syntax, not because this answer is possibly useful in any project.
That said, you can do it using an "i" constraint and a c operand modifier to format the immediate as a bare number, like 1 instead of $1.
void *_aligned_malloc(int, int);
void foo()
{
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++) {
asm volatile (
"movaps %[input],%%xmm%c[regnum]"
:
// only compiles with optimization enabled.
:[regnum] "i"(i), [input] "m"(f[4*i])
:"%xmm0", "%xmm1", "%xmm2", "%xmm3"
);
}
}
gcc and clang, with -O3, are able to fully unroll and make i for each iteration a compile-time constant that can match an "i" constraint. This compiles on Godbolt.
# gcc7.3 -O3
foo():
subq $8, %rsp
movl $16, %esi
movl $64, %edi
call _aligned_malloc(int, int) # from a dummy prototype so it compiles
movaps (%rax),%xmm0
movaps 16(%rax),%xmm1 # compiler can use addressing modes because I switched to an "m" constraint
movaps 32(%rax),%xmm2
movaps 48(%rax),%xmm3
vzeroupper # XMM clobbers also include YMM, and I guess gcc assumes you might have dirtied the upper lanes.
addq $8, %rsp
ret
Note that I've only told the compiler about reading the first float of every group of 4.
ICC says catastrophic error: Cannot match asm operand constraint, even with -O3. With optimization disabled, gcc and clang have the same problem, of course. For example, gcc -O0 will say:
<source>: In function 'void foo()':
<source>:11:10: warning: asm operand 0 probably doesn't match constraints
);
^
<source>:11:10: error: impossible constraint in 'asm'
Compiler returned: 1
Because without optimization, i isn't a compile-time constant and can't match an "i" (immediate) constraint.
Obviously you can't use an "r" constraint; that would fill in the asm template with something like %xmm%eax if the compiler picked eax.
Anyway, this is useless because you can't use the destination register afterwards. All you can do is tell the compiler that all of the possible destination registers are clobbered. It's not safe to write to a clobbered register in one asm statement and then assume the value is still there in a later asm statement.
x86, like almost all other architectures, can't index the architectural registers using a runtime value. Register numbers must be hard-coded into the instruction stream.
(Some microcontrollers, like AVR, have memory-mapped registers, so you can index them by indexing the memory that aliases the register file. But this is rare, and x86 doesn't do it. It would interfere with out-of-order execution in a similar way to self-modifying code. And BTW, SMC (or branching to one of 16 different versions of an instruction) is the only option for runtime indexing of the register file.)
You can't -- there is no way to index into the register file.
If you want to use multiple registers in sequence, you will need to unroll the loop and name each of the registers explicitly.
It seems like gcc 4.6.2 removes code it considers unused from functions.
test.c
int main(void) {
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
Disassembly of main()
0x08048404 <+0>: push ebp
0x08048405 <+1>: mov ebp,esp
0x08048407 <+3>: nop # <-- This is all whats left of my jmp.
0x08048408 <+4>: mov eax,0x0
0x0804840d <+9>: pop ebp
0x0804840e <+10>: ret
Compiler options
No optimizations enabled, just gcc -m32 -o test test.c (-m32 because I'm on a 64 bit machine).
How can I stop this behavior?
Edit: Preferably by using compiler options, not by modifying the code.
Looks like that's just the way it is - When gcc sees that code within a function is unreachable, it removes it. Other compilers might be different.
In gcc, an early phase in compilation is building the "control flow graph" - a graph of "basic blocks", each free of conditions, connected by branches. When emitting the actual code, parts of the graph, which are not reachable from the root, are discarded.
This isn't part of the optimization phase, and is therefore unaffected by compilation options.
So any solution would involve making gcc think that the code is reachable.
My suggestion:
Instead of putting your assembly code in an unreachable place (where GCC may remove it), you can put it in a reachable place, and skip over the problematic instruction:
int main(void) {
goto exit;
exit:
__asm__ __volatile__ (
"jmp 1f\n"
"jmp $0x0\n"
"1:\n"
);
return 0;
}
Also, see this thread about the issue.
I do not believe there is a reliable way using just compile options to solve this. The preferable mechanism is something that will do the job and work on future versions of the compiler regardless of the options used to compile.
Commentary about Accepted Answer
In the accepted answer there is an edit to the original that suggests this solution:
int main(void) {
__asm__ ("jmp exit");
handler:
__asm__ __volatile__("jmp $0x0");
exit:
return 0;
}
First off jmp $0x0 should be jmp 0x0. Secondly C labels usually get translated into local labels. jmp exit doesn't actually jump to the label exit in the C function, it jumps to the exit function in the C library effectively bypassing the return 0 at the bottom of main. Using Godbolt with GCC 4.6.4 we get this non-optimized output (I have trimmed the labels we don't care about):
main:
pushl %ebp
movl %esp, %ebp
jmp exit
jmp 0x0
.L3:
movl $0, %eax
popl %ebp
ret
.L3 is actually the local label for exit. You won't find the exit label in the generated assembly. It may compile and link if the C library is present. Do not use C local goto labels in inline assembly like this.
Use asm goto as the Solution
As of GCC 4.5 (OP is using 4.6.x) there is support for asm goto extended assembly templates. asm goto allows you to specify jump targets that the inline assembly may use:
6.45.2.7 Goto Labels
asm goto allows assembly code to jump to one or more C labels. The GotoLabels section in an asm goto statement contains a comma-separated list of all C labels to which the assembler code may jump. GCC assumes that asm execution falls through to the next statement (if this is not the case, consider using the __builtin_unreachable intrinsic after the asm statement). Optimization of asm goto may be improved by using the hot and cold label attributes (see Label Attributes).
An asm goto statement cannot have outputs. This is due to an internal restriction of the compiler: control transfer instructions cannot have outputs. If the assembler code does modify anything, use the "memory" clobber to force the optimizers to flush all register values to memory and reload them if necessary after the asm statement.
Also note that an asm goto statement is always implicitly considered volatile.
To reference a label in the assembler template, prefix it with ‘%l’ (lowercase ‘L’) followed by its (zero-based) position in GotoLabels plus the number of input operands. For example, if the asm has three inputs and references two labels, refer to the first label as ‘%l3’ and the second as ‘%l4’).
Alternately, you can reference labels using the actual C label name enclosed in brackets. For example, to reference a label named carry, you can use ‘%l[carry]’. The label must still be listed in the GotoLabels section when using this approach.
The code could be written this way:
int main(void) {
__asm__ goto ("jmp %l[exit]" :::: exit);
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
We can use asm goto. I prefer __asm__ over asm since it will not throw warnings if compiling with -ansi or -std=? options.
After the clobbers you can list the jump targets the inline assembly may use. C doesn't actually know if we jump or not as GCC doesn't analyze the actual code in the inline assembly template. It can't remove this jump, nor can it assume what comes after is dead code. Using Godbolt with GCC 4.6.4 the unoptimized code (trimmed) looks like:
main:
pushl %ebp
movl %esp, %ebp
jmp .L2 # <------ this is the goto exit
jmp 0x0
.L2: # <------ exit label
movl $0, %eax
popl %ebp
ret
With optimization enabled, the Godbolt output for GCC 4.6.4 still looks correct:
main:
jmp .L2 # <------ this is the goto exit
jmp 0x0
.L2: # <------ exit label
xorl %eax, %eax
ret
This mechanism should also work whether you have optimizations on or off, and shouldn't matter whether you are compiling for 64-bit or 32-bit x86 targets.
Other Observations
When there are no output constraints in an extended inline assembly template the asm statement is implicitly volatile. The line
__asm__ __volatile__("jmp 0x0");
Can be written as:
__asm__ ("jmp 0x0");
asm goto statements are considered implicitly volatile. They don't require a volatile modifier either.
Would this work? It makes it so that gcc can't know the code is unreachable:
int main(void)
{
volatile int y = 1;
if (y) goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
If a compiler thinks it can cheat you, just cheat back: (GCC only)
int main(void) {
{
/* Place this code anywhere in the same function, where
* control flow is known to still be active (such as at the start) */
extern volatile unsigned int some_undefined_symbol;
__asm__ __volatile__(".pushsection .discard" : : : "memory");
if (some_undefined_symbol) goto handler;
__asm__ __volatile__(".popsection" : : : "memory");
}
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
This solution will not add any additional overhead for meaningless instructions, though only works for GCC when used with AS (as is the default).
Explanation: .pushsection switches text output of the compiler to another section, in this case .discard (which is deleted during linking by default). The "memory" clobber prevents GCC from trying to move other text within the section that will be discarded. However, GCC doesn't realize (and never could, because the __asm__s are __volatile__) that anything happening between the two statements will be discarded.
As for some_undefined_symbol, that is literally just any symbol that is never being defined (or is actually defined, it shouldn't matter). And since the section of code using it will be discarded during linking, it won't produce any unresolved-reference errors either.
Finally, the conditional jump to the label you want to make appear as though it were reachable does exactly that. Besides the fact that it won't appear in the output binary at all, GCC realizes that it can't know anything about some_undefined_symbol, so it has no choice but to assume that both of the if's branches are reachable. As far as it is concerned, control flow can continue either by reaching goto exit or by jumping to handler (even though there won't be any code that could actually do this).
However, be careful when enabling garbage collection in your linker ld --gc-sections (it's disabled by default), because otherwise it might get the idea to get rid of the still unused label regardless.
EDIT:
Forget all that. Just do this:
int main(void) {
__asm__ __volatile__ goto("" : : : : handler);
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
Update 2012/6/18
Just thinking about it, one can put the goto exit in an asm block, which means that only 1 line of code needs to change:
int main(void) {
__asm__ ("jmp exit");
handler:
__asm__ __volatile__("jmp $0x0");
exit:
return 0;
}
That is significantly cleaner than my other solution below (and possibly nicer than @ugoren's current one too).
This is pretty hacky, but it seems to work: hide the handler in a conditional that can never be followed under normal conditions, but stop it from being eliminated by stopping the compiler from being able to do its analysis properly with some inline assembler.
int main (void) {
int x = 0;
__asm__ __volatile__ ("" : "=r"(x));
// compiler can't tell what the value of x is now, but it's always 0
if (x) {
handler:
__asm__ __volatile__ ("jmp $0x0");
}
return 0;
}
Even with -O3 the jmp is preserved:
testl %eax, %eax
je .L2
.L3:
jmp $0x0
.L2:
xorl %eax, %eax
ret
(This seems really dodgy, so I hope there is a better way to do this. Edit: just putting a volatile in front of x works, so one doesn't need the inline asm trickery.)
I've never heard of a way to prevent gcc from removing unreachable code; it seems that no matter what you do, once gcc detects unreachable code it always removes it (use gcc's -Wunreachable-code option to see what it considers to be unreachable).
That said, you can still put this code in a static function and it won't be optimized out:
static void func(void)
{
__asm__ __volatile__("jmp $0x0");
}
int main(void)
{
goto exit;
handler:
func();
exit:
return 0;
}
P.S
This solution is particularly handy if you want to avoid code redundancy when implanting the same "handler" code block in more than one place in the original code.
gcc may duplicate asm statements inside functions and remove them during optimisation (even at -O0), so this will never work reliably.
One way to do this reliably is to use a global asm statement (i.e. an asm statement outside of any function). gcc will copy this straight to the output, and you can use global labels without any problems.
I know that everybody hates GOTO and nobody recommends it, but that's not the point. I just want to know which code is the fastest:
the goto loop
int i=3;
loop:
printf("something");
if(--i) goto loop;
the while loop
int i=3;
while(i--) {
printf("something");
}
the for loop
for(int i=3; i; i--) {
printf("something");
}
Generally speaking, for and while loops get compiled to the same thing as goto, so it usually won't make a difference. If you have your doubts, you can feel free to try all three and see which takes longer. Odds are you'll be unable to measure a difference, even if you loop a billion times.
If you look at this answer, you'll see that the compiler can generate exactly the same code for for, while, and goto (only in this case there was no condition).
The only time I've seen the argument made for goto was in one of W. Richard Stevens' articles or books. His point was that in a very time-critical section of code (I believe his example was the network stack), having nested if/else blocks with related error-handling code could be redone using goto in a way that made a valuable difference.
Personally, I'm not good enough a programmer to argue with Stevens' work, so I won't try. goto can be useful for performance-related issues, but the limits of when that is so are fairly strict.
Write short programs, then do this:
gcc -S -O2 p1.c
gcc -S -O2 p2.c
gcc -S -O2 p3.c
Analyze the output and see if there's any difference. Be sure to introduce some level of unpredictability such that the compiler doesn't optimize the program away to nothing.
Compilers do a great job of optimizing these trivial concerns. I'd suggest not to worry about it, and instead focus on what makes you more productive as a programmer.
Speed and efficiency are great things to worry about, but 99% of the time that involves using proper data structures and algorithms... not worrying about whether a for is faster than a while or a goto, etc.
It is probably compiler-, optimiser-, and architecture-specific.
For example, the code if(--i) goto loop; is a conditional test followed by an unconditional branch. A compiler might simply generate corresponding code, or it might be smart enough to generate a single conditional branch instruction (a compiler without at least that much smarts may not be worth much). while(i--), on the other hand, is already a conditional branch at the source level, so translation to a conditional branch at the machine level may be more likely regardless of the sophistication of the compiler implementation or optimiser.
In the end, the difference is likely to be minute and only relevant if a great many iterations are required. The way to answer this question is to build the code for the specific target and compiler (and compiler settings) of interest, and either inspect the resultant machine-level code or directly measure execution time.
In your examples the printf() in the loop will dominate any timing in any case; something simpler in the loop would make observations of the differences easier. I would suggest an empty loop, and then declaring i volatile to prevent the loop being optimised to nothing.
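A minimal sketch of that measurement setup: three otherwise empty loops driven by a volatile counter, so the optimiser has to keep them. The function names and the iteration tally are assumptions added for illustration; compile each variant with your compiler of interest and compare the assembly or the timings:

```c
#include <assert.h>

/* Each function runs an otherwise empty loop n times and returns the
 * number of iterations it actually executed. */
unsigned count_goto(unsigned n)
{
    assert(n >= 1);              /* the goto form always runs the body once */
    volatile unsigned i = n;     /* volatile: the loop cannot be deleted */
    unsigned iterations = 0;
loop:
    iterations++;
    if (--i) goto loop;
    return iterations;
}

unsigned count_while(unsigned n)
{
    volatile unsigned i = n;
    unsigned iterations = 0;
    while (i--)
        iterations++;
    return iterations;
}

unsigned count_for(unsigned n)
{
    unsigned iterations = 0;
    for (volatile unsigned i = n; i; i--)
        iterations++;
    return iterations;
}
```

All three perform the same number of iterations for n >= 1, so any timing difference comes from the generated loop control alone. Bear in mind that volatile forces a load and a store of i on every iteration, which may itself dominate what is being measured.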
As long as you're generating the same flow of control as a normal loop, pretty nearly any decent compiler can and will produce the same code whether you use for, while, etc. for it.
You can gain something from using goto, but usually only if you're generating a flow of control that a normal loop simply can't (at least cleanly). A typical example is jumping into the middle of a loop to get a loop and a half construct, which most languages' normal loop statements (including C's) don't provide cleanly.
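As an illustration of the loop-and-a-half case, here is a sketch that joins numbers with commas by jumping into the middle of the loop. The function join_numbers is hypothetical, and the buffer is assumed to be large enough for the result:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "1,2,...,n" into buf. Entering the loop halfway means the
 * separator is written between items but not before the first one. */
void join_numbers(char *buf, size_t size, int n)
{
    assert(buf != 0 && size > 0 && n >= 1);
    size_t used = 0;
    int i = 1;
    goto middle;                  /* skip the separator the first time */
    while (i <= n) {
        used += (size_t)snprintf(buf + used, size - used, ",");
        assert(used < size);
middle:
        used += (size_t)snprintf(buf + used, size - used, "%d", i);
        assert(used < size);      /* sketch assumes buf is large enough */
        i++;
    }
}
```

The structured alternatives either duplicate the "write one item" step before the loop or test for the first iteration inside it. Whether the goto version is actually faster is something to measure rather than assume.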
There should not be any significant difference between any of the loops and the goto, except that the compiler is probably less likely to try to optimize the goto version at all.
And there is not much sense in trying to optimize the compiler-generated loop control. It makes more sense to optimize the code inside the loop, reduce the number of iterations, and so on.
I think the compiler will generate the same code under normal conditions.
In fact I think goto is very convenient sometimes, although it is hard to read.
On Linux, I compiled the code below into assembly using both g++ and clang++. For more information on how I did that, see here. (Short version: g++ -S -O3 filename.cpp and clang++ -S -O3 filename.cpp, plus the assembly comments you'll see below to help me find the relevant section.)
Conclusion/TL;DR at the bottom.
First, I compared label: and goto vs. do {} while. You can't compare a for () {} loop with this (in good faith), because a for loop always evaluates the condition first. This time around, the condition is evaluated only after the loop code has been executed once.
#include <iostream>
void testGoto()
{
__asm("//startTest");
int i = 0;
loop:
std::cout << i;
++i;
if (i < 100)
{
goto loop;
}
__asm("//endTest");
}
#include <iostream>
void testDoWhile()
{
__asm("//startTest");
int i = 0;
do
{
std::cout << i;
++i;
}
while (i < 100);
__asm("//endTest");
}
In both cases, the assembly is the exact same regardless of goto or do {} while, per compiler:
g++:
xorl %ebx, %ebx
leaq _ZSt4cout(%rip), %rbp
.p2align 4,,10
.p2align 3
.L2:
movl %ebx, %esi
movq %rbp, %rdi
addl $1, %ebx
call _ZNSolsEi@PLT
cmpl $100, %ebx
jne .L2
clang++:
xorl %ebx, %ebx
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
movl $_ZSt4cout, %edi
movl %ebx, %esi
callq _ZNSolsEi
addl $1, %ebx
cmpl $100, %ebx
jne .LBB0_1
# %bb.2:
Then I compared label: and goto vs. while {} vs. for () {}. This time around, the condition is evaluated before the loop code has been executed even once.
For goto, I had to invert the condition, at least for the first time. I saw two ways of implementing it, so I tried both ways.
#include <iostream>
void testGoto1()
{
__asm("//startTest");
int i = 0;
loop:
if (i >= 100)
{
goto exitLoop;
}
std::cout << i;
++i;
goto loop;
exitLoop:
__asm("//endTest");
}
#include <iostream>
void testGoto2()
{
__asm("//startTest");
int i = 0;
if (i >= 100)
{
goto exitLoop;
}
loop:
std::cout << i;
++i;
if (i < 100)
{
goto loop;
}
exitLoop:
__asm("//endTest");
}
#include <iostream>
void testWhile()
{
__asm("//startTest");
int i = 0;
while (i < 100)
{
std::cout << i;
++i;
}
__asm("//endTest");
}
#include <iostream>
void testFor()
{
__asm("//startTest");
for (int i = 0; i < 100; ++i)
{
std::cout << i;
}
__asm("//endTest");
}
As above, in all four cases, the assembly is the exact same regardless of goto 1 or 2, while {}, or for () {}, per compiler, with just 1 tiny exception for g++ that may be meaningless:
g++:
xorl %ebx, %ebx
leaq _ZSt4cout(%rip), %rbp
.p2align 4,,10
.p2align 3
.L2:
movl %ebx, %esi
movq %rbp, %rdi
addl $1, %ebx
call _ZNSolsEi@PLT
cmpl $100, %ebx
jne .L2
Exception for g++: at the end of the goto2 assembly, the assembly added:
.L3:
endbr64
(I presume this extra label was optimized out of the goto 1's assembly.) I would assume that this is completely insignificant though.
clang++:
xorl %ebx, %ebx
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
movl $_ZSt4cout, %edi
movl %ebx, %esi
callq _ZNSolsEi
addl $1, %ebx
cmpl $100, %ebx
jne .LBB0_1
# %bb.2:
In conclusion/TL;DR: No, there does not appear to be any difference whatsoever between any of the possible equivalent arrangements of label: and goto, do {} while, while {}, and for () {}, at least on Linux using g++ 9.3.0 and clang++ 10.0.0.
Note that I did not test break and continue here; however, given that the assembly code generated for each of the 4 in any scenario was the same, I can only presume that they would be the exact same for break and continue, especially since the assembly is using labels and jumps for every scenario.
To ensure correct results, I was very meticulous in my process and also used Visual Studio Code's compare files feature.
There are several niche verticals where goto is still commonly used as standard practice by some very smart folks, and there is no bias against goto in those settings. I used to work at a simulation-focused company where all the local Fortran code had tons of gotos; the team was super smart, and the software worked nearly flawlessly.
So, we can leave the merit of goto aside, and if the question merely is to compare the loops, then we do so by profiling and/or comparing the assembly code. That said however, the question includes statements like printf etc. You can't really have a discussion about loop control logic optimization when doing that. Also, as others have pointed out, the given loops will all generate VERY similar machine codes.
All conditional branches are considered "taken" (true) in pipelined processor architectures anyway until decode phase, in addition to small loops being usually expanded to be loopless. So, in line with Harper's point above, it is unrealistic for goto to have any advantage whatsoever in simple loop control (just as for or while don't have an advantage over each other). GOTO makes sense usually in multiple nested loops or multiple nested ifs, when adding the additional condition checked by goto into EACH of the nested loops or nested ifs is suboptimal.
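As a minimal sketch of the nested-loop case mentioned above: the function name find_in_grid and its parameters are hypothetical, chosen only for illustration. A single goto leaves both loops at once, where a flag variable would have to be tested in each loop condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: searches a rows x cols grid stored in row-major
   order and reports the position of the first match. */
bool find_in_grid(const int *grid, size_t rows, size_t cols, int target,
                  size_t *out_row, size_t *out_col)
{
    size_t r, c;
    for (r = 0; r < rows; ++r) {
        for (c = 0; c < cols; ++c) {
            if (grid[r * cols + c] == target)
                goto found;   /* leaves both loops in one jump */
        }
    }
    return false;             /* fell through: no match anywhere */

found:
    *out_row = r;
    *out_col = c;
    return true;
}
```

Whether this is actually faster than a flag-based version depends on the compiler and optimization level; the main gain here is simpler control flow.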
When optimizing a search operation in a simple loop, using a sentinel is sometimes more effective than anything else. Essentially, by adding a dummy value at the end of the array, you reduce the two conditions checked per iteration (end of array and value found) to just one (value found), which saves cmp operations. I do not know whether compilers do this automatically.
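A minimal sketch of the sentinel idea, assuming the caller can spare one extra writable slot at the end of the buffer. The name sentinel_find is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper. buf must have room for n + 1 elements: slot n is
   scratch space that this function overwrites with the sentinel. */
ptrdiff_t sentinel_find(int *buf, size_t n, int target)
{
    size_t i = 0;
    buf[n] = target;           /* sentinel guarantees the loop terminates */
    while (buf[i] != target)   /* one comparison per iteration, no bounds check */
        ++i;
    return i < n ? (ptrdiff_t)i : -1;   /* stopping at the sentinel means "not found" */
}
```

The trade-off is that the array must be writable and one element larger than its logical size, so this does not apply to read-only or exactly-sized data.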
goto Loop:
start_Chicken:
{
++x;
if (x >= loops)
goto end_Chicken;
}
goto start_Chicken;
end_Chicken:
x = 0;
for Loop:
for (int i = 0; i < loops; i++)
{
}
while Loop:
while (z <= loops)
{
++z;
}
z = 0;
(Image: benchmark results)
Across all situations, including more mixed tests, the while loop had minimally but still better results.
If a function calls itself while also defining local variables, will it result in a stack overflow? Is there any option in gcc to reuse the same stack?
void funcnew(void)
{
int a=10;
int b=20;
funcnew();
return ;
}
Can a function reuse the stack frame it used earlier?
What is the gcc option to reuse the same frame in tail recursion?
Yes. See
-foptimize-sibling-calls
Optimize sibling and tail recursive calls.
Enabled at levels -O2, -O3, -Os.
Your function is compiled to:
funcstack:
.LFB0:
.cfi_startproc
xorl %eax, %eax
jmp func
.cfi_endproc
(note the jump to func)
Reusing the stack frame when a function ends with a call -- which in its full generality includes manipulating the stack to put the parameters in the correct place and replacing the function call with a jump to the start of the function -- is a well-known optimisation called tail call removal. It is mandated by some languages (Scheme, for instance) for recursive calls (a recursive call is the natural way to express a loop in these languages). As shown above, gcc implements the optimisation for C, but I'm not sure which other compilers have it, so I would not depend on it in portable code. Note also that I don't know what restrictions apply to it -- I'm not sure, for instance, that gcc will manipulate the stack if the parameter types are different.
Even without defining the variables you'd get a stack overflow, since the return address is also pushed onto the stack.
It is possible (I've learned this recently) that the compiler optimizes your tail recursion into a loop (which makes the stack not grow at all). See the tail recursion question on SO.
No, each recursion is a new stack frame. If the recursion is infinitely deep, then the stack needed is also infinite, so you get a stack overflow.
Yes, in some cases the compiler may be able to perform something called tail call optimization. You should check with your compiler manual. (AProgrammer seems to have quoted the GCC manual in his answer.)
This is an essential optimization when implementing for example functional languages, where such code occurs frequently.
You can't do away with the stack frame altogether, as it is needed for the return address, unless you are using tail recursion and your compiler has optimised it into a loop. To be completely technically honest, though, you can do away with all the variables in the frame by making them static. However, this is almost certainly not what you want to do, and you should not do it without knowing exactly what you are doing, which, as you had to ask this question, you don't.
As others have noted, it is only possible if (1) your compiler supports tail call optimization and (2) your function is eligible for it. The optimization reuses the existing stack frame and performs a JMP (i.e., a GOTO in assembly) instead of a CALL.
In fact, your example function is indeed eligible for such an optimization. The reason is that the last thing your function does before returning is call itself; it doesn't have to do anything after the last call to funcnew(). However, only certain compilers will perform such an optimization. GCC, for instance, will do it. For more info, see Which, if any, C++ compilers do tail-recursion optimization?
The classic example for this is the factorial function. Let's write a recursive factorial function that is not eligible for tail call optimization (TCO).
int fact(int n)
{
if ( n == 1 ) return 1;
return n*fact(n-1);
}
The last thing it does is multiply n by the result of fact(n-1). If we can somehow eliminate this last operation, we will be able to reuse the stack frame. Let's introduce an accumulator variable that computes the answer for us:
int fact_helper(int n, int acc)
{
if ( n == 1 ) return acc;
return fact_helper(n-1, n*acc);
}
int fact_acc(int n)
{
return fact_helper(n, 1);
}
The function fact_helper does the work, while fact_acc is just a convenience function to initialize the accumulator variable.
Note how the last thing fact_helper does is to call itself. This CALL can be converted to a JMP by reusing the existing stack for the variables.
With GCC, you can verify that it is optimized to a jump by looking at the generated assembly, for instance gcc -c -O3 -Wa,-a,-ad fact.c:
...
37 L12:
38 0040 0FAFC2 imull %edx, %eax
39 0043 83EA01 subl $1, %edx
40 0046 83FA01 cmpl $1, %edx
41 0049 75F5 jne L12
...
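For comparison, the jump-based loop in the listing corresponds roughly to the following hand-written iteration. This is only a sketch of what the optimized fact_helper amounts to, not compiler output; the name fact_loop is hypothetical, and like the original it assumes n >= 1.

```c
/* Hand-written equivalent of fact_helper(n, 1) after the tail call
   has been turned into a jump. Assumes n >= 1, as the original does. */
int fact_loop(int n)
{
    int acc = 1;
    while (n != 1) {   /* cmpl $1 / jne */
        acc *= n;      /* imull */
        n -= 1;        /* subl $1 */
    }
    return acc;
}
```

No stack frame is created per step here, which is exactly the effect the optimization achieves for the recursive version.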
Some programming languages, such as Scheme, will actually guarantee that proper implementations will perform such optimizations. They will even do it for non-recursive tail calls.
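A non-recursive tail call looks like this in C. This is a sketch with hypothetical names (is_even, is_odd): each function ends by calling the other one, so neither has work left to do after the call. Unlike Scheme, the C standard does not guarantee that these calls become jumps; with gcc it depends on the optimization level and on -foptimize-sibling-calls, so a very large n may still overflow the stack in an unoptimized build.

```c
#include <stdbool.h>

bool is_odd(unsigned n);

bool is_even(unsigned n)
{
    if (n == 0) return true;
    return is_odd(n - 1);    /* tail call to a different function */
}

bool is_odd(unsigned n)
{
    if (n == 0) return false;
    return is_even(n - 1);   /* tail call back to is_even */
}
```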