C/C++: is GOTO faster than WHILE and FOR?


I know that everybody hates GOTO and nobody recommends it, but that's not the point. I just want to know which code is the fastest:
the goto loop
int i=3;
loop:
printf("something");
if(--i) goto loop;
the while loop
int i=3;
while(i--) {
printf("something");
}
the for loop
for(int i=3; i; i--) {
printf("something");
}

Generally speaking, for and while loops get compiled to the same thing as goto, so it usually won't make a difference. If you have your doubts, you can feel free to try all three and see which takes longer. Odds are you'll be unable to measure a difference, even if you loop a billion times.
If you look at this answer, you'll see that the compiler can generate exactly the same code for for, while, and goto (only in this case there was no condition).
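As a minimal sketch of the equivalence described above, the three functions below run the same trivial body n times using the question's three loop shapes. The names count_goto, count_while, and count_for are illustrative and not from the question, and the sketch assumes n >= 1, because the goto version tests its condition only after the body has run once.

```c
/* Each function runs its body n times and returns the iteration count.
 * Assumption: n >= 1, since the goto version tests after the body. */

int count_goto(int n)
{
    int count = 0;
    int i = n;
loop:
    ++count;                /* loop body */
    if (--i) goto loop;     /* conditional backward jump */
    return count;
}

int count_while(int n)
{
    int count = 0;
    int i = n;
    while (i--) {
        ++count;            /* loop body */
    }
    return count;
}

int count_for(int n)
{
    int count = 0;
    for (int i = n; i; i--) {
        ++count;            /* loop body */
    }
    return count;
}
```

Compiling a file like this with an optimising compiler and comparing the three function bodies in the assembly output is one way to check the claim for a particular compiler and target.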

The only time I've seen the argument made for goto was in one of W. Richard Stevens' articles or books. His point was that in a very time-critical section of code (I believe his example was the network stack), having nested if/else blocks with related error-handling code could be redone using goto in a way that made a valuable difference.
Personally, I'm not good enough a programmer to argue with Stevens' work, so I won't try. goto can be useful for performance-related issues, but the limits of when that is so are fairly strict.
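For readers who have not seen goto used for error handling, here is a small sketch of the general shape. It is not Stevens' code and makes no performance claim; it only shows how a single cleanup label can replace nested if/else blocks. The function dup_pair is a hypothetical helper invented for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies two strings into freshly allocated buffers.
 * Returns 0 on success and -1 on failure. On failure nothing is leaked
 * and the output pointers are left untouched. */
int dup_pair(const char *a, const char *b, char **out_a, char **out_b)
{
    char *copy_a = NULL;
    char *copy_b = NULL;

    copy_a = malloc(strlen(a) + 1);
    if (copy_a == NULL)
        goto fail;

    copy_b = malloc(strlen(b) + 1);
    if (copy_b == NULL)
        goto fail;

    strcpy(copy_a, a);
    strcpy(copy_b, b);
    *out_a = copy_a;
    *out_b = copy_b;
    return 0;

fail:
    /* Single cleanup path; free(NULL) is a no-op. */
    free(copy_b);
    free(copy_a);
    return -1;
}
```

The success path runs straight through without any nesting, and every failure jumps to the same cleanup code.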

Write short programs, then do this:
gcc -S -O2 p1.c
gcc -S -O2 p2.c
gcc -S -O2 p3.c
Analyze the output and see if there's any difference. Be sure to introduce some level of unpredictability such that the compiler doesn't optimize the program away to nothing.
Compilers do a great job of optimizing these trivial concerns. I'd suggest not to worry about it, and instead focus on what makes you more productive as a programmer.
Speed and efficiency are great things to worry about, but 99% of the time that involves using proper data structures and algorithms, not worrying about whether a for is faster than a while or a goto.

It probably depends on the compiler, the optimiser, and the architecture.
For example the code if(--i) goto loop; is a conditional test followed by an unconditional branch. A compiler might simply generate corresponding code, or it might be smart enough (and a compiler without at least that much smarts may not be worth much) to generate a single conditional branch instruction. while(i--), on the other hand, is already a conditional branch at the source level, so translation to a conditional branch at the machine level may be more likely regardless of the sophistication of the compiler implementation or optimiser.
In the end, the difference is likely to be minute and only relevant if a great many iterations are required. The way to answer this question is to build the code for the specific target, compiler, and compiler settings of interest, and either inspect the resulting machine code or directly measure execution time.
In your examples the printf() in the loop will dominate any timing in any case; something simpler in the loop would make observations of the differences easier. I would suggest an empty loop, and then declaring i volatile to prevent the loop being optimised to nothing.
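A rough sketch of that suggestion follows: two empty countdown loops over a volatile counter, plus a small timing helper built on the standard clock() function. The names spin_goto, spin_while, and time_loop are made up for this example, both loops use --i so that they perform the same number of decrements, and n is assumed to be at least 1.

```c
#include <time.h>

/* Counts a volatile variable down from n with a goto loop; n must be >= 1.
 * volatile forces every decrement to happen, so the loop is not removed. */
long spin_goto(long n)
{
    volatile long i = n;
loop:
    if (--i) goto loop;
    return i;               /* 0 once the loop has finished */
}

/* The same countdown written as a while loop. */
long spin_while(long n)
{
    volatile long i = n;
    while (--i) { }
    return i;               /* 0 once the loop has finished */
}

/* Returns the CPU time in seconds that fn(n) takes, measured with clock(). */
double time_loop(long (*fn)(long), long n)
{
    clock_t start = clock();
    fn(n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_loop(spin_goto, n) and time_loop(spin_while, n) with a large n, several times each, gives numbers that can be compared. clock() is coarse, so small differences should be treated as noise.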

As long as you're generating the same flow of control as a normal loop, pretty nearly any decent compiler can and will produce the same code whether you use for, while, etc. for it.
You can gain something from using goto, but usually only if you're generating a flow of control that a normal loop simply can't (at least cleanly). A typical example is jumping into the middle of a loop to get a loop and a half construct, which most languages' normal loop statements (including C's) don't provide cleanly.
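To make the loop-and-a-half idea concrete, here is a small sketch that joins strings with a separator. The first half of the loop body (writing the separator) has to be skipped on the first pass, which is done by jumping into the middle of the loop. The function join_items and its parameters are invented for this illustration.

```c
#include <string.h>

/* Joins `count` strings with ", " into `out`, which has room for `cap` bytes
 * including the terminator. Output is truncated if it does not fit.
 * Returns `out`. */
char *join_items(const char *const *items, size_t count, char *out, size_t cap)
{
    size_t i = 0;

    if (cap == 0)
        return out;
    out[0] = '\0';
    if (count == 0)
        return out;

    goto item;                  /* enter the loop at its second half */
    while (i < count) {
        /* first half: separator, skipped on the very first pass */
        strncat(out, ", ", cap - strlen(out) - 1);
item:
        /* second half: the item itself */
        strncat(out, items[i], cap - strlen(out) - 1);
        ++i;
    }
    return out;
}
```

The same result can be had without goto by testing i inside the loop or by handling the first item before the loop, so whether the jump is cleaner is a matter of taste.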

There should not be any significant difference between any of the loops and the goto, except perhaps that the compiler is less likely to try to optimize the goto version at all.
And there is not much sense in trying to optimize the loop code the compiler generates. It makes more sense to optimize the code inside the loop, reduce the number of iterations, and so on.

I think the compiler will produce the same code under normal conditions.
In fact I think goto is very convenient sometimes, although it is hard to read.

On Linux, I compiled the code below into assembly using both g++ and clang++. For more information on how I did that, see here. (Short version: g++ -S -O3 filename.cpp and clang++ -S -O3 filename.cpp, plus some assembly comments, which you'll see below, to help me out.)
Conclusion/TL;DR at the bottom.
First, I compared label: and goto vs. do {} while. You can't compare a for () {} loop with this (in good faith), because a for loop always evaluates the condition first. This time around, the condition is evaluated only after the loop code has been executed once.
#include <iostream>
void testGoto()
{
__asm("//startTest");
int i = 0;
loop:
std::cout << i;
++i;
if (i < 100)
{
goto loop;
}
__asm("//endTest");
}
#include <iostream>
void testDoWhile()
{
__asm("//startTest");
int i = 0;
do
{
std::cout << i;
++i;
}
while (i < 100);
__asm("//endTest");
}
In both cases, the assembly is the exact same regardless of goto or do {} while, per compiler:
g++:
xorl %ebx, %ebx
leaq _ZSt4cout(%rip), %rbp
.p2align 4,,10
.p2align 3
.L2:
movl %ebx, %esi
movq %rbp, %rdi
addl $1, %ebx
call _ZNSolsEi@PLT
cmpl $100, %ebx
jne .L2
clang++:
xorl %ebx, %ebx
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
movl $_ZSt4cout, %edi
movl %ebx, %esi
callq _ZNSolsEi
addl $1, %ebx
cmpl $100, %ebx
jne .LBB0_1
# %bb.2:
Then I compared label: and goto vs. while {} vs. for () {}. This time around, the condition is evaluated before the loop code has been executed even once.
For goto, I had to invert the condition, at least for the first time. I saw two ways of implementing it, so I tried both ways.
#include <iostream>
void testGoto1()
{
__asm("//startTest");
int i = 0;
loop:
if (i >= 100)
{
goto exitLoop;
}
std::cout << i;
++i;
goto loop;
exitLoop:
__asm("//endTest");
}
#include <iostream>
void testGoto2()
{
__asm("//startTest");
int i = 0;
if (i >= 100)
{
goto exitLoop;
}
loop:
std::cout << i;
++i;
if (i < 100)
{
goto loop;
}
exitLoop:
__asm("//endTest");
}
#include <iostream>
void testWhile()
{
__asm("//startTest");
int i = 0;
while (i < 100)
{
std::cout << i;
++i;
}
__asm("//endTest");
}
#include <iostream>
void testFor()
{
__asm("//startTest");
for (int i = 0; i < 100; ++i)
{
std::cout << i;
}
__asm("//endTest");
}
As above, in all four cases, the assembly is the exact same regardless of goto 1 or 2, while {}, or for () {}, per compiler, with just 1 tiny exception for g++ that may be meaningless:
g++:
xorl %ebx, %ebx
leaq _ZSt4cout(%rip), %rbp
.p2align 4,,10
.p2align 3
.L2:
movl %ebx, %esi
movq %rbp, %rdi
addl $1, %ebx
call _ZNSolsEi@PLT
cmpl $100, %ebx
jne .L2
Exception for g++: at the end of the goto2 assembly, the assembly added:
.L3:
endbr64
(I presume this extra label was optimized out of the goto 1's assembly.) I would assume that this is completely insignificant though.
clang++:
xorl %ebx, %ebx
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
movl $_ZSt4cout, %edi
movl %ebx, %esi
callq _ZNSolsEi
addl $1, %ebx
cmpl $100, %ebx
jne .LBB0_1
# %bb.2:
In conclusion/TL;DR: No, there does not appear to be any difference whatsoever between any of the possible equivalent arrangements of label: and goto, do {} while, while {}, and for () {}, at least on Linux using g++ 9.3.0 and clang++ 10.0.0.
Note that I did not test break and continue here; however, given that the assembly code generated for each of the 4 in any scenario was the same, I can only presume that they would be the exact same for break and continue, especially since the assembly is using labels and jumps for every scenario.
To ensure correct results, I was very meticulous in my process and also used Visual Studio Code's compare files feature.

There are several niche verticals where goto is still commonly used as a standard practice, by some very very smart folks and there is no bias against goto in those settings. I used to work at a simulations focused company where all local fortran code had tons of gotos, the team was super smart, and the software worked near flawlessly.
So, we can leave the merit of goto aside, and if the question merely is to compare the loops, then we do so by profiling and/or comparing the assembly code. That said however, the question includes statements like printf etc. You can't really have a discussion about loop control logic optimization when doing that. Also, as others have pointed out, the given loops will all generate VERY similar machine codes.
All conditional branches are considered "taken" (true) in pipelined processor architectures anyway until decode phase, in addition to small loops being usually expanded to be loopless. So, in line with Harper's point above, it is unrealistic for goto to have any advantage whatsoever in simple loop control (just as for or while don't have an advantage over each other). GOTO makes sense usually in multiple nested loops or multiple nested ifs, when adding the additional condition checked by goto into EACH of the nested loops or nested ifs is suboptimal.
When optimizing a search in a simple loop, using a sentinel is sometimes more effective than anything else. Essentially, by adding a dummy value at the end of the array, you reduce the two checks per iteration (end of array and value found) to just one (value found), which saves compare operations. I am unaware whether compilers do that automatically.
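A minimal sketch of the sentinel technique, assuming the caller can spare one extra writable slot at the end of the array. The function name find_with_sentinel is made up for this example.

```c
#include <stddef.h>

/* Searches data[0..n-1] for key.
 * Assumption: data[n] exists and is writable, so it can hold the sentinel.
 * Returns the index of the first match, or -1 if key is not present. */
long find_with_sentinel(int *data, size_t n, int key)
{
    size_t i = 0;

    data[n] = key;              /* sentinel: the loop is guaranteed to stop */
    while (data[i] != key)      /* one comparison per element, no bounds check */
        ++i;

    return (i < n) ? (long)i : -1;   /* stopping at index n means "not found" */
}
```

The bounds check moves out of the loop and is done once at the end. The cost is that the function writes to the array, so it is not suitable for read-only or shared data.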

goto Loop:
start_Chicken:
{
++x;
if (x >= loops)
goto end_Chicken;
}
goto start_Chicken;
end_Chicken:
x = 0;
for Loop:
for (int i = 0; i < loops; i++)
{
}
while Loop:
while (z <= loops)
{
++z;
}
z = 0;
[Image: benchmark results]
In every situation, including more mixed tests, the while loop had minimally but still better results.

Related

GCC/x86 inline asm: How do you tell gcc that inline assembly section will modify %esp?

While trying to make some old code work again (https://github.com/chaos4ever/chaos/blob/master/libraries/system/system_calls.h#L387, FWIW) I discovered that some of the semantics of gcc seem to have changed in a quite subtle but still dangerous way over the last 10-15 years... :P
The code used to work well with older versions of gcc, like 2.95. Anyway, here is the code:
static inline return_type system_call_service_get(const char *protocol_name, service_parameter_type *service_parameter,
tag_type *identification)
{
return_type return_value;
asm volatile("pushl %2\n"
"pushl %3\n"
"pushl %4\n"
"lcall %5, $0"
: "=a" (return_value),
"=g" (*service_parameter)
: "g" (identification),
"g" (service_parameter),
"g" (protocol_name),
"n" (SYSTEM_CALL_SERVICE_GET << 3));
return return_value;
}
The problem with the code above is that gcc (4.7 in my case) will compile this to the following asm code (AT&T syntax):
# 392 "../system/system_calls.h" 1
pushl 68(%esp) # This pointer (%esp + 0x68) is valid when the inline asm is entered.
pushl %eax
pushl 48(%esp) # ...but this one is not (%esp + 0x48), since two dwords have now been pushed onto the stack, so %esp is not what the compiler expects it to be
lcall $456, $0
# Restoration of %esp at this point is done in the called method (i.e. lret $12)
The problem: the variables (identification and protocol_name) are on the stack in the calling context. So gcc (with optimizations turned on; I'm unsure whether that matters) will just get the values from there and hand them over to the inline asm section. But since I'm pushing stuff on the stack, the offsets that gcc calculates will be off by 8 in the third push (pushl 48(%esp)). :)
This took me a long time to figure out; it wasn't at all obvious to me at first.
The easiest way around this is of course to use the r input constraint, to ensure that the value is in a register instead. But is there another, better way? One obvious way would of course be to rewrite the whole system call interface to not push stuff on the stack in the first place (and use registers instead, like e.g. Linux), but that's not a refactoring I feel like doing tonight...
Is there any way to tell gcc inline asm that "the stack is volatile"? How have you guys been handling stuff like this in the past?
Update later the same evening: I did find a relevant gcc ML thread (https://gcc.gnu.org/ml/gcc-help/2011-06/msg00206.html) but it didn't seem to help. It seems like specifying %esp in the clobber list should make it do offsets from %ebp instead, but it doesn't work, and I suspect that -O2 -fomit-frame-pointer has an effect here. I have both of these flags enabled.

What works and what doesn't:
I tried omitting -fomit-frame-pointer. No effect whatsoever. I included %esp, esp and sp in the list of clobbers.
I tried omitting -fomit-frame-pointer and -O3. This actually produces code that works, since it relies on %ebp rather than %esp.
pushl 16(%ebp)
pushl 12(%ebp)
pushl 8(%ebp)
lcall $456, $0
I tried with just having -O3 and not -fomit-frame-pointer specified in my command line. Creates bad, broken code (relies on %esp being constant within the whole assembly block, i.e. no stack frame).
I tried with skipping -fomit-frame-pointer and just using -O2. Broken code, no stack frame.
I tried with just using -O1. Broken code, no stack frame.
I tried adding cc as clobber. No can do, doesn't make any difference whatsoever.
I tried changing the input constraints to ri, giving the input & output code below. This of course works but is slightly less elegant than I had hoped. Then again, perfect is the enemy of good so maybe I will have to live with this for now.
Input C code:
static inline return_type system_call_service_get(const char *protocol_name, service_parameter_type *service_parameter,
tag_type *identification)
{
return_type return_value;
asm volatile("pushl %2\n"
"pushl %3\n"
"pushl %4\n"
"lcall %5, $0"
: "=a" (return_value),
"=g" (*service_parameter)
: "ri" (identification),
"ri" (service_parameter),
"ri" (protocol_name),
"n" (SYSTEM_CALL_SERVICE_GET << 3));
return return_value;
}
Output asm code. As can be seen, it now uses registers, which should always be safe (but maybe somewhat less performant, since the compiler has to move stuff around):
#APP
# 392 "../system/system_calls.h" 1
pushl %esi
pushl %eax
pushl %ebx
lcall $456, $0

Will compilers optimize double logical negation in conditionals?

Consider the following hypothetical type:
typedef struct Stack {
unsigned long len;
void **elements;
} Stack;
And the following hypothetical macros for dealing with the type (purely for enhanced readability.) In these macros I am assuming that the given argument has type (Stack *) instead of merely Stack (I can't be bothered to type out a _Generic expression here.)
#define stackNull(stack) (!(stack)->len)
#define stackHasItems(stack) ((stack)->len)
Why do I not simply use !stackNull(x) for checking if a stack has items? I thought that this would be slightly less efficient (read: not noticeable at all really, but I thought it was interesting) than simply checking stack->len because it would lead to double negation. In the following case:
int thingy = !!31337;
printf("%d\n", thingy);
if (thingy)
doSomethingImportant(thingy);
The string "1\n" would be printed, and it would be impossible to optimize the conditional (well, actually only impossible if the thingy variable didn't have a constant initializer or was modified before the test, but we'll say in this instance that 31337 is not a constant) because (!!x) is guaranteed to be either 0 or 1.
But I'm wondering if compilers will optimize something like the following
int thingy = wellOkaySoImNotAConstantThingyAnyMore();
if (!!thingy)
doSomethingFarLessImportant();
Will this be optimized to actually just use (thingy) in the if statement, as if the if statement had been written as
if (thingy)
doSomethingFarLessImportant();
If so, does it expand to (!!!!!thingy) and so on? (however this is a slightly different question, as this can be optimized in any case, !thingy is !!!!!thingy no matter what, just like -(-(-(1))) = -1.)
In the question title I said "compilers"; by that I mean that I am asking whether any compiler does this. However, I am particularly interested in how GCC behaves in this instance, as it is my compiler of choice.

This seems like a pretty reasonable optimization and a quick test using godbolt with this code (see it live):
#include <stdio.h>
void func( int x)
{
if( !!x )
{
printf( "first\n" ) ;
}
if( !!!x )
{
printf( "second\n" ) ;
}
}
int main()
{
int x = 0 ;
scanf( "%d", &x ) ;
func( x ) ;
}
seems to indicate gcc does well, it generates the following:
func:
testl %edi, %edi # x
jne .L4 #,
movl $.LC1, %edi #,
jmp puts #
.L4:
movl $.LC0, %edi #,
jmp puts #
we can see from the first line:
testl %edi, %edi # x
it just uses x without doing any operations on it, also notice the optimizer is clever enough to combine both tests into one since if the first condition is true the other must be false.
Note I used printf and scanf for side effects to prevent the optimizer from optimizing all the code away.
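The reason this optimization is valid can be shown with a short sketch: !!x yields 1 for any nonzero x and 0 otherwise, which is the same truth value an if statement sees for x itself, and an odd number of negations collapses to a single one. The function names below are illustrative only.

```c
/* Both return 1 when x is nonzero and 0 when x is zero. */
int truthy_plain(int x)  { return x != 0; }
int truthy_double(int x) { return !!x; }

/* Both return 1 when x is zero and 0 otherwise:
 * five negations behave like one. */
int falsy_single(int x)  { return !x; }
int falsy_quint(int x)   { return !!!!!x; }
```

Since the pairs agree for every input, a compiler is free to test x directly in a condition, which matches the single testl in the output above.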

Can A C Compiler Eliminate This Conditional Test At Runtime?

Let's say I have pseudocode like this:
main() {
BOOL b = get_bool_from_environment(); //get it from a file, network, registry, whatever
while(true) {
do_stuff(b);
}
}
do_stuff(BOOL b) {
if(b)
path_a();
else
path_b();
}
Now, since we know that the external environment can influence get_bool_from_environment() to potentially produce either a true or false result, then we know that the code for both the true and false branches of if(b) must be included in the binary. We can't simply omit path_a(); or path_b(); from the code.
BUT -- we only set BOOL b the one time, and we always reuse the same value after program initialization.
If I were to make this valid C code and then compile it using gcc -O0, the if(b) would be repeatedly evaluated on the processor each time that do_stuff(b) is invoked, which inserts what are, in my opinion, needless instructions into the pipeline for a branch that is basically static after initialization.
If I were to assume that I actually had a compiler that was as stupid as gcc -O0, I would re-write this code to include a function pointer, and two separate functions, do_stuff_a() and do_stuff_b(), which don't perform the if(b) test, but simply go ahead and perform one of the two paths. Then, in main(), I would assign the function pointer based on the value of b, and call that function in the loop. This eliminates the branch, though it admittedly adds a memory access for the function pointer dereference (due to architecture implementation I don't think I really need to worry about that).
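A sketch of that manual rewrite is below. To keep it testable, the loop runs a fixed number of iterations instead of forever, and do_stuff_a and do_stuff_b only bump counters; the counters, the iterations parameter, and run_loop itself are assumptions added for this illustration.

```c
/* Hypothetical stand-ins for the two paths: they only count their calls. */
int calls_a = 0;
int calls_b = 0;

void do_stuff_a(void) { ++calls_a; }
void do_stuff_b(void) { ++calls_b; }

/* Picks the function once, then calls it `iterations` times
 * without testing b again inside the loop. */
void run_loop(int b, int iterations)
{
    void (*do_stuff)(void) = b ? do_stuff_a : do_stuff_b;  /* b tested once */

    for (int i = 0; i < iterations; ++i)
        do_stuff();             /* indirect call, no branch on b */
}
```

The branch on b is replaced by an indirect call, which has its own cost, so this is a trade rather than a guaranteed win.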
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()? If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I realize that compilers can't generate dynamic code at runtime, and the only types of systems that could do that in principle would be bytecode virtual machines or interpreters (e.g. Java, .NET, Ruby, etc.) -- so the question remains whether or not it is possible to do this statically and generate code that contains both the path_a(); branch and the path_b() branch, but avoid evaluating the conditional test if(b) for every call of do_stuff(b);.

If you tell your compiler to optimise, you have a good chance that the if(b) is evaluated only once.
Slightly modifying the given example, using the standard _Bool instead of BOOL, and adding the missing return types and declarations,
_Bool get_bool_from_environment(void);
void path_a(void);
void path_b(void);
void do_stuff(_Bool b) {
if(b)
path_a();
else
path_b();
}
int main(void) {
_Bool b = get_bool_from_environment(); //get it from a file, network, registry, whatever
while(1) {
do_stuff(b);
}
}
the (relevant part of the) produced assembly by clang -O3 [clang-3.0] is
callq get_bool_from_environment
cmpb $1, %al
jne .LBB1_2
.align 16, 0x90
.LBB1_1: # %do_stuff.exit.backedge.us
# =>This Inner Loop Header: Depth=1
callq path_a
jmp .LBB1_1
.align 16, 0x90
.LBB1_2: # %do_stuff.exit.backedge
# =>This Inner Loop Header: Depth=1
callq path_b
jmp .LBB1_2
b is tested only once, and main jumps into an infinite loop of either path_a or path_b depending on the value of b. If path_a and path_b are small enough, they would be inlined (I strongly expect). With -O and -O2, the code produced by clang would evaluate b in each iteration of the loop.
gcc (4.6.2) behaves similarly with -O3:
call get_bool_from_environment
testb %al, %al
jne .L8
.p2align 4,,10
.p2align 3
.L9:
call path_b
.p2align 4,,6
jmp .L9
.L8:
.p2align 4,,8
call path_a
.p2align 4,,8
call path_a
.p2align 4,,5
jmp .L8
Oddly, it unrolled the loop for path_a, but not for path_b. With -O2 or -O, however, it would call do_stuff in the infinite loop.
Hence to
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()?
the answer is a definitive Yes, it is possible for compilers to recognize this and take advantage of that fact. Good compilers do when asked to optimise hard.
If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I don't know the name of the optimisation, but two implementations doing that are gcc and clang (at least, recent enough releases).
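For what it's worth, moving a loop-invariant test out of a loop and duplicating the loop for each outcome is commonly called loop unswitching; GCC has a -funswitch-loops option for it, which to my knowledge is among the flags enabled at -O3. The sketch below shows the transformation done by hand. It uses a bounded loop and counting stand-ins for path_a and path_b so that it can be checked; those details are assumptions for illustration, not part of the question.

```c
/* Hypothetical stand-ins for path_a() and path_b() that count their calls. */
int path_a_calls = 0;
int path_b_calls = 0;

void path_a(void) { ++path_a_calls; }
void path_b(void) { ++path_b_calls; }

/* Loop as written in the question: b is tested on every iteration. */
void loop_original(int b, int n)
{
    for (int i = 0; i < n; ++i) {
        if (b)
            path_a();
        else
            path_b();
    }
}

/* Hand-unswitched equivalent: b is tested once, outside two specialised loops. */
void loop_unswitched(int b, int n)
{
    if (b) {
        for (int i = 0; i < n; ++i)
            path_a();
    } else {
        for (int i = 0; i < n; ++i)
            path_b();
    }
}
```

The second function has the same structure as the assembly listings above: one test, followed by a jump into one of two loops.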

Illegal instruction when running a minimal OpenMP program

This minimal OpenMP program
#include <omp.h>
int main()
{
#pragma omp parallel sections
{
#pragma omp section
{
while(1) {}
}
#pragma omp section
{
while(1) {}
}
}
}
will produce this error when compiled and run with gcc test.c -fopenmp:
Illegal instruction (core dumped)
When I change either one of the loops to
int i=1;
while(i++) {}
or any other condition, it compiles and runs without error. It seems that 1 as a loop condition in different threads causes some strange behaviour. Why?
edit: I am using gcc 4.6.3
edit: This is a bug in gcc and was submitted as Bug 54017 to the gcc developers.

This is apparently a bug in GCC. GCC implements OpenMP sections using the GOMP_sections_start() routine from libgomp that returns a 1-based section ID that the calling thread should execute or 0 if all work items have been distributed. Basically the transformed code should look like:
main._omp_fn.0 (void * .omp_data_i)
{
unsigned int .section.1;
.section.1 = GOMP_sections_start(2);
L0:
switch (.section.1)
{
case 0:
// No more sections to run, exit
goto L2;
case 1:
// Do section 1
while (1) {}
goto L1;
case 2:
// Do section 2
while (1) {}
goto L1;
default:
// Impossible section value, possible error in libgomp
__builtin_trap();
}
L1:
.section.1 = GOMP_sections_next();
goto L0;
L2:
GOMP_sections_end_nowait();
return;
}
What happens in your case is that both the default and the 0 case lead to __builtin_trap(). __builtin_trap() is a GCC built-in that is supposed to terminate your program abnormally; on x86 it emits the ud2 instruction, which makes the CPU bark with an illegal opcode exception. It is usually put in places where code should never execute: all possible correct return values from GOMP_sections_start() and GOMP_sections_next() should be covered by the cases in the switch, and if the default is reached (signalling a possible bug in libgomp) the program should fail and you will complain to the developers :)
Edit: This is definitely not expected OpenMP behaviour and it does not happen with icc or suncc. I have submitted Bug 54017 to the GCC Bugzilla.
Edit 2: I updated the text to more closely reflect what GCC should produce. It looks like GCC is getting the wrong impression of the control flow in the parallel region and does some "optimisations" that further spoil code generation.

SIGILL is generated because there is an illegal instruction, ud2/ud2a.
According to http://asm.inightmare.org/opcodelst/index.php?op=UD2:
This instruction caused #UD. Intel guaranteed that in future Intel's
CPUs this instruction will caused #UD. Of course all previous CPUs
(186+) caused #UD on this opcode. This instruction used by software
writers for testing #UD exception servise routine.
Let's look inside:
$ gcc-4.6.2 -fopenmp omp.c -o omp
$ gdb ./omp
...
(gdb) r
Program received signal SIGILL, Illegal instruction.
...
0x08048544 in main._omp_fn.0 ()
(gdb) x/i $pc
0x8048544 <main._omp_fn.0+28>: ud2a
(gdb) disassemble
Dump of assembler code for function main._omp_fn.0:
0x08048528 <main._omp_fn.0+0>: push %ebp
0x08048529 <main._omp_fn.0+1>: mov %esp,%ebp
0x0804852b <main._omp_fn.0+3>: sub $0x18,%esp
0x0804852e <main._omp_fn.0+6>: movl $0x2,(%esp)
0x08048535 <main._omp_fn.0+13>: call 0x80483f0 <GOMP_sections_start@plt>
0x0804853a <main._omp_fn.0+18>: cmp $0x1,%eax
0x0804853d <main._omp_fn.0+21>: je 0x8048548 <main._omp_fn.0+32>
0x0804853f <main._omp_fn.0+23>: cmp $0x2,%eax
0x08048542 <main._omp_fn.0+26>: je 0x8048546 <main._omp_fn.0+30>
0x08048544 <main._omp_fn.0+28>: ud2a
0x08048546 <main._omp_fn.0+30>: jmp 0x8048546 <main._omp_fn.0+30>
0x08048548 <main._omp_fn.0+32>: jmp 0x8048548 <main._omp_fn.0+32>
End of assembler dump.
The ud2a is already present in the assembler file:
$ gcc-4.6.2 -fopenmp omp.c -o omp.S -S; cat omp.S
main._omp_fn.0:
.LFB1:
pushl %ebp
.LCFI4:
movl %esp, %ebp
.LCFI5:
subl $24, %esp
.LCFI6:
movl $2, (%esp)
call GOMP_sections_start
cmpl $1, %eax
je .L4
cmpl $2, %eax
je .L5
.value 0x0b0f
.value 0x0b0f is the encoding of ud2a.
After verifying that ud2a was inserted intentionally by gcc (in the early OpenMP phases), I tried to understand the code. The function main._omp_fn.0 is the body of the parallel code; it calls GOMP_sections_start and checks its return value. If the value equals 1, we jump to the first infinite loop; if it is 2, we jump to the second infinite loop. In any other case ud2a is executed. (I don't know why, but according to Hristo Iliev this is GCC Bug 54017.)
I think this test is a good way to check how many CPU cores there are. By default GCC's OpenMP library (libgomp) starts a thread for every CPU core in your system (in my case there were 4 threads). The sections are handed out in order: the first section to the first thread, the second section to the second thread, and so on.
There is no SIGILL if I run the program on 1 or 2 CPUs (the taskset option is the CPU mask in hex):
$ taskset 3 ./omp
... running on cpu0 and cpu1 ...
$ taskset 1 ./omp
... running first loop on cpu0; then run second loop on cpu0...

gcc removes inline assembler code

It seems like gcc 4.6.2 removes code it considers unused from functions.
test.c
int main(void) {
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
Disassembly of main()
0x08048404 <+0>: push ebp
0x08048405 <+1>: mov ebp,esp
0x08048407 <+3>: nop # <-- This is all that's left of my jmp.
0x08048408 <+4>: mov eax,0x0
0x0804840d <+9>: pop ebp
0x0804840e <+10>: ret
Compiler options
No optimizations enabled, just gcc -m32 -o test test.c (-m32 because I'm on a 64 bit machine).
How can I stop this behavior?
Edit: Preferably by using compiler options, not by modifying the code.

Looks like that's just the way it is - When gcc sees that code within a function is unreachable, it removes it. Other compilers might be different.
In gcc, an early phase in compilation is building the "control flow graph" - a graph of "basic blocks", each free of conditions, connected by branches. When emitting the actual code, parts of the graph, which are not reachable from the root, are discarded.
This isn't part of the optimization phase, and is therefore unaffected by compilation options.
So any solution would involve making gcc think that the code is reachable.
My suggestion:
Instead of putting your assembly code in an unreachable place (where GCC may remove it), you can put it in a reachable place, and skip over the problematic instruction:
int main(void) {
goto exit;
exit:
__asm__ __volatile__ (
"jmp 1f\n"
"jmp 0x0\n"
"1:\n"
);
return 0;
}
Also, see this thread about the issue.

I do not believe there is a reliable way using just compile options to solve this. The preferable mechanism is something that will do the job and work on future versions of the compiler regardless of the options used to compile.
Commentary about Accepted Answer
In the accepted answer there is an edit to the original that suggests this solution:
int main(void) {
__asm__ ("jmp exit");
handler:
__asm__ __volatile__("jmp $0x0");
exit:
return 0;
}
First off jmp $0x0 should be jmp 0x0. Secondly C labels usually get translated into local labels. jmp exit doesn't actually jump to the label exit in the C function, it jumps to the exit function in the C library effectively bypassing the return 0 at the bottom of main. Using Godbolt with GCC 4.6.4 we get this non-optimized output (I have trimmed the labels we don't care about):
main:
pushl %ebp
movl %esp, %ebp
jmp exit
jmp 0x0
.L3:
movl $0, %eax
popl %ebp
ret
.L3 is actually the local label for exit. You won't find the exit label in the generated assembly. It may compile and link if the C library is present. Do not use C local goto labels in inline assembly like this.
Use asm goto as the Solution
As of GCC 4.5 (OP is using 4.6.x) there is support for asm goto extended assembly templates. asm goto allows you to specify jump targets that the inline assembly may use:
6.45.2.7 Goto Labels
asm goto allows assembly code to jump to one or more C labels. The GotoLabels section in an asm goto statement contains a comma-separated list of all C labels to which the assembler code may jump. GCC assumes that asm execution falls through to the next statement (if this is not the case, consider using the __builtin_unreachable intrinsic after the asm statement). Optimization of asm goto may be improved by using the hot and cold label attributes (see Label Attributes).
An asm goto statement cannot have outputs. This is due to an internal restriction of the compiler: control transfer instructions cannot have outputs. If the assembler code does modify anything, use the "memory" clobber to force the optimizers to flush all register values to memory and reload them if necessary after the asm statement.
Also note that an asm goto statement is always implicitly considered volatile.
To reference a label in the assembler template, prefix it with ‘%l’ (lowercase ‘L’) followed by its (zero-based) position in GotoLabels plus the number of input operands. For example, if the asm has three inputs and references two labels, refer to the first label as ‘%l3’ and the second as ‘%l4’).
Alternately, you can reference labels using the actual C label name enclosed in brackets. For example, to reference a label named carry, you can use ‘%l[carry]’. The label must still be listed in the GotoLabels section when using this approach.
The code could be written this way:
int main(void) {
__asm__ goto ("jmp %l[exit]" :::: exit);
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
We can use asm goto. I prefer __asm__ over asm since it will not throw warnings if compiling with -ansi or -std=? options.
After the clobbers you can list the jump targets the inline assembly may use. C doesn't actually know if we jump or not as GCC doesn't analyze the actual code in the inline assembly template. It can't remove this jump, nor can it assume what comes after is dead code. Using Godbolt with GCC 4.6.4 the unoptimized code (trimmed) looks like:
main:
pushl %ebp
movl %esp, %ebp
jmp .L2 # <------ this is the goto exit
jmp 0x0
.L2: # <------ exit label
movl $0, %eax
popl %ebp
ret
With optimizations enabled, the GCC 4.6.4 output on Godbolt still looks correct and appears as:
main:
jmp .L2 # <------ this is the goto exit
jmp 0x0
.L2: # <------ exit label
xorl %eax, %eax
ret
This mechanism should also work whether you have optimizations on or off, and shouldn't matter whether you are compiling for 64-bit or 32-bit x86 targets.
Other Observations
When there are no output constraints in an extended inline assembly template the asm statement is implicitly volatile. The line
__asm__ __volatile__("jmp 0x0");
can be written as:
__asm__ ("jmp 0x0");
asm goto statements are considered implicitly volatile. They don't require a volatile modifier either.

Would this work? It makes it so that GCC can't know the code is unreachable:
int main(void)
{
volatile int y = 1;
if (y) goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}

If a compiler thinks it can cheat you, just cheat back: (GCC only)
int main(void) {
{
/* Place this code anywhere in the same function, where
* control flow is known to still be active (such as at the start) */
extern volatile unsigned int some_undefined_symbol;
__asm__ __volatile__(".pushsection .discard" : : : "memory");
if (some_undefined_symbol) goto handler;
__asm__ __volatile__(".popsection" : : : "memory");
}
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
This solution adds no overhead in the form of meaningless instructions, though it only works for GCC when used with the GNU assembler, as (which is the default).
Explanation: .pushsection switches the compiler's text output to another section, in this case .discard (which is deleted during linking by default). The "memory" clobber prevents GCC from moving other code into the section that will be discarded. However, GCC doesn't realize (and never could, because the __asm__ statements are __volatile__) that anything emitted between the two statements will be discarded.
As for some_undefined_symbol, that is literally just any symbol that is never defined (or is actually defined; it shouldn't matter). Since the section of code using it is discarded during linking, it won't produce any unresolved-reference errors either.
Finally, the conditional jump to the label makes the label appear reachable. Besides the fact that the jump won't appear in the output binary at all, GCC realizes that it can't know anything about some_undefined_symbol, so it has no choice but to assume that both branches of the if are reachable. As far as it is concerned, control flow can continue either by reaching goto exit or by jumping to handler (even though no code in the final binary could actually do this).
However, be careful when enabling garbage collection in your linker (ld --gc-sections, disabled by default), because it might then decide to get rid of the still-unused label regardless.
EDIT:
Forget all that. Just do this:
int main(void) {
__asm__ __volatile__ goto("" : : : : handler);
goto exit;
handler:
__asm__ __volatile__("jmp 0x0");
exit:
return 0;
}
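Why the one-liner above works can be shown with a small sketch. The names keep_handler and handler_runs are hypothetical, and a counter stands in for the real handler body so the effect can be observed without any target-specific jump. The assembly template is empty, so nothing is emitted and nothing ever jumps, but the compiler has to assume the asm may transfer control to the listed label and therefore cannot discard the code behind it.

```c
/* Sketch only: keep_handler and handler_runs are hypothetical names. */
int handler_runs = 0;

int keep_handler(void)
{
    /* Empty template: no instructions are emitted. Listing `handler`
     * in GotoLabels makes the compiler treat the label as reachable. */
    __asm__ goto ("" : : : : handler);
    return 0;               /* the path actually taken at run time */

handler:
    handler_runs++;         /* stand-in for the real handler body */
    return 1;
}
```

At run time the function always returns 0 and the counter stays at 0, yet the handler body remains in the generated code.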

Update 2012/6/18
Just thinking about it, one can put the goto exit in an asm block, which means that only 1 line of code needs to change:
int main(void) {
__asm__ ("jmp exit");
handler:
__asm__ __volatile__("jmp $0x0");
exit:
return 0;
}
That is significantly cleaner than my other solution below (and possibly nicer than @ugoren's current one too).
This is pretty hacky, but it seems to work: hide the handler in a conditional that can never be followed under normal conditions, but stop it from being eliminated by stopping the compiler from being able to do its analysis properly with some inline assembler.
int main (void) {
int x = 0;
__asm__ __volatile__ ("" : "=r"(x));
// compiler can't tell what the value of x is now, but it's always 0
if (x) {
handler:
__asm__ __volatile__ ("jmp $0x0");
}
return 0;
}
Even with -O3 the jmp is preserved:
testl %eax, %eax
je .L2
.L3:
jmp $0x0
.L2:
xorl %eax, %eax
ret
(This seems really dodgy, so I hope there is a better way to do this. Edit: just putting volatile in front of x works too, so the inline asm trickery isn't needed.)
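The volatile variant mentioned in the edit could look roughly like this. It is a sketch under assumptions: guarded_handler and guard_hits are hypothetical names, and a counter replaces the jmp so the behavior can be checked on any target. Because every read of a volatile object must actually be performed, the compiler cannot prove the condition is false and has to keep the guarded block.

```c
/* Sketch only: guarded_handler and guard_hits are hypothetical names. */
int guard_hits = 0;

int guarded_handler(void)
{
    /* Always 0, but volatile forces a real load, so the compiler
     * cannot fold the condition below to false. */
    volatile int x = 0;

    if (x) {
        guard_hits++;       /* stand-in for the real handler body */
        return 1;
    }
    return 0;
}
```

Unlike the asm goto approach, this costs a load and a conditional branch at run time.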

I've never heard of a way to prevent gcc from removing unreachable code; it seems that no matter what you do, once gcc detects unreachable code it always removes it (use gcc's -Wunreachable-code option to see what it considers to be unreachable).
That said, you can still put this code in a static function and it won't be optimized out:
static void func(void)
{
__asm__ __volatile__("jmp $0x0");
}
int main(void)
{
goto exit;
handler:
func();
exit:
return 0;
}
P.S.
This solution is particularly handy if you want to avoid code redundancy when placing the same "handler" code block in more than one place in the original code.

GCC may duplicate asm statements inside functions and remove them during optimisation (even at -O0), so this will never work reliably.
One way to do this reliably is to use a global asm statement (i.e. an asm statement outside of any function). GCC will copy this straight to the output, and you can use global labels without any problems.
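A global asm statement of this kind might be sketched as below. Everything here is an assumption made for illustration: the symbol name global_asm_handler is hypothetical, the assembly targets x86 ELF platforms with AT&T syntax, and other platforms get a plain C definition so the file still builds. The handler simply returns 42 instead of jumping anywhere, so it is safe to call.

```c
/* Sketch only: global_asm_handler is a hypothetical symbol name. */
int global_asm_handler(void);

#if defined(__ELF__) && (defined(__x86_64__) || defined(__i386__))
/* File-scope asm: copied to the assembler output as written, so the
 * optimizer can neither duplicate nor remove it. */
__asm__(
    ".pushsection .text\n"
    ".globl global_asm_handler\n"
    ".type global_asm_handler, @function\n"
    "global_asm_handler:\n"
    "    movl $42, %eax\n"      /* return value goes in eax */
    "    ret\n"
    ".size global_asm_handler, .-global_asm_handler\n"
    ".popsection\n");
#else
/* Fallback for other targets: same result in plain C. */
int global_asm_handler(void) { return 42; }
#endif
```

The label is an ordinary global symbol, so C code can call it through the prototype, and other assembly can jump to it by name.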
