Compiler optimization causing program to run slower

Compiler optimization causing program to run slower - c

I have the following piece of code that I wrote in C. Its fairly simple as it just right bit-shifts x for every loop of for.
int main() {
int x = 1;
for (int i = 0; i > -2; i++) {
x >> 2;
}
}
Now the strange thing that is happening is that when I just compile it without any optimizations or with first level optimization (-O), it runs just fine (I am timing the executable and its about 1.4s with -O and 5.4s without any optimizations.
Now when I add -O2 or -O3 switch for compilation and time the resulting executable, it doesn't stop (I have tested for up to 60s).
Any ideas on what might be causing this?

The optimized loop is producing an infinite loop which is a result of you depending on signed integer overflow. Signed integer overflow is undefined behavior in C and should not be depended on. Not only can it confuse developers it may also be optimized out by the compiler.
Assembly (no optimizations): gcc -std=c99 -S -O0 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $1, -4(%rbp)
movl $0, -8(%rbp)
jmp L2
L3:
incl -8(%rbp)
L2:
cmpl $-2, -8(%rbp)
jg L3
movl $0, %eax
leave
ret
Assembly (optimized level 3): gcc -std=c99 -S -O3 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
L2:
jmp L2 #<- infinite loop

You will get the definitive answer by looking at the binary that's produced (using objdump or something).
But as others have noted, this is probably because you're relying on undefined behaviour. One possible explanation is that the compiler is free to assume that i will never be less than -2, and so will eliminate the conditional entirely, and convert this into an infinite loop.
Also, your code has no observable side effects, so the compiler is also free to optimise the entire program away to nothing, if it likes.

Additional information about why integer overflows are undefined can be found here:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Search for the paragraph "Signed integer overflow".

Related

Return value from writing an unused parameter when falling off the end of a non-void function

In this golfing answer I saw a trick where the return value is the second parameter which is not passed in.
int f(i, j)
{
j = i;
}
int main()
{
return f(3);
}
From gcc's assembly output it looks like when the code copies j = i it stores the result in eax which happens to be the return value.
f:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $3, %edi
movl $0, %eax
call f
popq %rbp
ret
So, did this happen just by being lucky? Is this documented by gcc? It only works with -O0, but it works with a bunch of values of i I tried, -m32, and a bunch of different versions of GCC.

gcc -O0 likes to evaluate expressions in the return-value register, if a register is needed at all. (GCC -O0 generally just likes to have values in the retval register, but this goes beyond picking that as the first temporary.)
I've tested a bit, and it really looks like GCC -O0 does this on purpose across multiple ISAs, sometimes even using an extra mov instruction or equivalent. IIRC I made an expression more complicated so the result of evaluation ended up in another register, but it still copied it back to the retval register.
Things like x++ that can (on x86) compile to a memory-destination inc or add won't leave the value in a register, but assignments typically will. So it's note quite like GCC is treating function bodies like GNU C statement-expressions.
This is not documented, guaranteed, or standardized by anything. It's an implementation detail, not something intended for you to take advantage of like this.
"Returning" a value this way means you're programming in "GCC -O0", not C. The wording of the code-golf rules says that programs have to work on at least one implementation. But my reading of that is that they should work for the right reasons, not because of some side-effect implementation detail. They break on clang not because clang doesn't support some language feature, just because they're not even written in C.
Breaking with optimization enabled is also not cool; some level of UB is generally acceptable in code golf, like integer wraparound or pointer-casting type punning being things that one might reasonably wish were well-defined. But this is pure abuse of an implementation detail of one compiler, not a language feature.
I argued this point in comments under the relevant answer on Codegolf.SE C golfing tips Q&A (Which incorrectly claims it works beyond GCC). That answer has 4 downvotes (and deserves more IMO), but 16 upvotes. So some members of the community disagree that this is terrible and silly.
Fun fact: in ISO C++ (but not C), having execution fall off the end of a non-void function is Undefined Behaviour, even if the caller doesn't use the result. This is true even in GNU C++; outside of -O0 GCC and clang will sometimes emit code like ud2 (illegal instruction) for a path of execution that reaches the end of a function without a return. So GCC doesn't in general define the behaviour here (which implementations are allowed to do for things that ISO C and C++ leaves undefined. e.g. gcc -fwrapv defines signed overflow as 2's complement wraparound.)
But in ISO C, it's legal to fall off the end of a non-void function: it only becomes UB if the caller uses the return value. Without -Wall GCC may not even warn. Checking return value of a function without return statement
With optimization disabled, function inlining won't happen so the UB isn't really compile-time visible. (Unless you use __attribute__((always_inline))).
Passing a 2nd arg merely gives you something to assign to. It's not important that it's a function arg. But i=i; optimizes away even with -O0 so you do need a separate variable. Also just i; optimizes away.
Fun fact: a recursive f(i){ f(i); } function body does bounce i through EAX before copying it to the first arg-passing register. So GCC just really loves EAX.
movl -4(%rbp), %eax
movl %eax, %edi
movl $0, %eax # without a full prototype, pass # of FP args in AL
call f
i++; doesn't load into EAX; it just uses a memory-destination add without loading into a register. Worth trying with gcc -O0 for ARM.

The assembly of “b++”

In C language,what's the assemble of "b++".
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
the .s file(++i use addl instrction, i++ use other)：
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",#progbits

The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line, short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels).

The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then res=i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand mixing an opcode from once line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using res = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)

Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.

It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0 set up basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.

if statement in assembly ouput of c code

i have this simple piece of code in c:
#include <stdio.h>
void test() {}
int main()
{
if (2 < 3) {
int zz = 10;
}
return 0;
}
when i see the assembly output of this code:
test():
pushq %rbp
movq %rsp, %rbp
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $10, -4(%rbp) // space is created for zz on stack
movl $0, %eax
popq %rbp
ret
i got the assembly from here (default options)
I can't see where is the instruction for the conditional check?

You don't see it, because it isn't there. The compiler was able to perform analysis, and rather easily see that this branch will always be entered.
Instead of emitting a check that will do nothing but waste CPU cycles, it emits an easily optimized version of the code.
A C program is not a sequence of instructions for the CPU to perform. That's what the emitted machine code is. A C program is a description of the behavior your compiled program should have. A compiler is free to translate it in almost any way it wants, so long as you get that behavior.
It's known as "the as-if rule".

The interesting thing here is that gcc and clang optimize away the if() even at -O0, unlike some other compilers (ICC and MSVC).
gcc -O0 doesn't mean no optimization, it means no extra optimization beyond what's needed to compile at all. But gcc does have to transform through a couple internal representations of the function logic before emitting asm. (GIMPLE and Register Transfer Language). gcc doesn't have a special "dumb mode" where it slavishly transliterates every part of every C expression to asm.
Even a super-simple one-pass compiler like TCC does minor optimizations within an expression (or even a statement), like realizing that an always-true condition doesn't require branching.
gcc -O0 is the default, which you obviously used because the dead store to zz isn't optimized away.
gcc -O0 aims to compile quickly, and to give consistent debugging results.
Every C variable exists in memory, whether it's ever used or not.
Nothing is kept in registers across C statements (except variables declared register; -O0 is the only time that keyword does anything). So you can modify any C variable with a debugger while single-stepping. i.e. spill/reload everything between separate C statements. See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? (This is why benchmarking for -O0 is nonsense: writing the same code with fewer larger expressions is faster only at -O0, not with real settings like -O3).
Other interesting consequences: constant-propagation doesn't work, see Why does integer division by -1 (negative one) result in FPE? for a case where gcc uses div for a variable set to a constant, vs. something simpler for a literal constant.
Every statement is compiled independently, so you can even jump to a different source line (within the same function) using GDB and get consistent results. (Unlike in optimized code where that would be likely to crash or give nonsense, and definitely not match the C abstract machine).
Given all those requirements for gcc -O0 behaviour, if (2 < 3) can still be optimized to zero asm instructions. The behaviour doesn't depend on the value of any variable, and it's a single statement. There's no way it can ever be not-taken, so the simplest way to compile it is no instructions: fall-through into the { body } of the if.
Note that gcc -O0's rules / restrictions go far beyond the C as-if rule that the machine-code for a function merely has to implement all externally-visible behaviour of the C source. gcc -O3 optimizes the whole function down to just
main: # with optimization
xor eax, eax
ret
because it doesn't care about keeping asm for every C statement.
Other compilers:
See all 4 of the major x86 compilers on Godbolt.
clang is similar to gcc, but with a dead store of 0 to another spot on the stack, as well as the 10 for zz. clang -O0 is often closer to a transliteration of C into asm, for example it will use div for x / 2 instead of a shift, while gcc uses a multiplicative inverse for division by a constant even at -O0. But in this case, clang also decides that no instructions are sufficient for an always-true condition.
ICC and MSVC both emit asm for the branch, but instead of the mov $2, %ecx / cmp $3, %ecx you might expect, they both actually do 0 != 1 for no apparent reason:
# ICC18
pushq %rbp #6.1
movq %rsp, %rbp #6.1
subq $16, %rsp #6.1
movl $0, %eax #7.5
cmpl $1, %eax #7.5
je ..B1.3 # Prob 100% #7.5
movl $10, -16(%rbp) #9.16
..B1.3: # Preds ..B1.2 ..B1.1
movl $0, %eax #11.12
leave #11.12
ret #11.12
MSVC uses the xor-zeroing peephole optimization even without optimization enabled.
It's slightly interesting to look at which local / peephole optimizations compilers do even at -O0, but it doesn't tell you anything fundamental about C language rules or your code, it just tells you about compiler internals and the tradeoffs the compiler devs chose between spending time looking for simple optimizations vs. compiling even faster in no-optimization mode.
The asm is never intended to faithfully represent the C source in any kind of way that would let a decompiler reconstruct it. Just to implement equivalent logic.

It's simple. It is not there. The compiler optimized it away.
Here is the assembly when compiling with gcc without optimization:
.file "k.c"
.text
.globl test
.type test, #function
test:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size test, .-test
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $10, -4(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",#progbits
and here it is with optimization:
.file "k.c"
.text
.p2align 4,,15
.globl test
.type test, #function
test:
.LFB11:
.cfi_startproc
rep ret
.cfi_endproc
.LFE11:
.size test, .-test
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB12:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE12:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",#progbits
As you can see, not only the comparison is optimized away. Almost the whole main is optimized away since it does not produce anything visible. The variable zz is never used. The only observable thing your code does is returning 0.

2 is always less tan 3 so, as the compiler know the result of 2<3 is always true, there is no need for an if decision in assembler.
The optimization means to generate less time / less code.

if (2<3)
is allways true, therefore the Compiler emmits no opcode for it.

The condition if (2<3) is always true. So a decent compiler would detect this generate the code as if the condition doesn't exist. In fact, if you optimize it with -O3, godbolt.org generates just:
test():
rep ret
main:
xor eax, eax
ret
This is again valid because a compiler is allowed optimise and transform the code as long as the observable behaviour is preserved.

Correct way of unrolling loop using gcc

#include <stdio.h>
int main() {
int i;
for(i=0;i<10000;i++){
printf("%d",i);
}
}
I want to do loop unrolling on this code using gcc
but even using the flag.
gcc -O2 -funroll-all-loops --save-temps unroll.c
the assembled code i am getting contain a loop of 10000 iteration
_main:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
pushq %r14
pushq %rbx
Ltmp2:
xorl %ebx, %ebx
leaq L_.str(%rip), %r14
.align 4, 0x90
LBB1_1:
xorb %al, %al
movq %r14, %rdi
movl %ebx, %esi
callq _printf
incl %ebx
cmpl $10000, %ebx
jne LBB1_1
popq %rbx
popq %r14
popq %rbp
ret
Leh_func_end1:
Can somone plz tell me how to implement loop unrolling correctly in gcc

Loop unrolling won't give you any benefit for this code, because the overhead of the function call to printf() itself dominates the work done at each iteration. The compiler may be aware of this, and since it is being asked to optimize the code, it may decide that unrolling increases the code size for no appreciable run-time performance gain, and decides the risk of incurring an instruction cache miss is too high to perform the unrolling.
The type of unrolling required to speed up this loop would require reducing the number of calls to printf() itself. I am unaware of any optimizing compiler that is capable of doing that.
As an example of unrolling the loop to reduce the number of printf() calls, consider this code:
void print_loop_unrolled (int n) {
int i = -8;
if (n % 8) {
printf("%.*s", n % 8, "01234567");
i += n % 8;
}
while ((i += 8) < n) {
printf("%d%d%d%d%d%d%d%d",i,i+1,i+2,i+3,i+4,i+5,i+6,i+7);
}
}

gcc has maximum loops unroll parameters.
You have to use -O3 -funroll-loops and play with parameters max-unroll-times, max-unrolled-insns and max-average-unrolled-insns.
Example:
-O3 -funroll-loops --param max-unroll-times=200

Replace
printf("%d",i);
with
volatile int j = i;
and see if the loop gets unrolled.

Get Assembly code after every optimization GCC makes?

From Optimization Compiler on Wikipedia,
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources.
and GCC has a lot of optimization options.
I'd like to study the generated assembly (the one -S gives) after each optimization GCC performs when compiling with different flags like -O1, -O2, -O3, etc.
How can I do this?
Edit: My input will be C code.

Intermediate representation can be saved to files using -fdump-tree-all switch.
There are more fine-grained -fdump switches awailable.
See gcc manual for details.
To be able to read these representations, take a look into GCC internals manual.

gcc -S (Capital S)
gives asm output, but the assembler can change things so I prefer to just make an object
gcc -c -o myfile.o myfile.c
then disassemble
objdump -D myfile.o
Understand that this is not linked so external branch destinations and other external addresses will have a placeholder instead of a real number. If you want to see the optimizations compile with no optimizations (-O0) then compile with -O1 then -O2 and -O3 and see what if anything changes. There are other optimzation flags as well you can play with. To see the difference you need to compile with and without the flags and compare the differences yourself.
diff won't work, you will see why (register allocation changes).

Whilst it is possible to take a small piece of code, compile it with -S and with a variety of options, the difficulty is understanding what actually changed. It only takes a small change to make code quite different - one variable going into a register means that register is no longer available for something, causing knock-on effects to all of the remaining code in the function.
I was comparing the same code from two nearly identical functions earlier today (to do with a question on C++), and there was ONE difference in the source code. One change in which variable was used for the termination condition inside one for-loop led to over lines of assembler code changing. Because the compiler decided to arrange the registers in a different way, using a different register for one of the main variables, and then everything else changed as a consequence.
I've seen cases where adding a small change to a function moves it from being inlined to not being inlined, which in turn makes big changes to ALL of the code in the program that calls that code.
So, yes, by all means, compile very simple code with different optimisations, and use -S to inspect the compiler generated code. Then compare the different variants to see what effect it has. But unless you are used to reading assembler code, and understand what you are actually looking for, it can often be hard to see the forest for the trees.
It is also worth considering that optimisation steps often work in conjunction - one step allows another step to do its work (inline leads to branch merging, register usage, and so on).

Compile with switch -S to get the assembly code. This should work for any level of optimization.
For instance, to get the assembly code generated in O2 mode, try:
g++/gcc -S -O2 input.cpp
a corresponding input.s will be generated, which contains the assembly code generated. Repeat this for any optimization level you want.

gcc/clang performs optimizations on the intermediate representations (IR), which can be printed after each optimization pass.
for gcc it is (-fdump-tree-all) 'http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html'
with clang it is (-llvm -print-after-all).
Clang/gcc offers many more options to analyze the optimizations. It is easy to turn on/off an optimization from the command line (http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html, http://llvm.org/docs/Passes.html)
with clang-llvm you can also list the optimization passes which were performed using the command line option (-mllvm -debug-pass=Structure)

If you would like to study compiler optimization and are agnostic to the compiler, then take a look at the Clang/LLVM projects. Clang is a C compiler that can output LLVM IR and LLVM commands can apply specific optimization passes individually.
Output LLVM IR:
clang test.c -S -emit-llvm -o test.ll
Perform optimization pass:
opt test.ll -<optimization_pass> -S -o test_opt.ll
Compile to assembly:
llc test.ll -o test.s

Solution1:
gcc -O1 -S test.c (the capital O and capital S)
Solution2:
This site can also help you. You can use -O0, -O1, .. whatever suitable compiler options to get what you want.
Example from that site: (tested by both solutions)
void maxArray(double* x, double* y) {
for (int i = 0; i < 65536; i++) {
if (y[i] > x[i]) x[i] = y[i];
}
}
Compier Option -O0:
result:
maxArray(double*, double*):
pushq %rbp
movq %rsp, %rbp
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L5:
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1
ucomisd %xmm1, %xmm0
jbe .L3
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rcx
movq -32(%rbp), %rax
addq %rcx, %rax
movq (%rax), %rax
movq %rax, (%rdx)
.L3:
addl $1, -4(%rbp)
.L2:
cmpl $65535, -4(%rbp)
jle .L5
popq %rbp
ret
Compiler Option -O1:
result:
maxArray(double*, double*):
movl $0, %eax
.L5:
movsd (%rsi,%rax), %xmm0
ucomisd (%rdi,%rax), %xmm0
jbe .L2
movsd %xmm0, (%rdi,%rax)
.L2:
addq $8, %rax
cmpq $524288, %rax
jne .L5
rep; ret

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight