I wrote this simple C program:
int main() {
int i;
int count = 0;
for(i = 0; i < 2000000000; i++){
count = count + 1;
}
}
I wanted to see how the gcc compiler optimizes this loop (clearly, adding 1 a total of 2000000000 times should become "add 2000000000 once"). So:
Compiling with gcc test.c and then running time on a.out gives:
real 0m7.717s
user 0m7.710s
sys 0m0.000s
Compiling with gcc -O2 test.c and then running time on a.out gives:
real 0m0.003s
user 0m0.000s
sys 0m0.000s
Then I generated the assembly for both with gcc -S. The first one seems quite clear:
.file "test.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl $0, -8(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L3:
addl $1, -8(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $1999999999, -4(%rbp)
jle .L3
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
.L3 performs the two additions; .L2 compares -4(%rbp) (that is, i) with 1999999999 and jumps back to .L3 while i < 2000000000.
Now the optimized one:
.file "test.c"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
rep
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
I can't understand at all what's going on there! I've got little knowledge of assembly, but I expected something like
addl $2000000000, -8(%rbp)
I even tried gcc -c -g -Wa,-a,-ad -O2 test.c to see the C code together with the assembly it was converted to, but the result was no clearer than the previous one.
Can someone briefly explain:
The gcc -S -O2 output.
Whether the loop is optimized as I expected (one sum instead of many sums)?
The compiler is even smarter than that. :)
In fact, it realizes that you aren't using the result of the loop, so it removes the entire loop! All that remains of main is the return; rep ret is simply a two-byte encoding of ret.
This is called Dead Code Elimination.
A better test is to print the result:
#include <stdio.h>
int main(void) {
int i; int count = 0;
for(i = 0; i < 2000000000; i++){
count = count + 1;
}
// Print result to prevent Dead Code Elimination
printf("%d\n", count);
}
EDIT : I've added the required #include <stdio.h>; the MSVC assembly listing corresponds to a version without the #include, but it should be the same.
I don't have GCC in front of me at the moment, since I'm booted into Windows. But here's the disassembly of the version with the printf() on MSVC:
EDIT : I had the wrong assembly output. Here's the correct one.
; 57 : int main(){
$LN8:
sub rsp, 40 ; 00000028H
; 58 :
; 59 :
; 60 : int i; int count = 0;
; 61 : for(i = 0; i < 2000000000; i++){
; 62 : count = count + 1;
; 63 : }
; 64 :
; 65 : // Print result to prevent Dead Code Elimination
; 66 : printf("%d\n",count);
lea rcx, OFFSET FLAT:??_C@_03PMGGPEJJ@?$CFd?6?$AA@
mov edx, 2000000000 ; 77359400H
call QWORD PTR __imp_printf
; 67 :
; 68 :
; 69 :
; 70 :
; 71 : return 0;
xor eax, eax
; 72 : }
add rsp, 40 ; 00000028H
ret 0
So yes, Visual Studio does this optimization. I'd assume GCC probably does too.
And yes, GCC performs a similar optimization. Here's an assembly listing for the same program with gcc -S -O2 test.c (gcc 4.5.2, Ubuntu 11.10, x86):
.file "test.c"
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d\n"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $2000000000, 8(%esp)
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
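For comparison, here is a minimal sketch (not taken from the answers above) of the same counting loop written two ways. The names count_plain and count_volatile are made up for illustration, and what a particular compiler emits for them is an assumption worth checking with gcc -S -O2. An optimizer is free to reduce the first loop to a single value, while the volatile qualifier obliges it to perform every read and write of the counter.

```c
/* Hypothetical helper: only the final value of count is observable, so an
   optimizer may replace the whole loop with a single computed result. */
int count_plain(int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        count = count + 1;
    }
    return count;
}

/* Hypothetical helper: count is volatile, so every read and write of it
   must actually be performed and the loop cannot be folded away. */
int count_volatile(int n)
{
    volatile int count = 0;
    for (int i = 0; i < n; i++) {
        count = count + 1;
    }
    return count;
}
```

Both functions return the same value for the same n; the difference should only show up in the generated assembly and in the running time.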
Compilers have a few tools at their disposal to make code more efficient or more "efficient":
1. If the result of a computation is never used, the code that performs the computation can be omitted (if the computation acted upon volatile values, those values must still be read but the results of the read may be ignored). If the results of the computations that fed it weren't used, the code that performs those can be omitted as well. If such omission makes the code for both paths on a conditional branch identical, the condition may be regarded as unused and omitted. This will have no effect on the behaviors (other than execution time) of any program that doesn't make out-of-bounds memory accesses or invoke what Annex L would call "Critical Undefined Behaviors".
2. If the compiler determines that the machine code that computes a value can only produce results in a certain range, it may omit any conditional tests whose outcome could be predicted on that basis. As above, this will not affect behaviors other than execution time unless code invokes "Critical Undefined Behaviors".
3. If the compiler determines that certain inputs would invoke any form of Undefined Behavior with the code as written, the Standard would allow the compiler to omit any code which would only be relevant when such inputs are received, even if the natural behavior of the execution platform given such inputs would have been benign and the compiler's rewrite would make it dangerous.
Good compilers do #1 and #2. For some reason, however, #3 has become fashionable.
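A small sketch of what the second and third points can look like in source code. The function names are invented for illustration, and whether a given compiler actually performs these folds is an assumption to verify in its assembly output.

```c
/* Point 2: r = x & 7u can only be 0..7, so the compiler may drop the
   comparison and always return 1. */
int low_bits_below_eight(unsigned x)
{
    unsigned r = x & 7u;
    if (r < 8u) {
        return 1;
    }
    return 0; /* never reached for any input */
}

/* Point 3: signed overflow is undefined, so the compiler may assume that
   x + 1 never wraps and fold this to "return 1", even though wrapping
   hardware would yield 0 for x == INT_MAX. */
int next_is_larger(int x)
{
    return x + 1 > x;
}
```

Calling next_is_larger with INT_MAX is itself undefined, which is exactly the case where the folded and the unfolded code can disagree.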
I was putting together a C riddle for a couple of my friends when a friend drew my attention to the fact that the following snippet (which happens to be part of the riddle I'd been writing) ran differently when compiled and run on OSX
#include <stdio.h>
#include <string.h>
int main()
{
int a = 10;
volatile int b = 20;
volatile int c = 30;
int data[3];
memcpy(&data, &a, sizeof(data));
printf("%d %d %d\n", data[0], data[1], data[2]);
}
What you'd expect the output to be is 10 20 30, which happens to be the case under Linux, but when the code is built under OSX you'd get 10 followed by two random numbers. After some debugging and looking at the compiler-generated assembly I came to the conclusion that this is due to how the stack is built. I am by no means an assembly expert, but the assembly code generated on Linux seems pretty straightforward to understand while the one generated on OSX threw me off a little. Perhaps I could use some help from here.
This is the code that was generated on Linux:
.file "code.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.section .text.startup,"ax",@progbits
.LHOTB0:
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB23:
.cfi_startproc
movl $10, -12(%rsp)
xorl %eax, %eax
movl $20, -8(%rsp)
movl $30, -4(%rsp)
ret
.cfi_endproc
.LFE23:
.size main, .-main
.section .text.unlikely
.LCOLDE0:
.section .text.startup
.LHOTE0:
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
And this is the code that was generated on OSX:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 12
.globl _main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl $20, -8(%rbp)
movl $30, -4(%rbp)
leaq L_.str(%rip), %rdi
movl $10, %esi
xorl %eax, %eax
callq _printf
xorl %eax, %eax
addq $16, %rsp
popq %rbp
retq
.cfi_endproc
.section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
.asciz "%d %d %d\n"
.subsections_via_symbols
I'm really only interested in two questions here.
Why is this happening?
Are there any workarounds for this issue?
I know this is not a practical way to utilize the stack as I'm a professional C developer, which is really the only reason I found this problem interesting to invest some of my time into.
Accessing memory past the end of a declared variable is undefined behaviour - there is no guarantee as to what will happen when you try to do that. Because of how the compiler generated the assembly under Linux, you happened to get the 3 variables directly in a row on the stack, however that behaviour is just a coincidence - the compiler could legally add extra data in between the variables on the stack or really do anything - the result is not defined by the language standard. So in answer to your first question, it's happening because what you're doing is not part of the language by design. In answer to your second, there's no way to reliably get the same result from multiple compilers because the compilers are not programmed to reliably reproduce undefined behaviour.
Undefined behavior. You don't expect to copy 10, 20, 30. You hope not to segfault.
There is nothing to guarantee that a, b, and c are at sequential memory addresses, which is your naive assumption. On Linux, the compiler happened to make them sequential. You can't even rely on gcc always doing that.
You already know that the behavior is undefined. A good reason for the behavior to differ between OS X and Linux is that these systems use different compilers, which generate different code:
When you run gcc on Linux, you invoke the installed version of the GNU C compiler.
When you run gcc on your version of OS X, you most likely invoke the installed version of clang.
Try gcc --version on both systems and amaze your friends.
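This does not reproduce the riddle, which depends on the undefined behavior, but if the goal is simply three adjacent ints, a sketch like the following gets them in a defined way. The name copy_three is made up for illustration; the point is that array elements are guaranteed to be contiguous, whereas separate local variables are not.

```c
#include <string.h>

/* Hypothetical helper: fills out[0..2] with 10, 20, 30. */
void copy_three(int out[3])
{
    /* Elements of one array are laid out contiguously, so copying
       sizeof src bytes stays inside a single object. */
    const int src[3] = {10, 20, 30};
    memcpy(out, src, sizeof src);
}
```

With this version the result is the same under gcc and clang, because nothing reads past the end of an object.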
I am using GCC in 32-bit mode on a Windows 7 machine under cygwin. I have the following function:
unsigned f1(unsigned x, unsigned y)
{
return x*y;
}
I want the code to do an unsigned multiply, and as such I would expect it to generate the mul instruction, not the imul instruction. I compile the program with the following command:
gcc -m32 -S t4.c
The generated assembly code is:
.file "t4.c"
.text
.globl _f1
.def _f1; .scl 2; .type 32; .endef
_f1:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
imull 12(%ebp), %eax
popl %ebp
ret
.ident "GCC: (GNU) 4.8.2"
I believe that the generated code has the wrong multiply instruction in it but I find it hard to believe that GCC has such a simple bug. Please comment.
The compiler relies on the "as-if" rule: No standard conforming program can detect a difference between what this program does and what the program should do, since the lowest 32 bits of the result are the same for both instructions.
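A minimal sketch of why the low 32 bits agree, written in C rather than assembly. The function names are made up for illustration, and the cast from uint32_t to int32_t assumes the usual two's-complement wraparound, which is what gcc does on x86.

```c
#include <stdint.h>

/* Low 32 bits of the full unsigned 64-bit product. */
uint32_t low32_unsigned(uint32_t x, uint32_t y)
{
    return (uint32_t)((uint64_t)x * (uint64_t)y);
}

/* Low 32 bits of the full signed 64-bit product of the same bit patterns.
   Assumption: converting uint32_t to int32_t reinterprets the bits. */
uint32_t low32_signed(uint32_t x, uint32_t y)
{
    int64_t wide = (int64_t)(int32_t)x * (int64_t)(int32_t)y;
    return (uint32_t)wide;
}
```

The two functions differ only in the upper half of the wide product, which f1 discards because its return type is a 32-bit unsigned.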
When compiling a simple function that does not even alter the ebp register, GCC still saves the value at the start of the function and restores it at the end:
#add.c
int add( int a, int b )
{
return ( a + b );
}
gcc -c -S -m32 -O3 add.c -o add.S
#add.S
.file "add.c"
.text
.p2align 4,,15
.globl add
.type add, @function
add:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
addl 8(%ebp), %eax
popl %ebp
ret
.size add, .-add
.ident "GCC: (GNU) 4.4.6"
.section .note.GNU-stack,"",@progbits
It would seem like a simple optimisation to leave ebp untouched, calculate offsets relative to esp and save 3 instructions.
Why does GCC not do this?
Thanks,
Andrew
Tools such as debuggers and stack walkers used to expect code to have a prologue that constructed a frame pointer, and couldn't understand code that didn't have it. Over time, the restriction has been removed.
The compiler itself has no difficulty generating code without a frame pointer, and you can ask for it to be removed with -fomit-frame-pointer. I believe that recent versions of gcc (~4.8) and gcc on x86-64 omit the frame pointer by default.
(C) Suppose a function contains an if that returns one value when the condition is true and a different value otherwise. Is it more or less efficient to use an else, or not to bother?
i.e. ...
int foo (int a) {
if ((a > 0) && (a < SOME_LIMIT)) {
b = a //maybe b is some global or something
return 0;
} else {
return 1;
}
return 0;
}
or just
int foo (int a) {
if ((a > 0) && (a < SOME_LIMIT)) {
b = a //maybe b is some global or something
return 0;
}
return 1;
}
Assuming GCC, will the first implementation result in compiled code that is any different from the second one?
I need to be as efficient as possible here, so a possible reduction of a branch for the else would be nice, but stylistically my inner OCD doesn't like to see a return that isn't 0 or void as the last instruction in a function, as it's less clear what is going on. So if the else will be optimized away anyway, then I could leave it there...
You can run gcc with the -O3 -S options to generate optimized assembly code, so you can see (and compare) the optimized assembly. I made the following changes to your sources to make them compile.
File a.c:
int b;
int foo (int a) {
if ((a > 0) && (a < 5000)) {
b = a;
return 0;
} else {
return 1;
}
return 0;
}
File b.c:
int b;
int foo (int a) {
if ((a > 0) && (a < 5000)) {
b = a;
return 0;
}
return 1;
}
When compiling a.c with gcc -O3 -S a.c the file a.s is created. On my machine it looks like this:
.file "a.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movl 4(%esp), %edx
movl $1, %eax
leal -1(%edx), %ecx
cmpl $4998, %ecx
ja .L2
movl %edx, b
xorb %al, %al
.L2:
rep
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.comm b,4,4
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
When compiling b.c with gcc -O3 -S b.c the file b.s is created. On my machine it looks like this:
.file "b.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movl 4(%esp), %edx
movl $1, %eax
leal -1(%edx), %ecx
cmpl $4998, %ecx
ja .L2
movl %edx, b
xorb %al, %al
.L2:
rep
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.comm b,4,4
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
Notice that the assembled implementations of foo: are identical. So, in this case, with this version of GCC, it does not matter which way you write the code.
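The leal -1 / cmpl $4998 / ja sequence in both listings is the two-sided test folded into a single unsigned comparison. A small sketch of that trick in C; the function names are made up here, and 5000 is the limit used in a.c and b.c.

```c
/* The condition as written in the source: two signed comparisons. */
int in_range_two_tests(int a)
{
    return (a > 0) && (a < 5000);
}

/* The same condition as one unsigned comparison: subtracting 1 maps
   1..4999 onto 0..4998, while 0 and negative values wrap around to
   very large unsigned numbers and fail the test. */
int in_range_one_test(int a)
{
    return ((unsigned)a - 1u) <= 4998u;
}
```

Both functions return 1 for exactly the values 1 through 4999, which is why one compare and one branch are enough.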
Compare the object files of the two implementations. Put an assembly label such as
PortionInQuestion:
in the code. It will then show up in your assembly file as a label, and you can see how the generated assembly differs. The two might not differ at all (because of the optimizations), or they could be completely different. Without seeing the raw assembly, there's no way to tell how the compiler is optimizing it.
Of course it depends on the compiler. I guess every (decent) compiler on the market will produce exactly the same output in both cases. However, trying such micro-optimizations is discouraged by any book about optimization!
Choose the form which is more readable.
I would write it like this...
int foo (int a) {
if ((a > 0) && (a < SOME_LIMIT)) {
b = a; //maybe b is some global or something
return 0;
} else {
return 1;
}
}
The whole function is just a boolean check that the value assigned to b is greater than zero and less than some constant, and the if statement decides the return value. There is no need to add a default return to the function: both branches already return, so a trailing return can never be reached.
You won't get to the last return 0; in the first example. I would say your second one is stylistically clearer at the very least because of that. Less code for the same thing is usually a good thing.
With regard to performance, you can inspect the generated assembly if you fancy such a thing, or profile the code and see if there's an actual difference. My bet is none that matters.
Finally, if your compiler supports optimisation flags, use them!
int foo (int a) {
/* Nothing to do: get out of here */
if (a <= 0 || a >= SOME_LIMIT) return 1;
b = a; // maybe b is some global or something
return 0;
}
There are barely any differences with respect to efficiency (the costliest part is the function call plus the return anyway).
For human readers, the least "indented" code (such as the above) is the easiest to read and understand.
BTW
The generated assembler also looks pretty minimal to me, and completely equivalent to the if (a >0 && a < LIMIT) form.
.file "return.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
leal -1(%rdi), %edx
movl $1, %eax
cmpl $2998, %edx
ja .L2
movl %edi, b(%rip)
xorb %al, %al
.L2:
rep
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.globl b
.bss
.align 16
.type b, @object
.size b, 4
b:
.zero 4
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
The C function backtrace just returns the series of function calls for the program, but I want to list all the local variables in my program, just like info locals in gdb. Any idea if this can be done? Thanks
Generally, no. You should move away from thinking about a "stack" as some sort of god given factum. A call stack is merely a common implementation technique for C. It has no intrinsic meaning or required semantics. Automatic variables ("local variables", as you say) have to behave in a certain way, and sometimes that means that they are written onto the call stack. However, it is entirely conceivable that local variables are never realized in memory at all -- they may instead only ever be stored in a processor register, or eliminated entirely if an equivalent program can be formulated without them.
So, no, there is no language-intrinsic mechanism for enumerating local variables. As you say, the debugger can do so to some extent (depending on debug symbols being present and subject to optimizations); perhaps you can find a library that can process debug symbols from within a running program.
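Since the language offers no way to enumerate locals, the usual fallback is to name the variables of interest by hand. A minimal sketch, assuming a made-up macro called FORMAT_INT that only handles int variables; it uses the preprocessor's # operator to turn the variable name into a string.

```c
#include <stdio.h>

/* Hypothetical macro: writes "name = value" for an int variable into buf
   and evaluates to snprintf's return value (the formatted length). */
#define FORMAT_INT(buf, size, var) \
    snprintf((buf), (size), "%s = %d", #var, (var))
```

For example, with int x = 7; and char buf[32];, the call FORMAT_INT(buf, sizeof buf, x) leaves the text x = 7 in buf. This only covers variables you list explicitly; it is not a substitute for what a debugger reads from debug symbols.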
If this is just for occasional debugging, then you can invoke the debugger. However, since the debugger itself will freeze your program, you need an intermediary to capture the output. You can, for example, use system, and redirect the output to a file, then read the file afterwards. In the example below, the file gdbcmds.txt contains the line info locals.
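/* Needs <stdio.h>, <stdlib.h> (system) and <unistd.h> (getpid). */
char buf[512];
FILE *gdb;
/* Attach gdb to this process in batch mode and save its output to a file. */
snprintf(buf, sizeof(buf), "gdb -batch -x gdbcmds.txt -p %d > gdbout.txt",
(int)getpid());
system(buf);
gdb = fopen("gdbout.txt", "r");
if (gdb != NULL) { /* skip the read if no output file was produced */
while (fgets(buf, sizeof(buf), gdb) != NULL) {
printf("%s", buf);
}
fclose(gdb);
}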
First, note that backtrace is not a standard C library function, but a GNU-specific extension.
In general, it's difficult to impossible to retrieve local variable information from compiled code, especially if it was compiled without debugging or with optimization enabled. If debugging isn't turned on, variable names and types are generally not preserved in the resulting machine code.
For example, take the following ridiculously simple code:
#include <stdio.h>
#include <math.h>
int main(void)
{
int x = 1, y = 2, z;
z = 2 * y - x;
printf("x = %d, y = %d, z = %d\n", x, y, z);
return 0;
}
Here's the resulting assembly, with no debugging information or optimization:
.file "varinfo.c"
.version "01.01"
gcc2_compiled.:
.section .rodata
.LC0:
.string "x = %d, y = %d, z = %d\n"
.text
.align 4
.globl main
.type main,@function
main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $1, -4(%ebp)
movl $2, -8(%ebp)
movl -8(%ebp), %eax
movl %eax, %eax
sall $1, %eax
subl -4(%ebp), %eax
movl %eax, -12(%ebp)
pushl -12(%ebp)
pushl -8(%ebp)
pushl -4(%ebp)
pushl $.LC0
call printf
addl $16, %esp
movl $0, %eax
leave
ret
.Lfe1:
.size main,.Lfe1-main
.ident "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)"
x, y, and z are referred to through -4(%ebp), -8(%ebp), and -12(%ebp) respectively. There's nothing to indicate that they're integers other than the instructions used to perform the arithmetic.
It's even better with optimization (-O1) turned on:
.file "varinfo.c"
.version "01.01"
gcc2_compiled.:
.section .rodata.str1.1,"ams",@progbits,1
.LC0:
.string "x = %d, y = %d, z = %d\n"
.text
.align 4
.globl main
.type main,@function
main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
pushl $3
pushl $2
pushl $1
pushl $.LC0
call printf
movl $0, %eax
leave
ret
.Lfe1:
.size main,.Lfe1-main
.ident "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)"
In this case, the compiler was able to do some static analysis and compute the value z at compile time; there's no need to set aside any memory for any of the variables at all, because the compiler already knows what those values have to be.