x86 calling convention: should arguments passed by stack be read-only?

It seems that state-of-the-art compilers treat arguments passed on the stack as read-only. Note that in the x86 calling convention, the caller pushes arguments onto the stack and the callee uses the arguments on the stack. For example, the following C code:
extern int goo(int *x);
int foo(int x, int y) {
goo(&x);
return x;
}
is compiled by clang -O3 -c g.c -S -m32 in OS X 10.10 into:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## @foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
leal -4(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl -4(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
Here, the parameter x (at 8(%ebp)) is first loaded into %eax and then stored in -4(%ebp); the address of -4(%ebp) is then loaded into %eax, and %eax is passed to the function goo.
I wonder why Clang generates code that copies the value stored in 8(%ebp) to -4(%ebp), rather than just passing the address of 8(%ebp) to the function goo. That would save memory operations and result in better performance. I observed similar behaviour in GCC too (under OS X). To be more specific, I wonder why compilers do not generate:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## @foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl 8(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
I searched for documentation on whether the x86 calling convention requires the passed arguments to be read-only, but I couldn't find anything on the issue. Does anybody have any thoughts on this?

The rules for C are that parameters must be passed by value. A compiler converts from one language (with one set of rules) to a different language (potentially with a completely different set of rules). The only limitation is that the behaviour remains the same. The rules of the C language do not apply to the target language (e.g. assembly).
What this means is that if a compiler feels like generating assembly language where parameters are passed by reference and are not passed by value; then this is perfectly legal (as long as the behaviour remains the same).
The real limitation has nothing to do with C at all. The real limitation is linking. So that different object files can be linked together, standards are needed to ensure that whatever the caller in one object file expects matches whatever the callee in another object file provides. This is what's known as the ABI. In some cases (e.g. 64-bit 80x86) there are multiple different ABIs for the exact same architecture.
You can even invent your own ABI that's radically different (and implement your own tools that support your own radically different ABI) and that's perfectly legal as far as the C standards go; even if your ABI requires "pass by reference" for everything (as long as the behaviour remains the same).

Actually, I just compiled this function using GCC:
int foo(int x)
{
goo(&x);
return x;
}
And it generated this code:
_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
call _goo
movl 8(%ebp), %eax
leave
ret
This is using GCC 4.9.2 (on 32-bit cygwin if it matters), no optimizations. So in fact, GCC did exactly what you thought it should do and used the argument directly from where the caller pushed it on the stack.

The C programming language mandates that arguments are passed by value. So any modification of an argument (like an x++; as the first statement of your foo) is local to the function and does not propagate to the caller.
Hence, a general calling convention should require copying of arguments at every call site. Calling conventions should be general enough for unknown calls, e.g. through a function pointer!
Of course, if you pass an address to some memory zone, the called function is free to dereference that pointer, e.g. as in
int goo(int *x) {
static int count;
*x = count++;
return count % 3;
}
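To make the two cases concrete, here is a minimal sketch. The names bump_through_pointer, callee and caller_copy_is_untouched are hypothetical and do not come from the question. The sketch assumes nothing about the ABI; it only shows what the C language guarantees: the callee may modify its own parameter, even through a pointer to it, and the caller's variable does not change.

```c
/* Hypothetical helper: writes through the pointer it receives. */
void bump_through_pointer(int *p) {
    *p += 10;
}

/* The parameter x is the callee's own object. Taking its address and
   modifying it must not be visible in the caller's variable. */
int callee(int x) {
    bump_through_pointer(&x);
    return x;
}

/* Returns the caller's variable after the call, to show it is unchanged. */
int caller_copy_is_untouched(void) {
    int a = 1;
    (void)callee(a);
    return a;
}
```

Whether the compiler achieves this by reusing the incoming stack slot or by copying it to a fresh one is left to the compiler and the ABI, as the answers above describe.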
BTW, you might use link-time optimizations (by compiling and linking with clang -flto -O2 or gcc -flto -O2) to perhaps enable the compiler to improve or inline some calls between translation units.
Notice that both Clang/LLVM and GCC are free software compilers. Feel free to propose an improvement patch to them if you want to (but since both are very complex pieces of software, you'll need to work some months to make that patch).
NB. When looking into produced assembly code, pass -fverbose-asm to your compiler!

Related

The assembly of “b++”

In the C language, what is the assembly for "b++"?
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
The .s file (++i uses the addl instruction, i++ uses the other sequence):
.file "main.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",@progbits

The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line: short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels), there is no single answer.

The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then ret = i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
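The difference in what each form has to keep around can be seen at the C level. The following is a small sketch with hypothetical names (inc_state, post_increment, pre_increment). It says nothing about which instructions a particular compiler will pick; it only shows which value ends up in ret.

```c
/* Pair of values left behind by one increment statement. */
struct inc_state {
    int ret; /* value assigned from the increment expression */
    int i;   /* value of the variable afterwards */
};

struct inc_state post_increment(int i) {
    struct inc_state s;
    s.ret = i++; /* ret gets the old value, so the old value must be kept */
    s.i = i;
    return s;
}

struct inc_state pre_increment(int i) {
    struct inc_state s;
    s.ret = ++i; /* ret gets the new value; no copy of the old one is needed */
    s.i = i;
    return s;
}
```

Starting from i = 2, post_increment leaves ret at 2 and i at 3, while pre_increment applied to 3 leaves both at 4, which matches the final value of ret in the question's main.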
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand, mixing an opcode from one line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using ret = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)

Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.
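As a small illustration of why that substitution is safe, here is a sketch with two hypothetical functions that differ only in the form of the loop increment. Because the value of the increment expression is discarded in both, the C language gives them identical behaviour, so a compiler is free to emit the same code for both.

```c
/* Sums 0..n-1; the result of i++ is discarded. */
int sum_with_post_increment(int n) {
    int total = 0;
    int i;
    for (i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* Same loop; the result of ++i is discarded as well. */
int sum_with_pre_increment(int n) {
    int total = 0;
    int i;
    for (i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}
```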

It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang and the MacPorts clang-6.0 set up a basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.

Motivation for useless prologue in gcc-compiled main(), disabling it?

Given the following minimal test case:
void exit(int);
int main() {
exit(0);
}
GCC 4.9 and later with 32-bit x86 target produces something like:
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
subl $12, %esp
pushl $0
call exit
Note the convoluted stack-realignment code. With the function renamed to anything but main, however, it gives the (much more reasonable):
xmain:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
subl $12, %esp
pushl $0
call exit
The differences are even more pronounced with -O. As main nothing changes; renamed, it yields:
xmain:
subl $24, %esp
pushl $0
call exit
The above was noticed in answering this question:
How do i get rid of call __x86.get_pc_thunk.ax
Is this behavior (and its motivation) documented anywhere, and is there any way to suppress it? GCC has x86 target-specific options to set the preferred/assumed incoming and outgoing stack alignment and enable/disable realignment for arbitrary functions, but they don't seem to be honored for main.

This answer is based on source diving. I do not know what the developers' intentions or motivations were. All of the code involved seems to date to 2008ish, which is after my own time working on GCC, but long enough ago that people's memories have probably gotten fuzzy. (GCC 4.9 was released in 2014; did you go back any farther than that? If I'm right about when this code was introduced, the clumsy stack alignment for main should start happening in version 4.4.)
GCC's x86 back end appears to have been coded to make extra-conservative assumptions about the stack alignment on entry to main, regardless of command-line options. The function ix86_minimum_incoming_stack_boundary is called to compute the expected stack alignment on entry for each function, and the last thing it does ...
  /* Stack at entrance of main is aligned by runtime. We use the
     smallest incoming stack boundary. */
  if (incoming_stack_boundary > MAIN_STACK_BOUNDARY
      && DECL_NAME (current_function_decl)
      && MAIN_NAME_P (DECL_NAME (current_function_decl))
      && DECL_FILE_SCOPE_P (current_function_decl))
    incoming_stack_boundary = MAIN_STACK_BOUNDARY;
  return incoming_stack_boundary;
... is override the expected stack alignment to a conservative constant, MAIN_STACK_BOUNDARY, if the function being compiled is main. MAIN_STACK_BOUNDARY is 128 (bits) when compiling 64-bit code and 32 when compiling 32-bit code. As far as I can tell, there is no command-line knob that will make it expect the stack to be more aligned than that on entry to main. I can persuade it to skip stack alignment for main by telling it that no additional alignment is needed: compiling your test program with -m32 -mpreferred-stack-boundary=2 gives me
main:
pushl $0
call exit
with GCC 7.3.
The write-only manipulations of %ecx appear to be a missed-optimization bug. They are coming from this part of ix86_expand_prologue:
  /* Grab the argument pointer. */
  t = plus_constant (Pmode, stack_pointer_rtx, m->fs.sp_offset);
  insn = emit_insn (gen_rtx_SET (crtl->drap_reg, t));
  RTX_FRAME_RELATED_P (insn) = 1;
  m->fs.cfa_reg = crtl->drap_reg;
  m->fs.cfa_offset = 0;
  /* Align the stack. */
  insn = emit_insn (ix86_gen_andsp (stack_pointer_rtx,
                                    stack_pointer_rtx,
                                    GEN_INT (-align_bytes)));
  RTX_FRAME_RELATED_P (insn) = 1;
The intention is to save a pointer to the incoming argument area before realigning the stack, so that it is straightforward to access arguments. Either because this happens fairly late in the pipeline (after register allocation), or because the instructions are marked FRAME_RELATED, nothing manages to delete those instructions again when they turn out to be unnecessary.
I imagine the GCC devs would at least listen to a bug report about this, but they might reasonably consider it low priority, because these are instructions that are executed only once in the lifetime of the whole program, they're only actually dead when main doesn't use its arguments, and they only happen in the traditional 32-bit ABI, which I have the impression is considered a second-class target nowadays.

main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
The above section replicates the invoking stack frame, which (since you haven't defined any arguments to main()) consists of just the return address -4(%ecx) and the frame pointer, onto a 16-byte-aligned stack; my WAG is that this is to accommodate runtimes (crt0.s) that do not align the stack properly.
The push %ebp was a bit of a giveaway -- it establishes a consistent looking backtrace through crt0.s despite this trampoline.
This is just a ‘normal’ call of exit, with the stack properly aligned...
subl $12, %esp
pushl $0
call exit

Bootloader - Display String Runtime Error

I am writing my first "hello world" bootloader program. I found an article on the CodeProject website. Here is the link:
http://www.codeproject.com/Articles/664165/Writing-a-boot-loader-in-Assembly-and-C-Part
Up to the assembly-level programming it was going well, but when I wrote the program in C, the same as given in the article, I faced a runtime error.
Code written in my .c file is as below.
__asm__(".code16\n");
__asm__("jmpl $0x0000,$main\n");
void printstring(const char* pstr)
{
while(*pstr)
{
__asm__ __volatile__("int $0x10": :"a"(0x0e00|*pstr),"b"(0x0007));
++pstr;
}
}
void main()
{
printstring("Akatsuki9");
}
I created an image file, floppy.img, and checked the output using bochs.
It displayed something like this:
Booting from floppy...
S
It should be Akatsuki9. I don't know where I made a mistake. Can anyone help me find out why I am facing this runtime error?

Brief Answer: The problem is with gcc (in fact, this specific application of generated code) and not with the C program itself. It's hidden in the assembly code.
Long Answer: A longer (more elaborate) explanation with specific details of the problem:
(It would be helpful to have the assembly code. It can be obtained using the -S switch of gcc, or you can use the one that I got from gcc; I've attached it at the end.) If you don't already know about opcode prefixes, C parameter passing in assembly, etc., then have a look at the following background information section. Looking at the assembly source, it's evident that it's 32-bit code. With '.code16', gas assembles this 32-bit code for a processor running in 16-bit mode by adding operand-size prefixes. When the resulting code runs in real (i.e. 16-bit) mode, the prefixed instructions still execute as 32-bit operations. This by itself is not an issue on 80386 and later processors. The problem occurs because gcc calculates stack offsets assuming the processor is in 32-bit mode, which is not the case (by default) when executing boot code.
Some background information (experienced assembly language programmers should skip this):
1. Operand-size prefix: In x86, prefix bytes (0x66, 0x67, etc.) are used to obtain variants of an instruction. 0x66 is the operand-size prefix to obtain instruction for non-default operand size; gas uses this technique to produce code for '.code16'. For example, in real (i.e. 16bit) mode, 89 D8 corresponds to movw %bx,%ax while 66 89 D8 corresponds to movl %ebx,%eax. This relationship gets reversed in 32bit mode.
2. parameter passing in C: Parameters are passed on stack and accessed through the EBP register.
3. Call instruction: Call is a branching operation with the next instruction's address saved on stack (for resuming). near Call saves only the IP (when in 16bit mode) or EIP ( when in 32bit mode). far Call saves the CS (code-segment register) along with IP/EIP.
4. Push operation: Saves the value on stack. The size of object is subtracted from ESP.
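Items 1 and 3 above can be summarised in a few lines of C. This is a simplified model with hypothetical function names, covering only the 16-bit and 32-bit modes discussed here: the 0x66 prefix selects the non-default operand size, and a near call pushes a return address of the size in effect for that call.

```c
#include <stdbool.h>

/* Effective operand size in bits, given the mode's default operand size
   (16 or 32) and whether a 0x66 operand-size prefix is present. */
int effective_operand_bits(int default_bits, bool has_66_prefix) {
    if (!has_66_prefix) {
        return default_bits;
    }
    return default_bits == 16 ? 32 : 16; /* the prefix flips the size */
}

/* Bytes pushed by a near call: IP (2 bytes) or EIP (4 bytes). */
int near_call_pushed_bytes(int operand_bits) {
    return operand_bits / 8;
}
```

For example, in 16-bit mode an unprefixed near call pushes 2 bytes, while a pushl carrying the 0x66 prefix still moves 4 bytes.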
Exact problem
We start at the
movl %esp, %ebp in main: {{ %ebp is set equal to %esp }}
pushl $.LC0 subtracts 4 from Stack Pointer {{ .LC0 addresses the char* "Akatsuki9"; it is getting saved on stack (to be accessed by printstring function) }}
call printstring subtracts 2 from Stack Pointer (16bit Mode; IP is 2bytes)
pushl %ebp in printstring: {{ 4 is subtracted from %esp }}
movl %esp, %ebp {{ %ebp and %esp are currently at 2+4(=6) bytes from the char *pstr }}
pushl %ebx changes %esp but not %ebp
movl 8(%ebp), %edx {{ Accessing 'pstr' at %ebp+8 ??? }}
Accessing 'pstr' at %ebp+8 instead of %ebp+6 (gcc had calculated an offset of 8, assuming 32bit EIP); the program has just obtained an invalid pointer and it's going to cause problem when the program dereferences it later: movsbl (%edx), %eax.
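The mismatch described above is plain arithmetic, which the following sketch spells out. The function name first_arg_offset is hypothetical; it assumes the usual prologue in which call pushes the return address and the callee then pushes the saved frame pointer before copying the stack pointer into the frame pointer.

```c
/* Offset from the frame pointer to the first stack argument:
   the saved frame pointer sits at offset 0, the return address above it,
   and the first argument above that. */
int first_arg_offset(int saved_frame_pointer_bytes, int return_address_bytes) {
    return saved_frame_pointer_bytes + return_address_bytes;
}
```

With a 4-byte saved %ebp, gcc's assumption of a 4-byte EIP gives offset 8, whereas the 2-byte IP actually pushed in 16-bit mode puts pstr at offset 6.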
Fix
As of now I don't know of a good fix for this that will work with gcc. For writing boot-sector code, a native 16bit code-generator, I think, is more effective (size-limit & other quirks as explained above). If you insist on using gcc which currently only generates code for 32bit mode, the fix would be to avoid passing function parameters. For more information, refer to the gcc and gas manuals. And please let me know if there is a workaround or some option that works with gcc.
EDIT
I have found a fix for the program to make it work for the desired purpose while still using gcc. Kinda hackish & clearly not-recommended. Why post then? Well, sort of proof of concept. Here it is: (just replace your printstring function with this one)
void printstring(const char* pstr)
{
const char *hackPtr = *(const char**)((char *)&pstr-2);
while(*hackPtr)
{
__asm__ __volatile__("int $0x10": :"a"(0x0e00|*hackPtr),"b"(0x0007));
++hackPtr;
}
}
I invite @Akatsuki and others (interested) to verify that it works. From my above answer and the added C-pointer arithmetic, you can see why it should.
My Assembly-Source file
.file "bootl.c"
#APP
.code16
jmpl $0x0000,$main
#NO_APP
.text
.globl printstring
.type printstring, @function
printstring:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
pushl %ebx
.cfi_offset 3, -12
movl 8(%ebp), %edx
movl $7, %ebx
.L2:
movsbl (%edx), %eax
testb %al, %al
je .L6
orb $14, %ah
#APP
# 8 "bootl.c" 1
int $0x10
# 0 "" 2
#NO_APP
incl %edx
jmp .L2
.L6:
popl %ebx
.cfi_restore 3
popl %ebp
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size printstring, .-printstring
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "Akatsuki9"
.section .text.startup,"ax",@progbits
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
pushl $.LC0
call printstring
popl %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits

I had the same problem and found a solution that may work for you. It works on emulators (I tested on bochs and qemu), but I couldn't make it work on real hardware.
Solution
One thing is to use gcc-4.9.2, and to change the code generation to .code16gcc.
Thus, your code becomes:
__asm__(".code16gcc\n");
__asm__("jmpl $0x0000,$main\n");
void printstring(const char* pstr)
{
while(*pstr)
{
__asm__ __volatile__("int $0x10": :"a"(0x0e00|*pstr),"b"(0x0007));
++pstr;
}
}
void main()
{
printstring("Akatsuki9");
}
To compile it, use the -m16 flag with gcc; in my case I tried
gcc -c -m16 file.c
Note that you can change the architecture according to your needs, by setting -march. Or if you want to keep the flags of the tutorial
gcc -c -g -Os -march=i386 -ffreestanding -Wall -Werror -m16 file.c
tl;dr
Set .code16gcc instead of .code16, and use -m16 with gcc-4.9.2.

Can/do C compilers optimize out address-of in inline functions?

Let's say I have following code:
int f() {
int foo = 0;
int bar = 0;
foo++;
bar++;
// many more repeated operations in actual code
foo++;
bar++;
return foo+bar;
}
Abstracting the repeated code into a separate function, we get
static void change_locals(int *foo_p, int *bar_p) {
*foo_p++;
*bar_p++;
}
int f() {
int foo = 0;
int bar = 0;
change_locals(&foo, &bar);
change_locals(&foo, &bar);
return foo+bar;
}
I'd expect the compiler to inline the change_locals function, and optimize things like *(&foo)++ in the resulting code to foo++.
If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function? With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?
I am particularly interested in behavior of GCC and MSVC compilers.

inline (and all its cousins _inline, __inline...) is at most a hint to gcc. It may inline anything it decides is an advantage, except at lower optimization levels.
The code produced by gcc -O3 for x86 is:
.text
.p2align 4,,15
.globl f
.type f, @function
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
.ident "GCC: (GNU) 4.4.4 20100630 (Red Hat 4.4.4-10)"
It returns zero because *ptr++ doesn't do what you think. Correcting the increments to:
(*foo_p)++;
(*bar_p)++;
results in
.text
.p2align 4,,15
.globl f
.type f, @function
f:
pushl %ebp
movl $4, %eax
movl %esp, %ebp
popl %ebp
ret
So it directly returns 4. Not only did it inline them, but it optimized the calculations away.
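The precedence point behind the zero result can be checked in isolation. In the sketch below (function names are hypothetical), *foo_p++ parses as *(foo_p++): it advances the local pointer copy and leaves the pointed-to int alone, whereas (*foo_p)++ increments the int. The (void) casts are only there to silence unused-value warnings.

```c
/* As written in the question: only the local pointer copies change. */
void change_locals_as_written(int *foo_p, int *bar_p) {
    (void)*foo_p++;
    (void)*bar_p++;
}

/* With parentheses: the pointed-to ints are incremented. */
void change_locals_fixed(int *foo_p, int *bar_p) {
    (*foo_p)++;
    (*bar_p)++;
}

int f_as_written(void) {
    int foo = 0;
    int bar = 0;
    change_locals_as_written(&foo, &bar);
    change_locals_as_written(&foo, &bar);
    return foo + bar; /* both still 0 */
}

int f_fixed(void) {
    int foo = 0;
    int bar = 0;
    change_locals_fixed(&foo, &bar);
    change_locals_fixed(&foo, &bar);
    return foo + bar; /* 2 + 2 */
}
```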
VC++ from VS 2005 produces similar code, but it also created unreachable code for change_locals(). I used the command line
/O2 /FD /EHsc /MD /FA /c /TP

"If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function?"
The general answer is that if the compiler can ensure that no one else will change a value behind its back, it can safely be placed in a register.
Think of this as though the compiler first performs inlining, then transforms all those *&foo (which result from the inlining) to simply foo before deciding if they should be placed in registers or in memory on the stack.
"With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?"
Again, generally speaking, whether or not a compiler decides to inline something is done using heuristics. If you explicitly specify that you want something to be inlined, the compiler will probably weigh this in its decision process.

I've tested gcc 4.5, MSC and IntelC using this:
#include <stdio.h>
void change_locals(int *foo_p, int *bar_p) {
(*foo_p)++;
(*bar_p)++;
}
int main() {
int foo = printf("");
int bar = printf("");
change_locals(&foo, &bar);
change_locals(&foo, &bar);
printf( "%i\n", foo+bar );
}
And all of them did inline/optimize the foo+bar value, but also did
generate the code for change_locals() (but didn't use it).
Unfortunately, there's still no guarantee that they'd do the same for
any kind of such a "local function".
gcc:
__Z13change_localsPiS_:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
incl (%edx)
incl (%eax)
leave
ret
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
pushl %ebx
subl $28, %esp
call ___main
movl $LC0, (%esp)
call _printf
movl %eax, %ebx
movl $LC0, (%esp)
call _printf
leal 4(%ebx,%eax), %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
addl $28, %esp
popl %ebx
leave
ret

Decoding equivalent assembly code of C code

Wanting to see the output of the compiler (in assembly) for some C code, I wrote a simple program in C and generated its assembly file using gcc.
The code is this:
#include <stdio.h>
int main()
{
int i = 0;
if ( i == 0 )
{
printf("testing\n");
}
return 0;
}
The generated assembly for it is here (only the main function):
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
leave
ret
I am at an absolute loss to correlate the C code and assembly code. All that the code has to do is store 0 in a register and compare it with a constant 0 and take suitable action. But what is going on in the assembly?

Since main is special, you can often get better results by doing this type of thing in another function (preferably in its own file with no main). For example:
#include <stdio.h>
void foo(int x) {
if (x == 0) {
printf("testing\n");
}
}
would probably be much more clear as assembly. Doing this would also allow you to compile with optimizations and still observe the conditional behavior. If you were to compile your original program with any optimization level above 0 it would probably do away with the comparison since the compiler could go ahead and calculate the result of that. With this code part of the comparison is hidden from the compiler (in the parameter x) so the compiler can't do this optimization.
What the extra stuff actually is
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
This is setting up a stack frame for the current function. In x86 a stack frame is the area between the stack pointer's value (SP, ESP, or RSP for 16, 32, or 64 bit) and the base pointer's value (BP, EBP, or RBP). This is supposedly where local variables live, but not really, and explicit stack frames are optional in most cases. The use of alloca and/or variable length arrays would require their use, though.
This particular stack frame construction is different than for non-main functions because it also makes sure that the stack is 16 byte aligned. The subtraction from ESP increases the stack size by more than enough to hold local variables and the andl effectively subtracts from 0 to 15 from it, making it 16 byte aligned. This alignment seems excessive except that it would force the stack to also start out cache aligned as well as word aligned.
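The effect of andl $-16, %esp described above can be mimicked in C on a plain integer. This is only a sketch: align_down_16 and bytes_dropped are hypothetical names, and a uint32_t stands in for the 32-bit stack pointer.

```c
#include <stdint.h>

/* Rounds addr down to a multiple of 16, as `andl $-16, %esp` does.
   As a 32-bit mask, -16 is 0xFFFFFFF0. */
uint32_t align_down_16(uint32_t addr) {
    return addr & ~(uint32_t)15;
}

/* How many bytes the AND removed: always in the range 0..15. */
uint32_t bytes_dropped(uint32_t addr) {
    return addr - align_down_16(addr);
}
```

Since the stack grows downwards, rounding the pointer down can only enlarge the frame, which is why the alignment step is safe after the subl.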
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
I don't know what all this does. alloca increases the stack frame size by altering the value of the stack pointer.
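One possible reading of the arithmetic, offered as a sketch rather than a confirmed explanation: the instructions take a requested size (0 here), add padding, and round to a multiple of 16 by shifting right and then left by 4, leaving the result in %eax before the call to __alloca. That __alloca takes its size in %eax on this toolchain is an assumption. The C below mirrors the instructions one by one; the function name is hypothetical.

```c
#include <stdint.h>

/* Mirrors the instruction sequence before `call __alloca`. */
uint32_t size_passed_to_alloca(uint32_t requested) {
    uint32_t eax = requested; /* movl $0, %eax */
    eax += 15;                /* addl $15, %eax */
    eax += 15;                /* addl $15, %eax */
    eax >>= 4;                /* shrl $4, %eax */
    eax <<= 4;                /* sall $4, %eax */
    return eax;               /* always a multiple of 16 */
}
```

For the requested size of 0 in this listing, the value left in %eax is 16.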
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
I think you know what this does. If not, the movl just before the call is moving the address of your string into the top location of the stack so that it may be retrieved by printf. It must be passed on the stack so that printf can use its address to infer the addresses of printf's other arguments (if any, which there aren't in this case).
leave
This instruction removes the stack frame talked about earlier. It is essentially movl %ebp, %esp followed by popl %ebp. There is also an enter instruction which can be used to construct stack frames, but gcc didn't use it. When stack frames aren't explicitly used, EBP may be used as a general purpose register, and instead of leave the compiler would just add the stack frame size to the stack pointer, which would decrease the stack size by the frame size.
ret
I don't need to explain this.
When you compile with optimizations
I'm sure you will recompile all of this with different optimization levels, so I will point out something that may happen that you will probably find odd. I have observed gcc replacing printf and fprintf with puts and fputs, respectively, when the format string did not contain any % and there were no additional parameters passed. This is because (for many reasons) it is much cheaper to call puts and fputs, and in the end you still get what you wanted printed.

Don't worry about the preamble/postamble - the part you're interested in is:
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
It should be pretty self-evident as to how this correlates with the original C code.

The first part is some initialization code, which does not make any sense in the case of your simple example. This code would be removed with an optimization flag.
The last part can be mapped to C code:
movl $0, -4(%ebp) // put 0 into variable i (located at -4(%ebp))
cmpl $0, -4(%ebp) // compare variable i with value 0
jne L2 // if they are not equal, skip to after the printf call
movl $LC0, (%esp) // put the address of "testing\n" at the top of the stack
call _printf // do call printf
L2:
movl $0, %eax // return 0 (calling convention: %eax has the return code)

Well, much of it is the overhead associated with the function. main() is just a function like any other, so it has to store the return address on the stack at the start, set up the return value at the end, etc.
I would recommend using GCC to generate mixed source code and assembler, which will show you the assembler generated for each source line.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
See http://www.delorie.com/djgpp/v2faq/faq8_20.html
On Linux, just use gcc. On Windows, download Cygwin http://www.cygwin.com/
Edit - see also this question Using GCC to produce readable assembly?
and http://oprofile.sourceforge.net/doc/opannotate.html

You need some knowledge of assembly language to understand the assembly generated by a C compiler.
This tutorial might be helpful

See here for more information. You can generate the assembly code with C comments for better understanding.
gcc -g -Wa,-adhls your_c_file.c > you_asm_file.s
This should help you a little.
