GCC compiler gives different results for Windows and Linux? [duplicate] - c

I wrote a simple code on a 64 bit machine
int main() {
printf("%d", 2.443);
}
So, this is how the compiler will behave. It will identify the second argument to be a double hence it will push 8 bytes on the stack or possibly just use registers across calls to access the variables. %d expects a 4 byte integer value, hence it prints some garbage value.
What is interesting is that the value printed changes everytime I execute this program. So what is happening? I expected it to print the same garbage value everytime not different ones everytime.

It's undefined behaviour, of course, to pass arguments not corresponding to the format, so the language cannot tell us why the output changes. We must look at the implementation, what code it produces, and possibly the operating system too.
My setup is different from yours,
Linux 3.1.10-1.16-desktop x86_64 GNU/Linux (openSuSE 12.1)
with gcc-4.6.2. But it's similar enough that it's reasonable to suspect the same mechanisms.
Looking at the generated assembly (-O3, out of habit), the relevant part (main) is
.cfi_startproc
subq $8, %rsp # adjust stack pointer
.cfi_def_cfa_offset 16
movl $.LC1, %edi # move format string to edi
movl $1, %eax # move 1 to eax, seems to be the number of double arguments
movsd .LC0(%rip), %xmm0 # move the double to the floating point register
call printf
xorl %eax, %eax # clear eax (return 0)
addq $8, %rsp # adjust stack pointer
.cfi_def_cfa_offset 8
ret # return
If instead of the double, I pass an int, not much changes, but that significantly
movl $47, %esi # move int to esi
movl $.LC0, %edi # format string
xorl %eax, %eax # clear eax
call printf
I have looked at the generated code for many variations of types and count of arguments passed to printf, and consistently, the first double (or promoted float) arguments are passed in xmmN, N = 0, 1, 2, and the integer (int, char, long, regardless of signedness) are passed in esi, edx, ecx, r8d, r9d and then the stack.
So I venture the guess that printf looks for the announced int in esi, and prints whatever happens to be there.
Whether the contents of esi are in any way predictable when nothing is moved there in main, and what they might signify, I have no idea.

This answer attempts to address some of the sources of variation. It is a follow-up to Daniel Fischer’s answer and some comments to it.
As I do not work with Linux, I cannot give a definitive answer. For a printf later in a large application, there would be a myriad of sources of potential variation. This early in a small application, there should be only a few.
Address space layout randomization (ASLR) is one: The operating system deliberately rearranges some memory randomly to prevent malware for knowing what addresses to use. I do not know if Linux 3.4.4-2 has this.
Another is environment variables. Your shell environment variables are copied into processes it spawns (and accessible through the getenv routine). A few of those might change automatically, so they would have slightly different values. This is unlikely to directly affect what printf sees when it attempts to use a missing integer argument, but there could be cascading effects.
There may be a shared-library loader that runs either before main is called or before printf is called. For example, if printf is in a shared library, rather than built into your executable file, then a call to printf likely actually results in a call to a stub routine that calls the loader. The loader looks up the shared library, finds the module containing printf, loads that module into your process’ address space, changes the stub so that it calls the newly loaded printf directly in the future (instead of calling the loader), and calls printf. As you can imagine, that can be a fairly extensive process and involves, among other things, finding and reading files on disk (all the directories to get to the shared library and the shared library). It is conceivable that some caching or file operations on your system result in slightly different behavior in the loader.
So far, I favor ASLR as the most likely candidate of the ones above. The latter two are likely to be fairly stable; the values involved would usually change occasionally, not frequently. ASLR would change each time, and simply leaving an address in a register would suffice to explain the printf behavior.
Here is an experiment: After the initial printf, insert another printf with this code:
printf("%d\n", 2.443);
int a;
printf("%p\n", (void *) &a);
The second printf prints the address of a, which is likely on the stack. Run the program two or three times and calculate the difference between the value printed by the first printf and the value printed by the second printf. (The second printf is likely to print in hexadecimal, so it might be convenient to change the first to "%x" to make it hexadecimal too.) If the value printed by the second printf varies from run to run, then your program is experiencing ASLR. If the values change from run to run but the difference between them remains constant, then the value that printf has happened upon in the first printf is some address in your process that was left lying around after program initialization.
If the address of a changes but the difference does not remain constant, you might try changing int a; to static int a; to see if comparing the first value to different part of your address space yields a better result.
Naturally, none of this is useful for writing reliable programs; it is just educational with regard to how program loading and initialization works.

Related

Dynamic allocation of structure array inside another structure [duplicate]

I wrote a simple code on a 64 bit machine
int main() {
printf("%d", 2.443);
}
So, this is how the compiler will behave. It will identify the second argument to be a double hence it will push 8 bytes on the stack or possibly just use registers across calls to access the variables. %d expects a 4 byte integer value, hence it prints some garbage value.
What is interesting is that the value printed changes everytime I execute this program. So what is happening? I expected it to print the same garbage value everytime not different ones everytime.
It's undefined behaviour, of course, to pass arguments not corresponding to the format, so the language cannot tell us why the output changes. We must look at the implementation, what code it produces, and possibly the operating system too.
My setup is different from yours,
Linux 3.1.10-1.16-desktop x86_64 GNU/Linux (openSuSE 12.1)
with gcc-4.6.2. But it's similar enough that it's reasonable to suspect the same mechanisms.
Looking at the generated assembly (-O3, out of habit), the relevant part (main) is
.cfi_startproc
subq $8, %rsp # adjust stack pointer
.cfi_def_cfa_offset 16
movl $.LC1, %edi # move format string to edi
movl $1, %eax # move 1 to eax, seems to be the number of double arguments
movsd .LC0(%rip), %xmm0 # move the double to the floating point register
call printf
xorl %eax, %eax # clear eax (return 0)
addq $8, %rsp # adjust stack pointer
.cfi_def_cfa_offset 8
ret # return
If instead of the double, I pass an int, not much changes, but that significantly
movl $47, %esi # move int to esi
movl $.LC0, %edi # format string
xorl %eax, %eax # clear eax
call printf
I have looked at the generated code for many variations of types and count of arguments passed to printf, and consistently, the first double (or promoted float) arguments are passed in xmmN, N = 0, 1, 2, and the integer (int, char, long, regardless of signedness) are passed in esi, edx, ecx, r8d, r9d and then the stack.
So I venture the guess that printf looks for the announced int in esi, and prints whatever happens to be there.
Whether the contents of esi are in any way predictable when nothing is moved there in main, and what they might signify, I have no idea.
This answer attempts to address some of the sources of variation. It is a follow-up to Daniel Fischer’s answer and some comments to it.
As I do not work with Linux, I cannot give a definitive answer. For a printf later in a large application, there would be a myriad of sources of potential variation. This early in a small application, there should be only a few.
Address space layout randomization (ASLR) is one: The operating system deliberately rearranges some memory randomly to prevent malware for knowing what addresses to use. I do not know if Linux 3.4.4-2 has this.
Another is environment variables. Your shell environment variables are copied into processes it spawns (and accessible through the getenv routine). A few of those might change automatically, so they would have slightly different values. This is unlikely to directly affect what printf sees when it attempts to use a missing integer argument, but there could be cascading effects.
There may be a shared-library loader that runs either before main is called or before printf is called. For example, if printf is in a shared library, rather than built into your executable file, then a call to printf likely actually results in a call to a stub routine that calls the loader. The loader looks up the shared library, finds the module containing printf, loads that module into your process’ address space, changes the stub so that it calls the newly loaded printf directly in the future (instead of calling the loader), and calls printf. As you can imagine, that can be a fairly extensive process and involves, among other things, finding and reading files on disk (all the directories to get to the shared library and the shared library). It is conceivable that some caching or file operations on your system result in slightly different behavior in the loader.
So far, I favor ASLR as the most likely candidate of the ones above. The latter two are likely to be fairly stable; the values involved would usually change occasionally, not frequently. ASLR would change each time, and simply leaving an address in a register would suffice to explain the printf behavior.
Here is an experiment: After the initial printf, insert another printf with this code:
printf("%d\n", 2.443);
int a;
printf("%p\n", (void *) &a);
The second printf prints the address of a, which is likely on the stack. Run the program two or three times and calculate the difference between the value printed by the first printf and the value printed by the second printf. (The second printf is likely to print in hexadecimal, so it might be convenient to change the first to "%x" to make it hexadecimal too.) If the value printed by the second printf varies from run to run, then your program is experiencing ASLR. If the values change from run to run but the difference between them remains constant, then the value that printf has happened upon in the first printf is some address in your process that was left lying around after program initialization.
If the address of a changes but the difference does not remain constant, you might try changing int a; to static int a; to see if comparing the first value to different part of your address space yields a better result.
Naturally, none of this is useful for writing reliable programs; it is just educational with regard to how program loading and initialization works.

Why does this assembly code segfaults upon printf [duplicate]

What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax
The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps doing and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that the only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.
Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just an infinite loop, your function will eventually go beyond its memory constraints since a call command writes the current rip (instruction pointer) to the stack. Eventually, you'll overflow the stack.

ARMv8 illegal instruction [duplicate]

What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax
The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps doing and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that the only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.
Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just an infinite loop, your function will eventually go beyond its memory constraints since a call command writes the current rip (instruction pointer) to the stack. Eventually, you'll overflow the stack.

How does argument passing work?

I want to know how passing arguments to functions in C works. Where are the values being stored and how and they retrieved? How does variadic argument passing work? Also since it's related: what about return values?
I have a basic understanding of CPU registers and assembler, but not enough that I thoroughly understand the ASM that GCC spits back at me. Some simple annotated examples would be much appreciated.
Considering this code:
int foo (int a, int b) {
return a + b;
}
int main (void) {
foo(3, 5);
return 0;
}
Compiling it with gcc foo.c -S gives the assembly output:
foo:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
movl 8(%ebp), %edx
leal (%edx,%eax), %eax
popl %ebp
ret
main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $5, 4(%esp)
movl $3, (%esp)
call foo
movl $0, %eax
leave
ret
So basically the caller (in this case main) first allocates 8 bytes on the stack to accomodate the two arguments, then puts the two arguments on the stack at the corresponding offsets (4 and 0), and then the call instruction is issued which transfers the control to the foo routine. The foo routine reads its arguments from the corresponding offsets at the stack, restores it, and puts its return value in the eax register so it's available to the caller.
That is platform specific and part of the "ABI". In fact, some compilers even allow you to choose between different conventions.
Microsoft's Visual Studio, for example, offers the __fastcall calling convention, which uses registers. Other platforms or calling conventions use the stack exclusively.
Variadic arguments work in a very similar way - they are passed via registers or stack. In case of registers, they are usually in ascending order, based on type. If you have something like (int a, int b, float c, int d), a PowerPC ABI might put a in r3, b in r4, d in r5, and c in fp1 (I forgot where float registers start, but you get the idea).
Return values, again, work the same way.
Unfortunately, I don't have many examples, most of my assembly is in PowerPC, and all you see in the assembly is the code going straight for r3, r4, r5, and placing the return value in r3 as well.
Your questions are more than anybody could reasonably try to answer in a SO post, not to mention that it's implementation defined as well.
However, if you're interested in the x86 answer might I suggest you watch this Stanford CS107 Lecture titled Programming Paradigms where all the answers to the questions you posed will be explained in great detail (and quite eloquently) in the first 6-8 lectures.
It depends on your compiler, the target architecture and OS you’re compiling for, and whether your compiler supports non-standard extensions that change the calling convention. But there are some commonalities.
The C calling convention is usually established by the vendor of the operating system, because they need to decide what convention the system libraries use.
More recent CPUs (such as ARM or PowerPC) tend to have their calling conventions defined by the CPU vendor and compatible across different operating systems. x86 is an exception to this: different systems use different calling conventions. There used to be a lot more calling conventions for the 16-bit 8086 and 32-bit 80386 than there are for x86_64 (although even that is not down to one). 32-bit x86 Windows programs sometimes use multiple calling conventions within the same program.
Some observations:
An example of an operating system that supports several different ABIs with different calling conventions simultaneously, some of which follow the same conventions as other OSes for the same architecture, is Linux for x86_64. This can host three different major ABIs (i386, x32 and x86_64), two of which are the same as other operating systems for the same CPU, and several variants.
An exception to the rule that there's one system calling convention used for everything is 16- and 32-bit versions of MS Windows, which inherited some of the proliferation of calling conventions from MS-DOS. The Windows C API uses a different calling convention (STDCALL, originally FAR PASCAL) than the “C” calling convention for the same platform, and also supports FORTRAN and FASTCALL conventions. All four come in NEAR and FAR variants on 16-bit OSes. Nearly all Windows programs therefore use at least two different conventions in the same program.
Architectures with a lot of registers, including classic RISC and nearly all modern ISAs, use several of those registers to pass and return function arguments.
Architectures with few or no general-purpose registers often pass arguments on the stack, pointed to by a stack pointer. CISC architectures often have instructions to call and return which store the return address on the stack. (RISC architectures typically store the return address in a "link register", which the callee can save/restore manually if it's not a leaf function.)
A common variant is for tail calls, functions whose return value is also the return value of the caller, to jump to the next function (so it returns to our parent function) instead of calling it and then returning after it returns. Placing args in the right places has to account for the return address already being on the stack, where a call instruction would place it.
This is especially true of tail-recursive calls, which have exactly the same stack frame on each invocation. A tail-recursive call is typically equivalent to a loop: update a few registers that changed, then jump back to the entry point. They do not need to create a new stack frame, or have their own return address: you can simply update the caller’s stack frame and use its return address as the tail call’s. i.e. tail-recursion easily optimizes into a loop.
Some architectures with only a few registers nevertheless defined an alternative calling convention that could pass one or two arguments in registers. This was FASTCALL on MS-DOS and Windows.
A few older ISAs, such as SPARC, had a special bank of “windowed” registers, so that every function has its own bank of input and output registers, and when it made a function call, the caller’s outputs became the callee’s inputs, and the reverse when it came time to return a value. Modern superscalar designs consider this more trouble than it’s worth.
A few very old architectures used self-modifying code in their calling convention, and the first edition of The Art of Computer Programming followed this model for its abstract language. It no longer works on most modern CPUs, which have instruction caches.
A few other very old architectures had no stack and generally could not call the same function again, re-entering it, until it returned.
A function with a lot of arguments almost always puts most of them onto the stack.
C functions that put arguments on the stack almost have to push them in reverse order and have the caller clean up the stack. The called function might not even know exactly how many arguments are on the stack! That is, if you call printf("%d\n", x); the compiler will push x, then the format string, then the return address, onto the stack. This guarantees that the first argument is at a known offset from the stack pointer and <varargs.h> has the information it needs to work.
Most other languages, and therefore some operating systems that C compilers support, do it the other way around: arguments are pushed from left to right. The function being called usually cleans up its own stack frame. This used to be called the PASCAL convention on MS-DOS, and survives as the STDCALL convention on Windows. It cannot support variadic functions. (https://en.wikibooks.org/wiki/X86_Disassembly/Calling_Conventions)
Fortran and a few other language historically passed all arguments by reference, which translates to C as pointer arguments. Compilers that might need to interface with these other languages often support these foreign calling conventions.
Because a major source of bugs was “smashing the stack,” many compilers now have a way to add canary values (which, like a canary in a coal mine, warn you that something dangerous is going on if anything happens to them) and other means of detecting when code tampers with the stack frame.
Another form of variation across different platforms is whether the stack frame will contain all the information it needs for a debugger or exception-handler to backtrace, or whether that info will be in separate metadata (or not present at all) allowing simplification of function prologue/epilogue (-fomit-frame-pointer).
You can get cross-compilers to emit code using different calling conventions, and compare them, with switches such as -S -target (on clang).
Basically, C passes arguments by pushing them on the stack. For pointer types, the pointer is pushed on the stack.
One things about C is that the caller restores the stack rather the function being called. This way, the number of arguments can vary and the called function doesn't need to know ahead of time how many arguments will be passed.
Return values are returned in the AX register, or variations thereof.

Why does gcc use movl instead of push to pass function args?

pay attention to this code :
#include <stdio.h>
void a(int a, int b, int c)
{
char buffer1[5];
char buffer2[10];
}
int main()
{
a(1,2,3);
}
after that :
gcc -S a.c
that command shows our source code in assembly.
now we can see in the main function, we never use "push" command to push the arguments of
the a function into the stack. and it used "movel" instead of that
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
call a
leave
why does it happen?
what's difference between them?
Here is what the gcc manual has to say about it:
-mpush-args
-mno-push-args
Use PUSH operations to store outgoing parameters. This method is shorter and usually
equally fast as method using SUB/MOV operations and is enabled by default.
In some cases disabling it may improve performance because of improved scheduling
and reduced dependencies.
-maccumulate-outgoing-args
If enabled, the maximum amount of space required for outgoing arguments will be
computed in the function prologue. This is faster on most modern CPUs because of
reduced dependencies, improved scheduling and reduced stack usage when preferred
stack boundary is not equal to 2. The drawback is a notable increase in code size.
This switch implies -mno-push-args.
Apparently -maccumulate-outgoing-args is enabled by default, overriding -mpush-args. Explicitly compiling with -mno-accumulate-outgoing-args does revert to the PUSH method, here.
2019 update: modern CPUs have had efficient push/pop since about Pentium M.
-mno-accumulate-outgoing-args (and using push) eventually became the default for -mtune=generic in Jan 2014.
That code is just directly putting the constants (1, 2, 3) at offset positions from the (updated) stack pointer (esp). The compiler is choosing to do the "push" manually with the same result.
"push" both sets the data and updates the stack pointer. In this case, the compiler is reducing that to only one update of the stack pointer (vs. three). An interesting experiment would be to try changing function "a" to take only one argument, and see if the instruction pattern changes.
gcc does all sorts of optimizations, including selecting instructions based upon execution speed of the particular CPU being optimized for. You will notice that things like x *= n is often replaced by a mix of SHL, ADD and/or SUB, especially when n is a constant; while MUL is only used when the average runtime (and cache/etc. footprints) of the combination of SHL-ADD-SUB would exceed that of MUL, or n is not a constant (and thus using loops with shl-add-sub would come costlier).
In case of function arguments: MOV can be parallelized by hardware, while PUSH cannot. (The second PUSH has to wait for the first PUSH to finish because of the update of the esp register.) In case of function arguments, MOVs can be run in parallel.
Is this on OS X by any chance? I read somewhere that it requires the stack pointer to be aligned at 16-byte boundaries. That could possibly explain this kind of code generation.
I found the article: http://blogs.embarcadero.com/eboling/2009/05/20/5607

Resources