Why does main initialize stack frame when there are no variables

why does this code:
#include "stdio.h"
int main(void) {
puts("Hello, World!");
}
decide to initialize a stack frame? Here is the assembly code:
.LC0:
.string "Hello, World!"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC0
call puts
mov eax, 0
pop rbp
ret
Why does the compiler initialize a stack frame only for it to be destroyed later, without it ever being used? Leaving it out surely won't cause any errors outside of the main function, because I never use the stack. Why is it compiled this way?

Having these steps in every compiled function is the "baseline" for the compiler, unoptimized. It looks clean in disassembly, and makes sense. However, the compiler can optimize the output to reduce overhead from code that has no real effect. You can see this by compiling with different optimization levels.
What you got is like this:
.LC0:
.string "Hello, World!"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC0
call puts
mov eax, 0
pop rbp
ret
That's compiled in GCC with no optimization.
Adding the flag -O4 (which GCC treats the same as -O3) gives this output:
.LC0:
.string "Hello, World!"
main:
sub rsp, 8
mov edi, OFFSET FLAT:.LC0
call puts
xor eax, eax
add rsp, 8
ret
You'll notice that this still moves the stack pointer, but it skips changing the base pointer, and avoids the time-consuming memory access associated with that.
The stack is assumed to be aligned on a 16-byte boundary. With the return address having been pushed, this leaves another 8 bytes to be subtracted to get to the boundary before the function call.
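The alignment arithmetic can be checked with a small sketch. The helper name padding_before_call is hypothetical, and the sketch assumes the x86-64 System V rule that rsp is a multiple of 16 at each call instruction, so a function starts with 8 bytes (the return address) already on the stack.

```c
#include <stddef.h>

/* Hypothetical helper: bytes_since_aligned counts everything pushed since
 * the stack pointer was last 16-byte aligned (return address included).
 * Returns the extra bytes to subtract so the next call sees an aligned rsp. */
size_t padding_before_call(size_t bytes_since_aligned) {
    return (16 - bytes_since_aligned % 16) % 16;
}
```

With only the return address on the stack, padding_before_call(8) gives 8, which matches the sub rsp, 8 in the optimized output. After push rbp there are 16 bytes, so the unoptimized version needs no further adjustment.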

It's very common for compilers to generate unoptimized code in the least complicated way possible (or at least the least complicated way that doesn't lead to code that's so bad that the optimizer won't be able to fix it) to keep the code simple and to stick to the one-responsibility principle (in the sense that making code more efficient is the optimizer's job).
Generating code to initialize the stack for all functions is less complicated than only doing so where necessary. Since the optimizer will be able to remove the unnecessary code anyway (and it will do so in more cases than a simple "does this function have any local variables?" check would), generating the unnecessary code won't have any effect as long as optimizations are enabled (and if they're not, it's expected that the generated code will contain inefficiencies).
If we did add a "does this function have any local variables?" check to the function that generates the stack-initialization code, we'd be re-inventing a less powerful version of an optimization that the optimizer already performs anyway, so we'd be violating the one-responsibility principle and increasing the complexity of the part of the compiler that could otherwise be relatively simple (as opposed to the optimizer, which is full of complicated algorithms anyway).

The stack frame makes it possible to inspect the call stack during runtime. This is useful:
when debugging
in code that relies on __builtin_frame_address(level) with level > 0
As already pointed out by others, a compiler may omit the stackframe on higher optimization levels.
See also:
How do you get gcc's __builtin_frame_address to work with -O2?
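A minimal sketch of reading a frame address, assuming GCC or Clang (both provide __builtin_frame_address). Only level 0 is used here, since higher levels depend on the callers keeping frame pointers. The function names are made up for illustration.

```c
#include <stdint.h>

/* noinline (a GCC/Clang attribute) keeps each function's frame distinct. */
__attribute__((noinline)) uintptr_t callee_frame(void) {
    return (uintptr_t)__builtin_frame_address(0);
}

/* Returns 1 when caller and callee report two different, non-null frames. */
__attribute__((noinline)) int frames_differ(void) {
    uintptr_t mine = (uintptr_t)__builtin_frame_address(0);
    uintptr_t theirs = callee_frame();
    return mine != 0 && theirs != 0 && mine != theirs;
}
```

Using the builtin in a function makes the compiler keep a frame for that function, which is the kind of dependency on stack frames this answer describes.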

Related

CSAPP: Why use subq followed by movq when we have pushq already [duplicate]

I believe push/pop instructions will result in more compact code, and maybe it will even run slightly faster. This requires disabling stack frames as well though.
To check this, I will need to either rewrite a large enough program in assembly by hand (to compare them), or to install and study a few other compilers (to see if they have an option for this, and to compare the results).
Here is the forum topic about this and similar problems.
In short, I want to understand which code is better. Code like this:
sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c
or code like this:
push eax
push ecx
push edx
...
add esp, c
What compiler can produce the second kind of code? They usually produce some variation of the first one.

You're right, push is a minor missed-optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly performance to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.
But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!
Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, that gums-up a super-scalar core. (Hans Passant)
This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.
Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)
It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.
On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)
It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store / load bandwidth to L1d.
int extfunc(int *,int *);
void foo() {
int a=1, b=2;
extfunc(&a, &b);
}
compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many different missed-optimizations and silly code-gen (see my comments in the C source pointing out some of them):
# compiled for the x86-64 System V calling convention:
# integer args in rdi, rsi (,rdx, rcx, r8, r9)
push rax # clang / ICC ALREADY use push instead of sub rsp,8
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1 # 6 bytes: opcode + modrm + imm32
mov rsi, rsp # special case for lea rsi, [rsp + 0]
mov dword ptr [rsi], 2
call extfunc(int*, int*)
pop rax # and POP instead of add rsp,8
ret
And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention which reserves shadow space instead of having a red-zone).
clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space it reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.
But anyway, more optimal would be:
push 2 # only 2 bytes
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1
mov rsi, rsp # special case for lea rsi, [rsp + 0]
call extfunc(int*, int*)
# ... later accesses would use [rsp] and [rsp+] if needed, not pop
pop rax # alternative to add rsp,8
ret
The push is an 8-byte store, and we overlap half of it. This is not a problem, CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything without caring about whether or not there's overlap.
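The overlap can be modelled in C as a sketch. It assumes a little-endian machine (as x86 is), and the struct and function names are hypothetical. It only shows that the final memory contents are what the two locals need. It says nothing about store-forwarding performance.

```c
#include <stdint.h>
#include <string.h>

/* b sits at offset 0 (like [rsp]), a at offset 4 (like [rsp+4]). */
typedef struct { int32_t b; int32_t a; } Locals;

/* Models `push 2` followed by `mov dword ptr [rsp+4], 1`. */
Locals push_then_overlap(void) {
    unsigned char slot[8];
    int64_t pushed = 2;          /* push imm8 is sign-extended to 8 bytes */
    int32_t one = 1;
    Locals out;
    memcpy(slot, &pushed, 8);    /* the 8-byte store done by push */
    memcpy(slot + 4, &one, 4);   /* 4-byte store overlapping the high half */
    memcpy(&out.b, slot, 4);     /* low half still holds 2 */
    memcpy(&out.a, slot + 4, 4); /* high half now holds 1 */
    return out;
}
```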
Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.
We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)
I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.
I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)
If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.
We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)
An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset 8 + 16*1 = 24 bytes, so we could do
bar3:
push 3
push 2 # don't interleave mov in here; extra stack-sync uops
push 1
mov rdi, rsp
lea rsi, [rsp+8]
lea rdx, [rdi+16] # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
call extfunc3(int*,int*,int*)
add rsp, 24
ret
This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.
push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)
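The byte counts can be tallied from the raw encodings. This is only a sketch: the array and function names are made up, and it assumes every mov store needs a disp8 (a store to [rsp] with no displacement is one byte shorter).

```c
#include <stddef.h>

/* Machine-code bytes for the x86-64 instructions being compared. */
static const unsigned char push_imm8[]     = {0x6A, 0x01};             /* push 1 */
static const unsigned char mov_rsp_disp8[] = {0xC7, 0x44, 0x24, 0x08,  /* mov dword ptr [rsp+8], 1 */
                                              0x01, 0x00, 0x00, 0x00};
static const unsigned char sub_rsp_imm8[]  = {0x48, 0x83, 0xEC, 0x18}; /* sub rsp, 24 */

/* Code size for storing n_values small constants with push only. */
size_t push_bytes(size_t n_values) {
    return n_values * sizeof push_imm8;
}

/* Code size for sub rsp followed by one mov store per value. */
size_t sub_mov_bytes(size_t n_values) {
    return sizeof sub_rsp_imm8 + n_values * sizeof mov_rsp_disp8;
}
```

For three values this gives 6 bytes for the pushes against 28 bytes for sub plus three mov stores, under the stated assumption.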
One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)
However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted, we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.
A tiny array of long, with non-constant contents
int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
long arr[] = {a,b,c};
ext_longarr(arr);
}
gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:
longarr_arg(long, long, long): # @longarr_arg(long, long, long)
sub rsp, 24
mov rax, rsp # this is clang being silly
mov qword ptr [rax], rdi # it could have used [rsp] for the first store at least,
mov qword ptr [rax + 8], rsi # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
mov qword ptr [rax + 16], rdx
mov rdi, rax
call ext_longarr(long*)
add rsp, 24
ret
But it could have stored an array of the args like this:
longarr_arg_handtuned:
push rdx
push rsi
push rdi # leave stack 16B-aligned
mov rdi, rsp # pass a pointer to the array (the pushed args) as the first arg
call ext_longarr(long*)
add rsp, 24
ret
With more args, we start to get more noticeable benefits especially in code-size when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.
Reasons for not building stack frames gradually
(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.
But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.
This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.
Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.
If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red-zone, with a different offset)
Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.
"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.
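As a rough illustration of how little bookkeeping is involved, here is a sketch with hypothetical names. Each slot is identified by the stack depth right after it was pushed, and its current rsp-relative offset is the difference from the current depth.

```c
#include <stddef.h>

/* Hypothetical bookkeeping: depth counts bytes pushed since function entry. */
typedef struct { size_t depth; } StackTracker;

/* Records a push of `bytes` and returns the slot id: the depth after the push. */
size_t tracker_push(StackTracker *t, size_t bytes) {
    t->depth += bytes;
    return t->depth;
}

/* Offset to use in [rsp + offset] for a slot, given the current depth. */
size_t tracker_rsp_offset(const StackTracker *t, size_t slot) {
    return t->depth - slot;
}
```

Pushing three 8-byte values in the order 3, 2, 1 leaves them at offsets 16, 8 and 0, the same layout the bar3 example addresses through rsp.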
Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack layout choices.
Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.

This requires disabling stack frames as well though.
It doesn't, actually. Simple stack frame initialisation can use either enter or push ebp \ mov ebp, esp \ sub esp, x (or instead of the sub, a lea esp, [ebp - x] can be used). Instead of, or in addition to, these, values can be pushed onto the stack to initialise the variables, or any register can be pushed just to move the stack pointer without initialising the slot to a particular value.
Here's an example (for 16-bit 8086 real/V86 mode) from one of my projects: https://bitbucket.org/ecm/symsnip/src/ce8591f72993fa6040296f168c15f3ad42193c14/binsrch.asm#lines-1465
save_slice_farpointer:
[...]
.main:
[...]
lframe near
lpar word, segment
lpar word, offset
lpar word, index
lenter
lvar word, orig_cx
push cx
mov cx, SYMMAIN_index_size
lvar word, index_size
push cx
lvar dword, start_pointer
push word [sym_storage.main.start + 2]
push word [sym_storage.main.start]
The lenter macro sets up (in this case) only push bp \ mov bp, sp and then lvar sets up numeric defs for offsets (from bp) to variables in the stack frame. Instead of subtracting from sp, I initialise the variables by pushing into their respective stack slots (which also reserves the stack space needed).

Optimizing a C function call using 64-bit MASM

Currently using this 64-bit MASM code to call a C runtime function such as memcmp(). I recall this convention was from a GoAsm article on optimizations.
memcmp PROTO;:QWORD,:QWORD,:QWORD
PUSH RSP
PUSH QWORD PTR [RSP]
AND SPL,0F0h
MOV R8,R11
MOV RDX,R10
MOV RCX,RAX
SUB RSP,32
CALL memcmp
LEA RSP,[RSP+40]
POP RSP
Is this a valid optimized version below?
memcmp PROTO;:QWORD,:QWORD,:QWORD
PUSH RSP
PUSH QWORD PTR [RSP]
AND RSP,-16 ; new
MOV R8,R11
MOV RDX,R10
MOV RCX,RAX
LEA RSP,[RSP-32] ; new
CALL memcmp
LEA RSP,[RSP+40]
POP RSP
The justification for replacing
AND SPL,0F0h
with
AND RSP,-16
is that it avoids partial register updates (see: Understanding fastcall stack frame).
Replacing
SUB RSP,32
with
LEA RSP,[RSP-32]
is justified because, if the ensuing instructions do not depend on the flags being updated by the subtraction, then not updating the flags will be more efficient as well (see: Why does GCC emit "lea" instead of "sub" for subtraction?).
In this case, are there other optimization tricks too?

AND yes, the original code was silly and not saving any code-size (SPL takes a REX prefix, too, like 64-bit operand-size).
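A small C model, with made-up function names, shows that the two AND forms compute the same aligned value. Masking only the low byte cannot differ from masking the whole register, because the upper 56 bits are untouched either way.

```c
#include <stdint.h>

/* and rsp, -16: clear the low 4 bits of the full 64-bit value. */
uint64_t align_full(uint64_t rsp) {
    return rsp & (uint64_t)-16;
}

/* and spl, 0F0h: mask only the low byte and leave the upper 56 bits alone. */
uint64_t align_low_byte(uint64_t rsp) {
    uint8_t spl = (uint8_t)((uint8_t)rsp & 0xF0);
    return (rsp & ~(uint64_t)0xFF) | spl;
}
```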
LEA - pointless and a waste of code-size: x86 CPUs already avoid false dependencies on FLAGS via register renaming; that's necessary to efficiently run normal x86 code which is full of instructions like add, sub, and, etc. Compilers would use lea much more heavily if that wasn't the case. The answer on that linked Q&A is wrong and should be downvoted / deleted. The only danger is on a few less-common CPUs (Pentium 4 and Silvermont for different reasons) from instructions like inc that only write some flags. (INC instruction vs ADD 1: Does it matter?). Even the cost of inc on Silvermont-family is pretty minor, just an extra uop but not during decode, so it doesn't stall.
add is not slower than lea on any CPUs, either itself or in its influence on later instructions. (Except in-order Atom pre-Silvermont, where lea ran earlier in the pipeline than add (on an actual AGU), so it could be better or worse depending on where data was coming from / going to). You'd only use lea in some cases like an adc loop where you actually need to keep CF unchanged so next iteration can read it. i.e. to not mess up a true dependency (RAW), nothing to do with avoiding a false (WAW) output dependency. (See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs - note that cases where adc / inc / adc creates a partial-flag stall are cases where add would cause a correctness problem, so I'm not counting that as a case where add would make later instructions faster.)
You probably don't need to save the old RSP; the ABI requires 16-byte stack alignment before a call, and that includes your caller (unless you're getting called from code that doesn't follow the ABI, so you don't have known RSP alignment relative to a 16-byte boundary).
Normally you'd just do sub rsp, 40 like a compiler would, to realign RSP and reserve space for the shadow space. (And you'd do this at the top/bottom of the function, not around every call, along with saving/restoring call-preserved registers).
(In practice memcmp is unlikely to care about stack alignment, unless it needs to save/restore some more XMM regs. The Windows x64 calling convention unwisely only has 6 call-clobbered x/ymm registers, and that might be slightly tight depending on how much loop unrolling they do in a hand-written(?) memcmp.)
And even if you did need to handle an unknown incoming RSP alignment, saving RSP to two different locations for pop rsp is still not a very efficient way to go about it. Normally you'd just use RBP to make a traditional frame pointer to clean up with mov rsp, rbp / pop rbp, which works regardless of unknown adjustment to RSP. e.g. even in functions that use alloca (or in asm, that do an unknown number of pushes or variable-sized sub rsp, which is effectively the same thing as and rsp, -16).
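The sub rsp, 40 figure mentioned earlier in this answer follows from simple arithmetic, sketched here with a hypothetical helper. It assumes the Windows x64 convention: 32 bytes of shadow space, and rsp sitting 8 bytes past a 16-byte boundary on function entry because of the return address.

```c
#include <stddef.h>

/* Smallest frame size that holds the shadow space plus local_bytes and
 * leaves rsp 16-byte aligned before the next call. */
size_t win64_frame_bytes(size_t local_bytes) {
    size_t n = 32 + local_bytes;   /* shadow space plus locals */
    while ((n + 8) % 16 != 0) {    /* 8 = return address already pushed */
        n++;
    }
    return n;
}
```

With no locals this yields 40, and with 16 bytes of locals it yields 56.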

Locking register usage for a certain section of code [closed]

Let's consider a situation where we are writing in C code. When the compiler encounters a function call, my understanding is that it does the following:
Push all registers onto the stack
Jump to new function, do stuff in there
Pop old context off the stack back into the registers.
Now, some processors have 1 working register, some 32, some more than that. I'm mostly concerned with the larger number of registers. If my processor has 32 registers, the compiler will need to emit 32 push and pop instructions, just as base overhead for a function call. It would be nice if I could trade some compilation flexibility[1] in the function for less push and pop instructions. That is to say, I would like a way that I could tell the compiler "For function foo(), only use 4 registers." This would imply that the compiler would only need to push/pop 4 registers before jumping to foo().
I realize this is pretty silly to worry about on a modern PC, but I am thinking more for a low speed embedded system where you might be servicing an interrupt very quickly, or calling a simple function over and over. I also realize this could very quickly become an architecture dependent feature. Processors that use a "Source Source -> Dest" instruction set (Like ARM), as opposed to an accumulator (Like Freescale/NXP HC08) might have some lower limit on the number of registers we allow functions to use.
I do know the compiler uses tricks like inlining small functions to increase speed, and I realize I could inform most compilers to not generate the push/pop code and just hand code it myself in assembly, but my question focuses on instructing the compiler to do this from "C-Land".
My question is, are there compilers that allow this? Is this even necessary with optimizing compilers (do they already do this)?
[1] Compilation flexibility: By reducing the number of registers available to the compiler to use in a function body, you are restricting its flexibility, and it might need to utilize the stack more since it can't just use another register.

When it comes to compilers, registers and function calls you can generally think of the registers falling into one of three categories: "hands off", volatile and non-volatile.
The "hands off" category are those that the compiler will not generally be futzing around with unless you explicitly tell it to (such as with inline assembly). These may include debugging registers and other special purpose registers. The list will vary from platform to platform.
The volatile (or scratch / call-clobbered / caller-saved) set of registers are those that a function can futz around with without the need for saving. That is, the caller understands that the contents of those registers might not be the same after the function call. Thus, if the caller has any data in those registers that it wants to keep, it must save that data before making the call and then restore it after. On a 32-bit x86 platform, these volatile registers (sometimes called scratch registers) are usually EAX, ECX and EDX.
The non-volatile (or call-preserved or callee-saved) set of registers are those that a function must save before using them and restore to their original values before returning. They only need to be saved/restored by the called function if it uses them. On a 32-bit x86 platform, these are usually the remaining general purpose registers: EBX, ESI, EDI, ESP, EBP.
Hope this helps.
(I meant to just add a small example, but quickly got carried away. I would add my own answer if this question wasn't closed, but I'm going to leave this long section here because I think it's interesting. Condense it or edit it out entirely if you don't want it in your answer -- Peter)
For a more concrete example, the SysV x86-64 ABI is well-designed (with args passed in registers, and a good balance of call-preserved vs. scratch/arg regs). There are some other links in the x86 tag wiki explaining what ABIs / calling conventions are all about.
Consider a simple example with function calls that can't be inlined (because the definition isn't available):
int foo(int);
int bar(int a) {
return 5 * foo(a+2) + foo (a) ;
}
It compiles (on godbolt with gcc 5.3 for x86-64 with -O3) to the following:
## gcc output
# AMD64 SysV ABI: first arg in e/rdi, return value in e/rax
# the call-preserved regs used are: rbp and rbx
# the scratch regs used are: rdx. (arg-passing / return regs are not call-preserved)
push rbp # save a call-preserved reg
mov ebp, edi # stash `a` in a call-preserved reg
push rbx # save another call-preserved reg
lea edi, [rdi+2] # edi=a+2 as an arg for foo. `add edi, 2` would also work, but they're both 3 bytes and little perf difference
sub rsp, 8 # align the stack to a 16B boundary (the two pushes are 8B each, and call pushes an 8B return address, so another 8B is needed)
call foo # eax=foo(a+2)
mov edi, ebp # edi=a as an arg for foo
mov ebx, eax # stash foo(a+2) in ebx
call foo # eax=foo(a)
lea edx, [rbx+rbx*4] # edx = 5*foo(a+2), using the call-preserved register
add rsp, 8 # undo the stack offset
add eax, edx # the add between the two function-call results
pop rbx # restore the call-preserved regs we saved earlier
pop rbp
ret # return value in eax
As usual, compilers could do better: instead of stashing foo(a+2) in ebx to survive the 2nd call to foo, it could have stashed 5*foo(a+2) with a single instruction (lea ebx, [rax+rax*4]). Also, only one call-preserved register is needed, since we don't need a after the 2nd call. This removes a push/pop pair, and also the sub rsp,8 / add rsp,8 pair. (gcc bug report already filed for this missed optimization)
## Hand-optimized implementation (still ABI-compliant):
push rbx # save a call-preserved reg; also aligns the stack
lea ebx, [rdi+2] # stash ebx=a+2
call foo # eax=foo(a)
mov edi, ebx # edi=a+2 as an arg for foo
mov ebx, eax # stash foo(a) in ebx, replacing `a+2` which we don't need anymore
call foo # eax=foo(a+2)
lea eax, [rax+rax*4] #eax=5*foo(a+2)
add eax, ebx # eax=5*foo(a+2) + foo(a)
pop rbx # restore the call-preserved regs we saved earlier
ret # return value in eax
Note that the call to foo(a) happens before foo(a+2) in this version. It saved an instruction at the start (since we can pass on our arg unchanged to the first call to foo), but removed a potential saving later (since the multiply-by-5 now has to happen after the second call, and can't be combined with moving into the call-preserved register).
I could get rid of an extra mov if it was 5*foo(a) + foo(a+2). With the expression as I wrote it, I can't combine arithmetic with data movement (using lea) in every case. Or I'd need to both save a and do a separate add edi,2 before the first call.

Push all registers onto the stack
No. In the vast majority of function calls in optimized code, only a small fraction of all registers are pushed on the stack.
I'm mostly concerned with the larger number of registers.
Do you have any experimental evidence to support this concern? Is this a performance bottleneck?
I could trade some compilation flexibility[1] in the function for less
push and pop instructions.
Modern compilers use sophisticated inter-procedural register allocation. By limiting the number of registers, you will most likely degrade performance.
I realize this is pretty silly to worry about on a modern PC, but I am
thinking more for a low speed embedded system where you might be
servicing an interrupt very quickly, or calling a simple function over
and over.
This is very vague. You have to show the "simple" function, all call sites and specify the compiler and the target embedded system. You need to measure performance (compared to hand-written assembly code) to determine whether this is a problem in the first place.

How to divide disassembled C code to functions?

I have an application which creates .text segment dumps of win32 processes. Then it divides the code into basic blocks. A basic block is a sequence of instructions which are always executed one after another (jumps are always the last instructions of such basic blocks). Here is an example:
Basic block 1
mov ecx, dword ptr [ecx]
test ecx, ecx
je 00401013h
Basic block 2
mov eax, dword ptr [ecx]
call dword ptr [eax+08h]
Basic block 3
test eax, eax
je 0040100Ah
Basic block 4
mov edx, dword ptr [eax]
push 00000001h
mov ecx, eax
call dword ptr [edx]
Basic block 5
ret 000008h
Now I would like to group such basic blocks in functions - say which basic blocks form a function. What's the algorithm? I have to remember that there might be many ret instructions inside one function. How to detect fast_call functions?

The simplest algorithm for grouping blocks into functions would be:
note all addresses to which calls are made with call some_address instructions
if the first block after such an address ends with ret, you're done with the function, else
follow the jump in the block to another block and so on until you've followed all possible execution paths (remember about conditional jumps, each of which splits a path into two) and all the paths have finished with ret. You'll need to recognize jumps that organize loops so your program itself does not hang by entering an infinite loop
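The traversal described in these steps could be sketched as follows. The BasicBlock layout, the MAX_BLOCKS limit and the function name are all hypothetical, and the sketch assumes successors have already been resolved to block indices. It does not handle the problem cases listed below.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 64

/* Hypothetical block record: up to two successors (fall-through and jump
 * target) given as block indices, or -1 when there is none. */
typedef struct {
    int succ[2];
    bool ends_with_ret;
} BasicBlock;

/* Marks every block reachable from `entry` in in_function[] and returns
 * how many there are. The visited flags stop loops from being followed
 * forever. Returns 0 on invalid input. */
size_t collect_function(const BasicBlock *blocks, size_t n_blocks,
                        int entry, bool *in_function) {
    int worklist[MAX_BLOCKS];
    size_t top = 0, count = 0;
    if (n_blocks > MAX_BLOCKS || entry < 0 || (size_t)entry >= n_blocks) {
        return 0;
    }
    for (size_t i = 0; i < n_blocks; i++) {
        in_function[i] = false;
    }
    in_function[entry] = true;
    worklist[top++] = entry;
    while (top > 0) {
        int b = worklist[--top];
        count++;
        if (blocks[b].ends_with_ret) {
            continue;              /* this execution path is finished */
        }
        for (int k = 0; k < 2; k++) {
            int s = blocks[b].succ[k];
            if (s >= 0 && (size_t)s < n_blocks && !in_function[s]) {
                in_function[s] = true;   /* each block is queued once */
                worklist[top++] = s;
            }
        }
    }
    return count;
}
```

The entry index would come from the target of a call some_address instruction, one traversal per distinct target.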
Problems:
a number of calls can be made indirectly by reading function pointers from memory, e.g. you'd have call [some_address] instead of call some_address
some indirect calls can be made to calculated addresses
a function whose last action is to call another function may use jmp some_address (a tail call) instead of call some_address immediately followed by ret
call some_address can be simulated with push return_address + jmp some_address, and a plain jump with push some_address + ret
some functions may share code at their end (e.g. they have different entry points, but one or more exit points are the same)
You may use some heuristic to determine where functions start by looking for the most common prolog instruction sequence:
push ebp
mov ebp, esp
Again, this may not work if functions are compiled with the frame pointer suppressed (i.e. they use esp instead of ebp to access their parameters on the stack, which is possible).
The compiler (e.g. MSVC++) may also pad the inter-function space with the int 3 instruction and that too can serve as a hint for an upcoming function beginning.
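A minimal sketch of combining the two hints, assuming the code is available as a raw byte buffer. The function names are made up for illustration. The byte values are the standard x86 encodings: 55 is push ebp, both 8B EC and 89 E5 encode mov ebp, esp, and CC is int 3. Requiring int 3 padding before the prolog is a choice made here to cut down false positives, and it will miss functions that are not padded this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the bytes at code[i] look like "push ebp; mov ebp, esp".
   Both encodings of the mov are accepted: 8B EC and 89 E5. */
static int is_prologue(const uint8_t *code, size_t len, size_t i)
{
    if (i + 3 > len || code[i] != 0x55) return 0;       /* 55 = push ebp */
    return (code[i + 1] == 0x8B && code[i + 2] == 0xEC) ||
           (code[i + 1] == 0x89 && code[i + 2] == 0xE5);
}

/* Writes up to max_out candidate function-start offsets into out[] and
   returns the number found. A candidate is a prolog at offset 0 or one
   directly preceded by an int 3 (0xCC) padding byte. */
size_t find_function_starts(const uint8_t *code, size_t len,
                            size_t *out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i + 3 <= len && n < max_out; i++) {
        if (is_prologue(code, len, i) && (i == 0 || code[i - 1] == 0xCC))
            out[n++] = i;
    }
    return n;
}
```

Because this scans byte by byte rather than instruction by instruction, a match can also fall in the middle of a longer instruction or inside data, so the results are candidates to be confirmed by the call-target analysis, not definitive function starts.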
As for differentiating between the various calling conventions, it's perhaps the easiest to look at the symbols (of course, if you have them). MSVC++ generates different name prefixes and suffixes, e.g.:
_function - cdecl
_function@number - stdcall
@function@number - fastcall
If you cannot extract this information from the symbols, you must analyze code to see how parameters are passed to functions and whether functions or their callers remove them from the stack.
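When symbols are available, the decoration check can be sketched as below. The CallConv enum and classify_symbol are hypothetical names, and the sketch only covers the three 32-bit C decorations listed above (leading underscore, leading @, trailing @number). Anything else, including C++ mangled names, is reported as unknown.

```c
#include <ctype.h>
#include <string.h>

typedef enum { CC_UNKNOWN, CC_CDECL, CC_STDCALL, CC_FASTCALL } CallConv;

/* Classifies a 32-bit MSVC C symbol name by its decoration:
   _name -> cdecl, _name@N -> stdcall, @name@N -> fastcall. */
CallConv classify_symbol(const char *sym)
{
    if (sym == NULL || sym[0] == '\0') return CC_UNKNOWN;

    /* Look for a trailing "@<digits>" suffix after the one-character prefix. */
    const char *at = strrchr(sym + 1, '@');
    int has_size = 0;
    if (at != NULL && at[1] != '\0') {
        has_size = 1;
        for (const char *p = at + 1; *p; p++)
            if (!isdigit((unsigned char)*p)) { has_size = 0; break; }
    }

    if (sym[0] == '@') return has_size ? CC_FASTCALL : CC_UNKNOWN;
    if (sym[0] == '_') return has_size ? CC_STDCALL : CC_CDECL;
    return CC_UNKNOWN;
}
```

The number after the @ is the size in bytes of the arguments, which also tells you how much a stdcall or fastcall callee is expected to pop with ret N.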

You could use the presence of enter to denote the beginning of a function, or certain code which sets up a frame.
push ebp
mov ebp, esp
sub esp, (bytes for "local" stack space)
Later you'll find the opposite code (or leave) just before the ret instruction:
mov esp, ebp
pop ebp
You can also use the number of bytes for local stack space to identify local variables.
Identifying thiscall, fastcall, etc, will take some analysis of the code just prior to calls which use the initial location and an evaluation of the registers used/cleaned up.

Have a look at software like windasm or OllyDbg. The call and ret instructions mark function calls and returns. However, code does not run sequentially, and jumps can be made all over the place. The target of call dword ptr [edx] depends on the edx register, so you won't be able to know where it goes unless you do runtime debugging.
To recognize fastcall functions you have to look at how parameters are passed. Fastcall puts the first two pointer-sized parameters in the ecx and edx registers (in that order), whereas stdcall pushes them on the stack. See this article for an explanation.

stack operations on a basic C program

I'm disassembling this basic C code, trying to figure out what operations
are done on the stack. I'm doing it on a 32-bit VM, with gcc 4.4.3 on an
Ubuntu-based distro. I compiled the code with these flags:
gcc -ggdb -mpreferred-stack-boundary=2 -fno-stack-protector -o ExploitMe ExploitMe.c
#include<stdio.h>
#include<string.h>
main(int argc, char **argv)
{
char buffer[80];
strcpy(buffer, argv[1]);
return 1;
}
The problem is that I cannot figure out why, at offset <+3>, the stack
pointer is moved by 0x58. The char array is 80 bytes long, so shouldn't it be 0x50?
Dump of assembler code for function main:
0x080483e4 <+0>: push %ebp
0x080483e5 <+1>: mov %esp,%ebp
=> 0x080483e7 <+3>: sub $0x58,%esp
0x080483ea <+6>: mov 0xc(%ebp),%eax
0x080483ed <+9>: add $0x4,%eax
0x080483f0 <+12>: mov (%eax),%eax
0x080483f2 <+14>: mov %eax,0x4(%esp)
0x080483f6 <+18>: lea -0x50(%ebp),%eax
0x080483f9 <+21>: mov %eax,(%esp)
0x080483fc <+24>: call 0x804831c <strcpy@plt>
0x08048401 <+29>: mov $0x1,%eax
0x08048406 <+34>: leave
0x08048407 <+35>: ret
End of assembler dump.
I'm stuck on it. I see that it later uses the expected length, but what
is the program doing between those operations?
0x080483f6 <+18>: lea -0x50(%ebp),%eax
Thank you

The compiler is free to arrange the stack however it sees fit.

The other 8 bytes are for the arguments to strcpy. Rather than push them on to the stack, the compiler has realised that it can simply subtract an extra 8 bytes from the stack pointer and then store the registers to memory. This means that the stack pointer only has to be adjusted once.
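The arithmetic can be checked against the constants in the disassembly. The sketch below only restates numbers taken from the dump (0x58 from the sub, 0x50 from the lea) and assumes 4-byte pointers on this 32-bit target. The enum names and outgoing_arg_slots are made up for illustration.

```c
/* Constants read off the disassembly; offsets are relative to ebp. */
enum {
    FRAME_BYTES   = 0x58, /* sub $0x58,%esp */
    BUFFER_OFFSET = 0x50, /* lea -0x50(%ebp),%eax: buffer starts at ebp-0x50 */
    BUFFER_BYTES  = 80,   /* char buffer[80] */
    ARG_BYTES     = 4     /* one 32-bit pointer argument (assumed size) */
};

/* Number of pointer-sized slots between esp and the start of the buffer,
   i.e. the room left over for outgoing call arguments. */
unsigned outgoing_arg_slots(void)
{
    return (FRAME_BYTES - BUFFER_OFFSET) / ARG_BYTES;
}
```

The result is 2, matching the two stores at (%esp) and 0x4(%esp) that set up the destination and source pointers for strcpy.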

It is probably allocating a couple more locations for storing the passed-in parameters (argv, argc), and/or it needs some more local storage. Compilers do whatever they want to implement the high-level code; the same code will produce dozens or hundreds of different assembly language sequences depending on the compiler, version, and optimization settings, as well as the configure/build settings used when the compiler itself was compiled.
You often see this sort of stack frame, though, usually due to a combination of performance and instruction set features/limitations. It is much easier to code and debug if you move the stack pointer once, or make a copy of it in another register, so that everything within the function is referenced to one static point, while the preparing, calling, and cleaning up of function calls messes with the real stack pointer.
You will often also see that the stack frame leaves room for the passed-in parameters and other local variables even if optimization has removed the need for those variables to actually spend any time on the stack. The need for a stack frame and its size are determined up front, optimization comes later, and the compiler doesn't always go back and realize that another pass over the function would let it make the stack frame smaller. Likewise, the compiler writer can debug more easily if they know that the stack frame always starts with the passed-in parameters followed by the local variables in order, which makes the code fast and easy to read and debug. That is just an example.
Bottom line, though, is Oli's answer: the compiler can do whatever it wants so long as it implements your code. My extension to that is that the output from the same high-level code varies widely depending on the compiler and options, and it is rarely perfectly optimized.
