Fast fibers/coroutines under x64 Windows

So I have this coroutine API, extended by me, based on code I found here: https://the8bitpimp.wordpress.com/2014/10/21/coroutines-x64-and-visual-studio/
struct mcontext {
    U64 regs[8];
    U64 stack_pointer;
    U64 return_address;
    U64 coroutine_return_address;
};

struct costate {
    struct mcontext callee;
    struct mcontext caller;
    U32 state;
};
void coprepare(struct costate **token,
void *stack, U64 stack_size, cofunc_t func); /* C code */
void coenter(struct costate *token, void *arg); /* ASM code */
void coyield(struct costate *token); /* ASM code */
int coresume(struct costate *token); /* ASM code, new */
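For context, the intended usage would look roughly like this (a sketch only: it assumes cofunc_t is void (*)(struct costate *, void *), that coenter runs the coroutine until its first coyield, and that coresume returns nonzero until the coroutine finishes):
void count_up(struct costate *token, void *arg) /* hypothetical cofunc */
{
    for (int i = 0; i < 3; i++) {
        /* ... produce the next value for the caller ... */
        coyield(token); /* switch back to whoever resumed us */
    }
}

int main(void)
{
    static char stack[64 * 1024]; /* dedicated coroutine stack */
    struct costate *co;
    coprepare(&co, stack, sizeof stack, count_up);
    coenter(co, 0);       /* run until the first coyield */
    while (coresume(co))  /* resume until count_up returns */
        ;
    return 0;
}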
I'm stuck on implementing coyield(). coyield() can be written in C, but it's the assembly that I'm having problems with. Here's what I got so far (MASM/VC++ syntax).
;;; function: void _yield(struct mcontext *callee, struct mcontext *caller)
;;; arg0(RCX): callee token
;;; arg1(RDX): caller token
_yield proc
    lea RBP, [RCX + 64 * 8]
    mov [RCX +  0], R15
    mov [RCX +  8], R14
    mov [RCX + 16], R13
    mov [RCX + 24], R12
    mov [RCX + 32], RSI
    mov [RCX + 40], RDI
    mov [RCX + 48], RBP
    mov [RCX + 56], RBX
    mov R11, RSP
    mov RSP, [RDX + 64]
    mov [RDX + 64], R11
    mov R15, [RDX +  0]
    mov R14, [RDX +  8]
    mov R13, [RDX + 16]
    mov R12, [RDX + 24]
    mov RSI, [RDX + 32]
    mov RDI, [RDX + 40]
    mov RBP, [RDX + 48]
    mov RBX, [RDX + 56]
    ret
_yield endp
This is a straightforward adaptation of 8bitpimp's code. What it doesn't do, if I understand this code correctly, is put mcontext->return_address and mcontext->coroutine_return_address on the stack to be popped by the ret. Also, is that fast? IIRC, it causes a mismatch on the return branch predictor found in modern x64 processors.

This answer only addresses the "is it fast" part of the question.
Return Address Prediction
First, a brief description of the behavior of a typical return-address predictor.
Every time a call is made, the return address that is pushed on the actual stack is also stored inside a CPU structure called the return address buffer or something like that.
When a ret (return) is made, the CPU assumes the destination will be the address currently at the top of the return address buffer, and that entry from return address buffer is "popped".
The effect is to perfectly1 predict call/ret pairs, as long as they occur in their usual properly nested pattern and that ret is actually removing the unmodified return address pushed by call in each case. For more details you can start here.
Normal function calls in C or C++ (or pretty much any other language) will generally always follow this properly nested pattern2. So you don't need to do anything special to take advantage of the return prediction.
Failure Modes
In cases where call/ret aren't paired up normally, the predictions can fail in (at least) a couple of different ways:
If the stack pointer or the return address on the stack is manipulated so that a ret doesn't return to the place that the corresponding call pushed, you'll get a branch target prediction failure for that ret, but subsequent normally nested ret instructions will continue to predict correctly as long as they are correctly nested. For example, if in a function you add a few bytes to the value at [rsp] in order to skip over the instruction following the call in the calling function, the next ret will mispredict, but the ret that follows inside the calling function should be fine.
On the other hand, if call and ret aren't properly nested, the whole return prediction buffer can become misaligned, causing future ret instructions, if any, that use the existing values to mispredict2.5. For example, if you call into a function, but then use jmp to return to the caller, there is a mismatched call without a ret. The ret inside the caller will mispredict, and so will the ret inside the caller of the caller, and so on, until all misaligned values are used up or overwritten3. A similar case would occur if you had a ret not matched with a corresponding call (and this case is important for the subsequent analysis).
Rather than applying the two rules above, you can also simply determine the behavior of the return predictor by tracing through the code and tracking what the return stack looks like at each point. Every time you have a ret instruction, see if it returns to the current top of the return stack - if not, you'll get a misprediction.
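For example, here is a hand trace of the second failure mode above, for a hypothetical chain where A calls B, B calls C, and C returns with a pop/jmp instead of ret:
/* A: call B      ; return stack afterwards: [retA]
 * B: call C      ; return stack afterwards: [retA, retB]
 * C: pop r11
 *    jmp r11     ; execution is back in B, return stack still [retA, retB]
 * B: ret         ; actual target retA, predicted retB -> mispredict
 * A: ret         ; actual target is A's caller, predicted retA -> mispredict
 */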
Misprediction Cost
The actual cost of a misprediction depends on the surrounding code. A figure of ~20 cycles is commonly given and is often seen in practice, but the actual cost can be lower: e.g., as low as zero if the CPU is able to resolve the misprediction early and start fetching along the new path without interrupting the critical path, or higher: e.g., if the branch prediction failures take a long time to resolve and reduce the effective parallelism of long-latency operations. Regardless, we can say that the penalty is usually significant when it occurs in an operation that otherwise takes only a handful of instructions.
Fast Coroutines
Existing Behavior for Coresume and Coyield
The existing _yield (context switch) function swaps the stack pointer rsp and then uses ret to return to a different location than the one the actual caller pushed (in particular, it returns to the location that was pushed onto the caller stack when the caller called yield earlier). This will generally cause a misprediction at the ret inside _yield.
For example, consider the case where some function A0 makes a normal function call to A1, which in turn calls coresume4 to resume a coroutine B1, which later calls coyield to yield back to A1. Inside the call to coresume, the return stack looks like A0, A1, but then coresume swaps rsp to point to the stack for B1 and the top value of that stack is an address inside B1 immediately following coyield in the code for B1. The ret inside coresume hence jumps to a point in B1, and not to a point in A1 as the return stack expects. Hence you get a misprediction on that ret and the return stack looks like A0.
Now consider what happens when B1 calls coyield, which is implemented in basically the same way as coresume: the call to coyield pushes B1 on the return stack, which now looks like A0, B1, and then swaps the stack to point to the A1 stack and then does the ret, which will return to A1. So the ret misprediction will happen in the same way, and the stack is left as A0.
So the bad news is that a tight series of calls to coresume and coyield (as is typical with a yield-based iterator, for example) will mispredict each time. The good news is that now inside A1 at least the return stack is correct (not misaligned) - if A1 returns to its caller A0, the return is correctly predicted (and so on when A0 returns to its caller, etc). So you suffer a mispredict penalty each time, but at least you don't misalign the return stack in this scenario. The relative importance of this depends on how often you are calling coresume/coyield versus calling functions normally below the function that is calling coresume.
Making It Fast
So can we fix the misprediction? Unfortunately, it's tricky with the combination of C and external ASM calls, because making a call to coresume or coyield implies a call inserted by the compiler, and it's hard to unwind this in the asm.
Still, let's try.
Use Indirect Calls
One approach is to get rid of ret entirely and just use indirect jumps.
That is, just replace the ret at the end of your coresume and coyield calls with:
pop r11
jmp r11
This is functionally equivalent to ret, but affects the return stack buffer differently (in particular, it doesn't affect it).
If we analyze the repeated sequence of coresume and coyield calls as above, we get the result that the return stack buffer just starts growing indefinitely like A0, A1, B1, A1, B1, .... This occurs because in fact we aren't using ret at all in this implementation. So we don't suffer return mispredictions, because we aren't using ret! Instead, we rely on the accuracy of the indirect branch predictor to predict the jmp r11.
How that predictor works depends on how coresume and coyield are implemented. If they both call a shared _yield function that isn't inlined, there is only a single jmp r11 location and this jmp will alternately go to a location in A1 and B1. Most modern indirect predictors will predict this simple repeating pattern fine, although older ones which only tracked a single location will not. If _yield got inlined into coresume and coyield, or you just copy-pasted the code into each function, there are two distinct jmp r11 call sites, each of which only ever sees a single target, and should be well-predicted by any CPU with an indirect branch predictor6.
So this should generally predict a series of tight coyield and coresume calls well7, but at the cost of obliterating the return buffer, so when A1 decides to return to A0 this will be mispredicted as well as subsequent returns by A0 and so on. The size of this penalty is bounded above by the size of the return stack buffer, so if you are making many tight coresume/yield calls this may be a good tradeoff.
That's the best I can think of within the constraint of external calls to functions written in ASM, because you already have an implied call for your coroutines, and you have to make the jump to the other coroutine from inside there, and I can't see how to keep the stacks balanced and return to the correct location with those constraints.
Inlined Code at the Call Site
If you can inline code at the call-site of your coroutine methods (e.g., with compiler support or inline asm), then you can perhaps do better.
The call to coresume could be inlined as something like this (I've omitted the register saving and restoring code because that's straightforward):
; rcx - current context
; rdx - context for coroutine we are about to resume
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [rsp - 8]
mov [rcx + 64], r11 ; save current stack pointer
mov r11, [rdx + 64] ; load dest stack pointer
call [r11]
Note that coresume doesn't actually do the stack swap - it just loads the destination stack into r11 and then does a call against [r11] to jump to the coroutine. This is necessary so that the call correctly pushes the location we should return to onto the stack of the caller.
Then, coyield would look something like (inlined into the calling function):
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [after_ret]
push r11 ; save the return point on the stack
mov rsp, [rdx + 64] ; load the destination stack
ret
after_ret:
mov rsp, r11
When a coresume call jumps to the coroutine it ends up at after_ret, and before executing the user code the mov rsp, r11 instruction swaps to the proper stack for the coroutine which has been stashed in r11 by coresume.
So essentially coyield has two parts: the top half, executed before the yield (which occurs at the ret), and the bottom half, which completes the work started by coresume. This allows you to use call as the mechanism to do the coresume jump and ret to do the coyield jump. The call/ret pairs are balanced in this case.
I've glossed over some details of this approach: for example, since there is no function call involved, the ABI-specified non-volatile registers aren't really special: in the case of inline assembly you'll need to indicate to the compiler which registers you will clobber and save the rest, but you can choose whatever set is convenient for you. Choosing a larger set of clobbered registers makes the coresume/coyield code sequences themselves shorter, but potentially puts more register pressure on the surrounding code and may force the compiler to spill more in the surrounding code. Perhaps the ideal is just to declare everything clobbered and then the compiler will just spill what it needs.
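As a sketch of that clobber-everything idea, using GCC-style inline asm (hypothetical, since MSVC has no inline asm for x64; the actual top/bottom halves from above are elided, and rbp/rsp cannot appear in a clobber list):
static inline void coyield_inlined(void)
{
    __asm__ volatile(
        /* the top and bottom halves of coyield (shown above) go here */
        ""
        : /* no outputs */
        : /* no inputs; a real version needs the context pointer as an
             operand, which then can't also be in the clobber list */
        : "rax", "rbx", "rcx", "rdx", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
          "memory", "cc"); /* compiler spills only what is actually live */
}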
1 Of course, there are limitations in practice: the size of the return stack buffer is likely limited to some small number (e.g., 16 or 24) so once the depth of the call stack exceeds that, some return addresses are lost and won't be correctly predicted. Also, various events like a context switch or interrupt are likely to mess up the return-stack predictor.
2 An interesting exception was a common pattern for reading the current instruction pointer in x86 (32-bit) code: there is no instruction to do this directly, so instead a call next; next: pop eax sequence can be used: a call to the next instruction, which serves only to push the address on the stack, which is then popped off. There is no corresponding ret. Current CPUs actually recognize this pattern, however, and don't unbalance the return-address predictor in this special case.
2.5 How many mispredictions this implies depends on how many net returns the calling function does: if it immediately starts calling down another deep chain of calls, the misaligned return stack entries may never be used at all, for example.
3 Or, perhaps, until the return address stack is re-aligned by a ret without a corresponding call, a case of "two wrongs make a right".
4 You haven't actually shown how coyield and coresume actually call _yield, so for the rest of the answer I'll assume that they are implemented essentially as _yield is, directly within coyield or coresume without calling _yield: i.e., copy and paste the _yield code into each function, possibly with some small edits to account for the difference. You can also make this work by calling _yield, but then you have an additional layer of calls and rets that complicates the analysis.
5 To the extent these terms even make sense in a symmetric coroutine implementation, since there is in fact no absolute notion of caller and callee in that case.
6 Of course, this analysis applies only to the simple case that you have a single coresume call calling into a coroutine with a single coyield call. More complex scenarios are possible, such as multiple coyield calls inside the callee, or multiple coresume calls inside the caller (possibly to different coroutines). However, the same pattern applies: the case with split jmp r11 sites will present a simpler stream than the combined case (possibly at the cost of more iBTB resources).
7 One exception would be the first call or two: the ret predictor needs no "warmup" but the indirect branch predictor may, especially when another coroutine has been called in the interim.

Related

Segfault pushing to stack in C inline assembly

I am having an issue with some inline assembly. I am writing a compiler that compiles to assembly, and for portability I made it add the main function in C and just use inline assembly. However, even the simplest inline assembly is giving me a segfault. Thanks for your help.
int main(int argc, char** argv) {
__asm__(
"push $1\n"
);
return 0;
}
TLDR at bottom. Note: everything here is assuming x86_64.
The issue here is that compilers will effectively never use push or pop in a function body (except for prologues/epilogues).
Consider this example.
When the function begins, room is made on the stack in the prologue with:
push rbp
mov rbp, rsp
sub rsp, 32
This creates 32 bytes of room for main. Then notice how throughout the function, instead of pushing items to the stack, they are mov'd to the stack through offsets from rbp:
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-32], rsi
mov DWORD PTR [rbp-4], 2
mov DWORD PTR [rbp-8], 5
The reason for this is it allows for variables to be stored anywhere at anytime, and loaded from anywhere at anytime without requiring a huge amount of push/pops.
Consider the case where variables are stored using push and pop. Say a variable is stored early on in the function, let's call this foo. 8 variables on the stack later, you need foo, how should you access it?
Well, you can pop everything until foo, and then push everything back, but that's costly.
It also doesn't work when you have conditional statements. Say a variable is only ever stored if foo is some certain value. Now you have a conditional where the stack pointer could be at one of two locations after it!
For this reason, compilers always prefer to use rbp - N to store variables, as at any point in the function, the variable will still live at rbp - N.
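A tiny (hypothetical) illustration of that problem:
void example(int cond)
{
    int foo = 1;     /* imagine this were pushed onto the stack */
    if (cond) {
        int bar = 2; /* pushed only on this path */
        (void)bar;
    }
    /* With push-based storage, foo would now be at a different distance
       from the stack pointer depending on cond; at rbp - N it is always
       in the same place. */
    (void)foo;
}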
NB: On different ABIs (such as i386 System V), arguments may be passed on the stack, but this isn't too much of an issue, as ABIs will generally specify how this should be handled. Again, using i386 System V as an example, a function call will go something like:
push edi ; 2nd argument to the function.
push eax ; 1st argument to the function.
call my_func
; here, it can be assumed that the stack has been corrected
So, why does push actually cause an issue?
Well, I'll add a small asm snippet to the code
At the end of the function, we now have the following:
push 64
mov eax, 0
leave
ret
There's 2 things that fail now due to pushing to the stack.
The first is the leave instruction (see this thread)
The leave instruction will attempt to pop the value of rbp that was stored at the beginning of the function (notice the only push that the compiler generates is at the start: push rbp).
This is so that the stack frame of the caller is preserved following main. By pushing to the stack, in our case rbp is now going to be set to 64, since the last value pushed is 64. When the caller of main resumes its execution and tries to access a value at, say, rbp - 8, a crash will occur, as rbp - 8 is 0x38 in hex, which is an invalid address.
But that assumes the caller even gets execution back!
After rbp has been restored with the invalid value, the next thing on the stack will be the original value of rbp.
The ret instruction will pop a value from the stack, and return to that address...
Notice how this might be slightly problematic?
The CPU is going to try and jump to the value of rbp stored at the start of the function!
On nearly every modern program, the stack is a "no execute" zone (see here), and attempting to execute code from there will immediately cause a crash.
So, TLDR: Pushing to the stack violates assumptions made by the compiler, most importantly about the return address of the function. This violation causes program execution to end up on the stack (generally), which will cause a crash.
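If you really must push in inline assembly, one sketch of a way to avoid violating those assumptions is to restore the stack pointer before the asm statement ends (GCC/Clang syntax assumed; note that on ABIs with a red zone, such as x86-64 System V, even a balanced push can clobber data the compiler keeps below rsp):
int main(void)
{
    long v;
    __asm__("push $1\n\t"
            "pop %0" /* rsp is back where the compiler left it */
            : "=r"(v));
    return (int)v - 1; /* returns 0 */
}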

How to set function arguments in assembly during runtime in a 64bit application on Windows?

I am trying to set arguments using assembly code that are used in a generic function. The arguments of this generic function - which is resident in a DLL - are not known at compile time. At runtime the pointer to this function is determined using the GetProcAddress function, but its arguments are not known. At runtime I can determine the arguments - both value and type - using a datafile (not a header file or anything that can be included or compiled). I have found a good example of how to solve this problem for 32 bit (C Pass arguments as void-pointer-list to imported function from LoadLibrary()), but for 64 bit this example does not work, because you cannot just fill the stack: you have to fill the registers. So I tried to use assembly code to fill the registers, but so far without success. I use C code to call the assembly code. I use VS2015 and MASM (64 bit). The C code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.
C code:
...
void fill_register_xmm0(double); // proto of assembly function
...
// code determining the pointer to a func returned by the GetProcAddress()
...
double dVal = 12.0;
int v;
fill_register_xmm0(dVal);
v = func->func_i(); // integer function that will use the dVal
...
assembly code in different .asm file (MASM syntax):
TITLE fill_register_xmm0
.code
option prologue:none ; turn off default prologue creation
option epilogue:none ; turn off default epilogue creation
fill_register_xmm0 PROC variable: REAL8 ; REAL8=equivalent to double or float64
movsd xmm0, variable ; fill value of variable into xmm0
ret
fill_register_xmm0 ENDP
option prologue:PrologueDef ; turn on default prologue creation
option epilogue:EpilogueDef ; turn on default epilogue creation
END
The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.
You definitely need to make the function call from asm; It can't possibly work reliably to make a bunch of function calls like fill_register_xmm0 and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use the registers, as part of its normal job, including passing args to functions like fill_register_xmm0.
The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if() blocks might be good.
And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0, because the first function arg is passed in XMM0 if it's FP.
In C, prepare a buffer with the args (like in the 32-bit case).
Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is integer, it goes in R9, even if it's the first integer arg. Unlike in the x86-64 System V calling convention, where the first integer arg goes in rdi, regardless of how many earlier FP args are in registers and/or on the stack.
So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
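To illustrate the slot rule (following Microsoft's documentation):
void f(double a, int b, double c, int d);
/* a -> XMM0 (slot 0)   b -> RDX (slot 1)
   c -> XMM2 (slot 2)   d -> R9  (slot 3)
   RCX, XMM1, R8 and XMM3 go unused; a 5th arg of either kind would go
   on the stack, above the 32-byte shadow space. */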
The C side that puts all the args into a buffer can look like this:
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

int asmwrapper(const char *argbuf, size_t arg_bytes, void (*funcpointer)());

void somefunc() {
    alignas(16) uint64_t argbuf[256/8];  // or char argbuf[256]. Even without alignas, uint64_t still gives 8-byte alignment
    char *argp = (char*)argbuf;
    char *argend = (char*)argbuf + sizeof(argbuf);
    for( ; argp < argend ; argp += 8) {
        if (figure_out_an_arg()) {
            int foo = get_int_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else if (bar) {
            double foo = get_double_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else {
            // ... memcpy whatever size
            // or allocate space to pass by ref and memcpy a pointer
        }
    }
    if (argp == argend) {
        // error, ran out of space for args
    }
    asmwrapper((char*)argbuf, (size_t)(argp - (char*)argbuf), funcpointer);
}
Unfortunately I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. We have no way of stopping the compiler from putting something valuable below argbuf which would let us just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf by reserving some space for use by the asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.
argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. It's not like reading past the end of it can be a problem, it can't be at the end of a page with unmapped memory later, because our parent function's stack frame definitely takes some space.
;; NASM syntax. For MASM just rename the local labels and add whatever PROC / ENDPROC is needed.
;; UNTESTED
;; rcx: argbuf
;; rdx: length in bytes of the args. 0..256, zero-extended to 64 bits
;; r8 : function pointer
;; reserve rdx bytes of space for arg passing
;; load first 32 bytes of argbuf into integer and FP arg-passing registers
;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
push rbp
mov rbp, rsp ; so we can efficiently restore the stack later
mov r10, r8 ; move function pointer to a volatile but non-arg-passing register
; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
; regardless of types or whether there were that many arg bytes
; All bytes are loaded into registers early, some reg->reg transfers are done later
; when we're done with more registers.
; movsd xmm0, [rcx]
; movsd xmm1, [rcx+8]
movaps xmm0, [rcx] ; 16-byte alignment required for argbuf. Use movups to allow misalignment if you want
movhlps xmm1, xmm0 ; use some ALU instructions instead of just loads
; rcx,rdx can't be set yet, still in use for wrapper args
movaps xmm2, [rcx+16] ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
;movhlps xmm3, xmm2 ; the copyloop uses xmm3: do this later
movq r8, xmm2
mov r9, [rcx+24]
mov eax, 32
cmp edx, eax
jbe .small_args ; no copying needed, just shadow space
sub rsp, rdx
and rsp, -16 ; reserve extra space, realigning the stack by 16
; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
.copyloop: ; do {
movaps xmm3, [rcx+rax]
movaps [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
add eax, 16
cmp eax, edx
jb .copyloop ; } while(bytes_copied < arg_bytes);
.done_arg_copying:
; xmm0,xmm1 have the first 2 qwords of args
movq rcx, xmm0 ; RCX NO LONGER POINTS AT argbuf
movq rdx, xmm1
; xmm2 still has the 2nd 16 bytes of args
;movhlps xmm3, xmm2 ; don't use: false dependency on old value and we just used it.
pshufd xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2. (0xee = _MM_SHUFFLE(3,2,3,2))
; movq xmm3, r9 ; nah, can be multiple uops on AMD
; r8,r9 set earlier
call r10
leave ; restore RSP to its value on entry
ret
; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
sub rsp, rax ; reserve shadow space
;rsp still aligned by 16 after push rbp
jmp .done_arg_copying
;byte count. This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of movq rcx, xmm0
;mov eax, $-asmwrapper
align 16
This does assemble (on Godbolt with NASM), but I haven't tested it.
It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp/cmovb in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.
So you need to call a function (in a DLL), but only at run-time can you figure out the number and types of the parameters. Then you need to prepare the parameters, either on the stack or in registers, depending on the Application Binary Interface/calling convention.
I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...
You then pass this list to a function to prepare the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters onto the stack. For a register-based ABI, it can prepare the register values and save them as local variables (sub sp,nnn) and, once all parameters have been prepared (possibly using registers needed for the call, hence first saving them), loads the registers (a series of mov instructions) and performs the call instruction.
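A minimal sketch of what that {type, value} list might look like (all names hypothetical):
enum argtype { ARG_INT64, ARG_DOUBLE, ARG_PTR };

struct argdesc {
    enum argtype type;
    union { long long i; double d; void *p; } value;
};

/* The assembler routine would walk an array of struct argdesc, placing
   the first entries in registers as the ABI requires and the rest in
   stack slots, then issue the call instruction. */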

What happens to a function called in C from the time being called to the time it returns?

Whenever I read about program execution in C, it says very little about function execution. I am still trying to find out what happens to a function from the time it is called from another function to the time it returns. How do the function arguments get stored in memory?
That's unspecified; it's up to the implementation. As pointed out by Keith Thompson, it doesn't even have to tell you how it works. :)
Some implementations will put all the arguments on the stack, some will use registers, and many use a mix (the first n arguments passed in registers, any more and they go on the stack).
But the function itself is just code, it's read-only and nothing much "happens" to it during execution.
There is no one correct answer to this question; it depends heavily upon what the compiler writer determines is the best model. There are various bits in the standard that describe this process, but most of it is implementation defined. Also, the process is dependent on the architecture of the system, the OS you're aiming for, the level of optimisation and so forth.
Take the following code:-
int DoProduct (int a, int b, int c)
{
return a * b * c;
}
int result = DoProduct (4, 5, 6);
The MSVC2005 compiler, using standard debug build options, created this for the last line of the above code:-
push 6
push 5
push 4
call DoProduct (411186h)
add esp,0Ch
mov dword ptr [ebp-18h],eax
Here, the arguments are pushed onto the stack, starting with the last argument, then the penultimate argument, and so on until the first argument is pushed onto the stack. The function is called, then the arguments are removed from the stack (the add esp,0Ch) and the return value is saved - the result is stored in the eax register.
Here's the code for the function:-
push ebp
mov ebp,esp
sub esp,0C0h
push ebx
push esi
push edi
lea edi,[ebp-0C0h]
mov ecx,30h
mov eax,0CCCCCCCCh
rep stos dword ptr es:[edi]
mov eax,dword ptr [a]
imul eax,dword ptr [b]
imul eax,dword ptr [c]
pop edi
pop esi
pop ebx
mov esp,ebp
pop ebp
ret
The first thing the function does is to create a local stack frame. This involves creating a space on the stack to store local and temporary variables in. In this case, 192 (0xc0) bytes are reserved (the first three instructions). The reason it's so many is to allow the edit-and-continue feature some space to put new variables into.
The next three instructions save the reserved registers as defined by the MS compiler. Then the stack frame space just created is initialised to contain a special debug signature, in this case 0xCC. This marks uninitialised memory: if you ever see a value consisting of just 0xCCs in debug mode then you've forgotten to initialise the value (unless 0xCC was the intended value).
Once all that housekeeping has been done, the next three instructions implement the body of the function, the two multiplies. After that, the reserved registers are restored and then the stack frame destroyed and finally the function ends with a ret. Fortunately, the imul puts the result of the multiplication into the eax register so there's no special code to get the result into the right register.
Now, you've probably been thinking that there's a lot there that isn't really necessary. And you're right, but debug is about getting the code right and a lot of the above helps to achieve that. In release, there's a lot that can be got rid of. There's no need for a stack frame, no need, therefore, to initialise it. There's no need to save the reserved registers as they aren't modified. In fact, the compiler creates this:-
mov eax,dword ptr [esp+4]
imul eax,dword ptr [esp+8]
imul eax,dword ptr [esp+0Ch]
ret
which, if I'd let the compiler do it, would have been in-lined into the caller.
There's a lot more stuff that can happen: values passed in registers and so on. Also, I've not got into how floating point values and structures / classes are passed to and from functions. And there's more that I've probably left out.

Inline assembly that clobbers the red zone

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void); /* defined elsewhere */

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell if global will be true, so it can't optimize away the call to other() so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in, that becomes subq $160, %rsp. As a leaf, the function kept most of the array in the 128-byte red zone below %rsp instead of reserving stack space for all of it.
The max-performance way might be to write the whole inner loop in asm, including the call instructions if it's really worth it to unroll but not inline (certainly plausible if fully inlining causes too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
for (long i = 0 ; i < count ; i++) {
asm(" # XXX asm operand in %0"
: "+r" (p[i])
:
: // "rax",
"rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
"r8", "r9", "r10", "r11", "r12","r13","r14","r15"
);
}
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
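A sketch of that wrapper pattern with GNU C inline asm (the clobber list here is a guess; it should name whatever your custom convention actually clobbers):
void cryptofunc(long); /* the custom-convention routine from the question */

static inline void call_cryptofunc_outside_loop(long x)
{
    __asm__ volatile(
        "add $-128, %%rsp \n\t" /* skip past the red zone; -128 fits in an imm8 */
        "call cryptofunc  \n\t"
        "sub $-128, %%rsp"      /* restore rsp; +128 would need an imm32 */
        : : "D"(x) /* arg in rdi here; adjust for the custom convention */
        : "rax", "rcx", "rdx", "rsi",
          "r8", "r9", "r10", "r11", "memory", "cc");
}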
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
//cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
volatile int tmp = 1;
(void)tmp;
cryptofunc(1); // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds; it only lets you set -f options or -O levels.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub $128, %rsp ; call ... ; add $128, %rsp).
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?

Help deciphering simple Assembly Code

I am learning assembly using GDB & Eclipse
Here is a simple C code.
int absdiff(int x, int y)
{
if(x < y)
return y-x;
else
return x-y;
}
int main(void) {
int x = 10;
int y = 15;
absdiff(x,y);
return EXIT_SUCCESS;
}
Here is corresponding assembly instructions for main()
main:
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the "x(10)" to 4 bytes below the frame pointer (why not push?)
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the "y(15)" to 8 bytes below the frame pointer (why not push?)
28 absdiff(x,y);
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
080483dc: call 0x8048394 <absdiff>
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
32 }
Basically, I am asking to help me to make sense of this assembly code, and why it is doing things in this particular order. Point where I am stuck, is shown in assembly comments. Thanks !
Lines 0x080483cf to 0x080483d9 are copying x and y from the current frame on the stack, and pushing them back onto the stack as arguments for absdiff() (this is typical; see e.g. http://en.wikipedia.org/wiki/X86_calling_conventions#cdecl). If you look at the disassembler for absdiff() (starting at 0x8048394), I bet you'll see it pick these values up from the stack and use them.
This might seem like a waste of cycles in this instance, but that's probably because you've compiled without optimisation, so the compiler does literally what you asked for. If you use e.g. -O2, you'll probably see most of this code disappear.
First it bears saying that this assembly is in the AT&T syntax version of x86_32, and that the order of arguments to operations is reversed from the Intel syntax (used with MASM, YASM, and many other assemblers and debuggers).
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
This enters a stack frame. A frame is an area of memory between the stack pointer (esp) and the base pointer (ebp). This area is intended to be used for local variables that have to live on the stack. NOTE: Stack frames don't have to be implemented in this way, and GCC has the optimization switch -fomit-frame-pointer that does away with it except when alloca or variable sized arrays are used, because they are implemented by changing the stack pointer by arbitrary values. Not using ebp as the frame pointer allows it to be used as an extra general purpose register (more general purpose registers is usually good).
Using the base pointer makes several things simpler to calculate for compilers and debuggers, since where variables are located relative to the base does not change while in the function, but you can also index them relative to the stack pointer and get the same results, though the stack pointer does tend to change around so the same location may require a different index at different times.
In this code 0x18 (or 24) bytes are being reserved on the stack for local use.
This code so far is often called the function prologue (not to be confused with the programming language "prolog").
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the "x(10)" to 4 address below frame pointer (why not push?)
This line moves the constant 10 (0xA) to a location within the current stack frame relative to the base pointer. Because the local variables sit below the base pointer and the stack grows downward in RAM, the index is negative rather than positive. If this were indexed relative to the stack pointer, a different index would be used, but it would be positive.
You are correct that this value could have been pushed rather than copied like this. I suspect that this is done this way because you have not compiled with optimizations turned on. By default gcc (which I assume you are using based on your use of gdb) does not optimize much, and so this code is probably the default "copy a constant to a location in the stack frame" code. This may not be the case, but it is one possible explanation.
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the "y(15)" to 8 address below frame pointer (why not push?)
Similar to the previous line of code. These two lines of code put the 10 and 15 into local variables. They are on the stack (rather than in registers) because this is unoptimized code.
28 absdiff(x,y);
gdb printing this meant that this is the source code line being executed, not that this function is being executed (yet).
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
In preparation for calling the function the values that are being passed as arguments need to be retrieved from their storage locations (even though they were just placed at those locations and their values are known because of the no optimization thing)
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
This is the second part of the move to the stack of one of the local variables' values, so that it can be used as an argument to the function. You can't (usually) move from one memory address to another on x86, so you have to move it through a register (eax in this case).
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
These two lines do the same thing except for the other variable. Note that since this variable is being moved to the top of the stack that no offset is being used in the second instruction.
080483dc: call 0x8048394 <absdiff>
This pushes the return address onto the top of the stack and jumps to the address of absdiff.
You didn't include code for absdiff, so you probably did not step through that.
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
C programs return 0 upon success, so EXIT_SUCCESS was defined as 0 by someone. Integer return values are put in eax, and some code that called the main function will use that value as the argument when calling the exit function.
32 }
This is the end. The reason that gdb stopped here is that there are things that actually happen to clean up. In C++ it is common to see destructors for local class instances being called here, but in C you will probably just see the function epilogue. This is the complement to the function prologue, and consists of returning the stack pointer and base pointer to the values they originally had. Sometimes this is done with similar math on them, but sometimes it is done with the leave instruction. There is also an enter instruction which can be used for the prologue, but gcc doesn't use it (I don't know why). If you had continued to view the disassembly here you would have seen the epilogue code and a ret instruction.
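For reference, the two equivalent epilogue forms in AT&T syntax:
/* explicit epilogue           using leave
 *   mov %ebp, %esp              leave   ; = mov %ebp,%esp then pop %ebp
 *   pop %ebp
 *   ret                         ret
 */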
Something you may be interested in is the ability to tell gcc to produce assembly files. If you do:
gcc -S source_file.c
a file named source_file.s will be produced with assembly code in it.
If you do:
gcc -S -O source_file.c
Then the same thing will happen, but some basic optimizations will be done. This will probably make reading the assembly code easier since the code will not likely have as many odd instructions that seem like they could have been done a better way (like moving constant values to the stack, then to a register, then to another location on the stack and never using the push instruction).
Your regular optimization flags for gcc are:
-O0 default -- none
-O1 a few optimizations
-O the same as -O1
-O2 a lot of optimizations
-O3 a bunch more, some of which may take a long time and/or make the code a lot bigger
-Os optimize for size -- similar to -O2, but not quite
If you are actually trying to debug C programs then you will probably want the least optimizations possible since things will happen in the order that they are written in your code and variables won't disappear.
You should have a look at the gcc man page:
man gcc
Remember, if you're running in a debugger or debug mode, the compiler reserves the right to insert whatever debugging code it likes and make other nonsensical code changes.
For example, this is Visual Studio's debug main():
int main(void) {
001F13D0 push ebp
001F13D1 mov ebp,esp
001F13D3 sub esp,0D8h
001F13D9 push ebx
001F13DA push esi
001F13DB push edi
001F13DC lea edi,[ebp-0D8h]
001F13E2 mov ecx,36h
001F13E7 mov eax,0CCCCCCCCh
001F13EC rep stos dword ptr es:[edi]
int x = 10;
001F13EE mov dword ptr [x],0Ah
int y = 15;
001F13F5 mov dword ptr [y],0Fh
absdiff(x,y);
001F13FC mov eax,dword ptr [y]
001F13FF push eax
001F1400 mov ecx,dword ptr [x]
001F1403 push ecx
001F1404 call absdiff (1F10A0h)
001F1409 add esp,8
*(int*)nullptr = 5;
001F140C mov dword ptr ds:[0],5
return 0;
001F1416 xor eax,eax
}
001F1418 pop edi
001F1419 pop esi
001F141A pop ebx
001F141B add esp,0D8h
001F1421 cmp ebp,esp
001F1423 call #ILT+300(__RTC_CheckEsp) (1F1131h)
001F1428 mov esp,ebp
001F142A pop ebp
001F142B ret
It helpfully posts the C++ source next to the corresponding assembly. In this case, you can fairly clearly see that x and y are stored on the stack explicitly, and an explicit copy is pushed on, then absdiff is called. I explicitly de-referenced nullptr to cause the debugger to break in. You may wish to change compiler.
Compile with -fverbose-asm -g -save-temps for additional information with GCC.
