I'm trying to put some C hooks into some code that someone else wrote in asm. I don't know much about x86 asm and I want to make sure I'm doing this safely.
So far I've got this much:
EXTERN _C_func
CALL _C_func
And that seems to work, but I'm sure that's dangerous as-is. At the very least it could be destroying EAX, and possibly more.
So I think all I need to know is: Which registers does C destroy and how do I ensure the registers are being saved properly? Is there anything else I can do to make sure I can safely insert hooks into arbitrary locations?
I'm using Visual Studio for the C functions, if that helps.
Thank you.
UPDATE:
I looked up the "paranoid" method several people suggested and it looks like this:
pushfd
pushad
call _C_func
popad
popfd
AD = (A)ll general registers
FD = (F)lags
(I'm not sure what the 'd' stands for, but it means 32-bit registers instead of 16-bit.)
But the most efficient method is illustrated in Kragen's answer below. (Assuming you don't need the preserve the flags. If you do, add the FD instructions.)
Also, be careful about allocating lots of stack variables in your C function as it could overrun the asm subroutine's stack space.
I think that about does it, thanks for you help everyone!
If the calling convention is cdecl and the signature of C_funct is simply:
void C_func(void);
Then that is perfectly "safe", however the registers EAX, ECX and EDX are "available for use inside the function", and so may be overwritten.
To protect against this you can save the registers that you care about and restore them afterwards:
push eax
push ecx
push edx
call _C_func
pop edx
pop ecx
pop eax
By convention the registers EBX, ESI, EDI, and EBP shouldn't be modified by the callee.
I believe that the flags may be modified by the callee - again if you care about preserving the value of flags then you should save them.
See the Wikipedia page on x86 calling conventions.
You need to be aware of calling conventions. See this article for a discussion on calling conventions.
If you are paranoid about registers, you can always push all registers onto the stack before calling the function (functions shouldn't return without cleaning up, except for those registers that carry return information).
This will depend on whether you are using an x86 or x86_64 platform ... the calling conventions for each, as well as the caller-save and callee-save registers (and even the available register set) are a bit different.
That being said, caller-save registers need to be pushed on the stack before you call your C-function. Callee-save registers you don't need to worry about if you are calling a C-function, but you will need to pay attention to if you call an assembly routine from a C-function.
For a x86 32-bit platform, the caller-save registers are:
EAX
EDX
ECX
Also keep in mind that the stack-pointer registers, ESP and EBP will be changed as well with cdecl calling convention, but a C-function will restore those values after the call assuming you have not done something that a C-function would not expect you to-do with those registers.
The callee-save registers are:
EBX
ESI
EDI
Finally, the return value from your C-function is in the EAX register, and the C-function arguments are pushed onto the stack before calling the function. The order in which that is done will depend on the calling convention, but for cdecl, that will be in right-to-left order, meaning the left-most argument to the C-function is pushed last on the stack before calling the function.
Should you decide to simply save all the general purpose registers on the stack before you call your C-function, and then pop them back off the stack after the C-function, you can do that on x86 using the PUSHAD and POPAD instructions. These instructions are not useable on x86_64 though. And also keep in mind that if you have a return value in EAX, you will need to-do something with that (like saving it in memory), before you call POPAD.
Related
What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax
The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps doing and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that the only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.
Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just an infinite loop, your function will eventually go beyond its memory constraints since a call command writes the current rip (instruction pointer) to the stack. Eventually, you'll overflow the stack.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
Let's consider a situation where we are writing in C code. When the compiler encounters a function call, my understanding is that it does the following:
Push all registers onto the stack
Jump to new function, do stuff in there
Pop old context off the stack back into the registers.
Now, some processors have 1 working register, some 32, some more than that. I'm mostly concerned with the larger number of registers. If my processor has 32 registers, the compiler will need to emit 32 push and pop instructions, just as base overhead for a function call. It would be nice if I could trade some compilation flexibility[1] in the function for less push and pop instructions. That is to say, I would like a way that I could tell the compiler "For function foo(), only use 4 registers. This would imply that the compiler would only need to push/pop 4 registers before jumping to foo().
I realize this is pretty silly to worry about on a modern PC, but I am thinking more for a low speed embedded system where you might be servicing an interrupt very quickly, or calling a simple function over and over. I also realize this could very quickly become an architecture dependant feature. Processors that use a "Source Source -> Dest" instruction set (Like ARM), as opposed to an accumulator (Like Freescale/NXP HC08) might have some lower limit on the number of registers we allow functions to use.
I do know the compiler uses tricks like inlining small functions to increase speed, and I realize I could inform most compilers to not generate the push/pop code and just hand code it myself in assembly, but my question focuses on instructing the compiler to do this from "C-Land".
My question is, are there compilers that allow this? Is this even necessary with optimizing compilers (do they already do this)?
[1] Compilation flexibility: By reducing the number of registers available to the compiler to use in a function body, you are restricting it's flexibility, and it might need to utilize the stack more since it can't just use another register.
When it comes to compilers, registers and function calls you can generally think of the registers falling into one of three categories: "hands off", volatile and non-volatile.
The "hands off" category are those that the compiler will not generally be futzing around with unless you explicitly tell it to (such as with inline assembly). These may include debugging registers and other special purpose registers. The list will vary from platform to platform.
The volatile (or scratch / call-clobbered / caller-saved) set of registers are those that a function can futz around with without the need for saving. That is, the caller understands that the contents of those registers might not be the same after the function call. Thus, if the caller has any data in those registers that it wants to keep, it must save that data before making the call and then restore it after. On a 32-bit x86 platform, these volatile registers (sometimes called scratch registers) are usually EAX, ECX and EDX.
The non-volatile (or call-preserved or callee-saved) set of registers are those that a function must save before using them and restore to their original values before returning. They only need to be saved/restored by the called function if it uses them. On a 32-bit x86 platform, these are usually the remaining general purpose registers: EBX, ESI, EDI, ESP, EBP.
Hope this helps.
(I meant to just add a small example, but quickly got carried away. I would add my own answer if this question wasn't closed, but I'm going to leave this long section here because I think it's interesting. Condense it or edit it out entirely if you don't want it in your answer -- Peter)
For a more concrete example, the SysV x86-64 ABI is well-designed (with args passed in registers, and a good balance of call-preserved vs. scratch/arg regs). There are some other links in the x86 tag wiki explaining what ABIs / calling conventions are all about.
Consider a simple example of with function calls that can't be inlined (because the definition isn't available):
int foo(int);
int bar(int a) {
return 5 * foo(a+2) + foo (a) ;
}
It compiles (on godbolt with gcc 5.3 for x86-64 with -O3 to the following:
## gcc output
# AMD64 SysV ABI: first arg in e/rdi, return value in e/rax
# the call-preserved regs used are: rbp and rbx
# the scratch regs used are: rdx. (arg-passing / return regs are not call-preserved)
push rbp # save a call-preserved reg
mov ebp, edi # stash `a` in a call-preserved reg
push rbx # save another call-preserved reg
lea edi, [rdi+2] # edi=a+2 as an arg for foo. `add edi, 2` would also work, but they're both 3 bytes and little perf difference
sub rsp, 8 # align the stack to a 16B boundary (the two pushes are 8B each, and call pushes an 8B return address, so another 8B is needed)
call foo # eax=foo(a+2)
mov edi, ebp # edi=a as an arg for foo
mov ebx, eax # stash foo(a+2) in ebx
call foo # eax=foo(a)
lea edx, [rbx+rbx*4] # edx = 5*foo(a+2), using the call-preserved register
add rsp, 8 # undo the stack offset
add eax, edx # the add between the to function-call results
pop rbx # restore the call-preserved regs we saved earlier
pop rbp
ret # return value in eax
As usual, compilers could do better: instead of stashing foo(a+2) in ebx to survive the 2nd call to foo, it could have stashed 5*foo(a+2) with a single instruction (lea ebx, [rax+rax*4]). Also, only one call-preserved register is needed, since we don't need a after the 2nd call. This removes a push/pop pair, and also the sub rsp,8 / add rsp,8 pair. (gcc bug report already filed for this missed optimization)
## Hand-optimized implementation (still ABI-compliant):
push rbx # save a call-preserved reg; also aligns the stack
lea ebx, [rdi+2] # stash ebx=a+2
call foo # eax=foo(a)
mov edi, ebx # edi=a+2 as an arg for foo
mov ebx, eax # stash foo(a) in ebx, replacing `a+2` which we don't need anymore
call foo # eax=foo(a+2)
lea eax, [rax+rax*4] #eax=5*foo(a+2)
add eax, ebx # eax=5*foo(a+2) + foo(a)
pop rbx # restore the call-preserved regs we saved earlier
ret # return value in eax
Note that the call to foo(a) happens before foo(a+2) in this version. It saved an instruction at the start (since we can pass on our arg unchanged to the first call to foo), but removed a potential saving later (since the multiply-by-5 now has to happen after the second call, and can't be combined with moving into the call-preserved register).
I could get rid of an extra mov if it was 5*foo(a) + foo(a+2). With the expression as I wrote it, I can't combine arithmetic with data movement (using lea) in every case. Or I'd need to both save a and do a separate add edi,2 before the first call.
Push all registers onto the stack
No. In the vast majority of function calls in optimized code, only a small fraction of all registers are pushed on the stack.
I'm mostly concerned with the larger number of registers.
Do you have any experimental evidence to support this concern? Is this a performance bottleneck?
I could trade some compilation flexibility[1] in the function for less
push and pop instructions.
Modern compilers use sophisticated inter-procedural register allocation. By limiting the number of registers, you will most likely degrade performance.
I realize this is pretty silly to worry about on a modern PC, but I am
thinking more for a low speed embedded system where you might be
servicing an interrupt very quickly, or calling a simple function over
and over.
This is very vague. You have to show the "simple" function, all call sites and specify the compiler and the target embedded system. You need to measure performance (compared to hand-written assembly code) to determine whether this is a problem in the first place.
I've written a simple assembly program:
section .data
str_out db "%d ",10,0
section .text
extern printf
extern exit
global main
main:
MOV EDX, ESP
MOV EAX, EDX
PUSH EAX
PUSH str_out
CALL printf
SUB ESP, 8 ; cleanup stack
MOV EAX, EDX
PUSH EAX
PUSH str_out
CALL printf
SUB ESP, 8 ; cleanup stack
CALL exit
I am the NASM assembler and the GCC to link the object file to an executable on linux.
Essentially, this program is first putting the value of the stack pointer into register EDX, it is then printing the contents of this register twice. However, after the second printf call, the value printed to the stdout does not match the first.
This behaviour seems strange. When I replace every usage of EDX in this program with EBX, the outputted integers are identical as expected. I can only infer that EDX is overwritten at some point during the printf function call.
Why is this the case? And how can I make sure that the registers I use in future don't conflict with C lib functions?
According to the x86 ABI, EBX, ESI, EDI, and EBP are callee-save registers and EAX, ECX and EDX are caller-save registers.
It means that functions can freely use and destroy previous values EAX, ECX, and EDX.
For that reason, save values of EAX, ECX, EDX before calling functions if you don't want their values to change. It is what "caller-save" mean.
Or better, use other registers for values that you're still going to need after a function call. push/pop of EBX at the start/end of a function is much better than push/pop of EDX inside a loop that makes a function call. When possible, use call-clobbered registers for temporaries that aren't needed after the call. Values that are already in memory, so they don't need to written before being re-read, are also cheaper to spill.
Since EBX, ESI, EDI, and EBP are callee-save registers, functions have to restore the values to the original for any of those they modify, before returning.
ESP is also callee-saved, but you can't mess this up unless you copy the return address somewhere.
The ABI for the target platform (e.g. 32bit x86 Linux) defines which registers can be used by functions without saving. (i.e., if you want them preserved across a call, you have to do it yourself).
Links to ABI docs for Windows and non-Window, 32 and 64bit, at https://stackoverflow.com/tags/x86/info
Having some registers that aren't preserved across calls (available as scratch registers) means functions can be smaller. Simple functions can often avoid doing any push/pop save/restores. This cuts down on the number of instructions, leading to faster code.
It's important to have some of each: having to spill all state to memory across calls would bloat the code of non-leaf functions, and slow things down esp. in cases where the called function didn't touch all the registers.
See also What are callee and caller saved registers? for more about call-preserved vs. call-clobbered registers in general.
If I call an ARM assembly function from C, sometimes I need to pass in many arguments. If they do not fit in registers r0, r1, r2, r3 it is generally expected that 5-th, 6-th ... x-th arguments are pushed onto stack so that ARM assembly can read them from it.
So in the ARM function I receive some arguments that are on the stack. After finishing the assembly function I can either remove these arguments from stack or leave them there and expect that the C program will deal with them later.
If we are talking about GCC C and ARM assembly who is usually responsible for cleaning up the stack?
The function that made the call (A)
Or the function that was called (B)
I understand that when developing we could agree on either convention. But what is generally used as the default in this particular case (ARM assembly and GCC C)?
And how would generally a low level piece of code describe which behavior it implements? It seems that there should be some kind of standard description for this. If there isn't one it seems that you pretty much just have to try them both and look at which one does not crash.
If someone is interested in how the code could look like:
arm_function:
stmfd sp, {r4-r12, lr} # Save registers that are not the first three registers, SP->PASSED ARGUMENTS
ldmfd sp, {r4-r6} # Load 3 arguments that were passed through the stack, SP->PASSED ARGUMENTS
sub sp, sp, #40 # Adjust the stack pointer so it points to saved registers, STACK POINTER->SAVED REGISTERS->PASSED ARGUMENTS
#The main function body.
ldmfd sp!, {r4-r12, lr}, # Load saved registers STACK POINTER->PASSED ARGUMENTS
add sp, sp, #12 # Increment stack pointer to remove passed arguments, SP->NOTHING
# If the last code line would not be there, the caller would need to remove the arguments from stack.
UPDATE:
It seems that for C/C++ choice A. is pretty standard. Compilers usually use calling conventions like cdecl that work pretty similar to code in the answers below. More information can be found in this link about calling conventions. Changing C/C++ calling convention for a function does not seem to be so common/easy. With older C standard I could not manage to change it, so it looks like using A should be a decent default choice.
The current ARM procedure call standard is AAPCS.
The language-specific ABI can be found here. Relevant will be the document about C, but others should be similar (why reinvent the wheel?).
A good start for reading might be page 14 in the AAPCS.
It basically requires the caller to clean up the stack, as this is the most simple way: push additional arguments onto the stack, call the function and after return simply adjust the stack pointer by adding an offset (the number of bytes pushed on the stack; this is always a multiple of 4 (the "natural 32bit ARM word size).
But if you use gcc, you can just avoid handling the stack yourself by using inline assembler. This provides features to pass C variables (etc.) to the assembler code. This will also automatically load a parameter into a register if required. Just have a look at the gcc documentation. It is a bit hard to figure out in detail, but I prefer this to having raw assember stubs somewhere.
Ok, i added this as there might be problems understanding the principle:
caller:
...
push r5 // argument which does not fit into r0..r3 anymore
bl callee
add sp,4 // adjust SP
callee:
push r5-r7,lr // temp, variables, return address
sub sp,8 // local variables
// processing
add sp, 8 // restore previous stack frame
pop r5-r7,pc // restore temp. variables and return (replaces bx)
You can verify this by just disassmbling some sample C functions. Note that the pre- and postamble may vary if no temp registers are used or the function does not call another function (no need to stack lr for this).
Also, the caller might have to stack r0..r3 before the call. But that is a matter of compiler optimizations.
Disassembly can be done with gdb and objdump for example.
I use -mabi=aapcs for gcc invocation; not sure if gcc would otherwise use a different standard. Note that all object files have to use the same standard.
Edit:
Just had a peek in the AAPCS and that states that the SP need only 4 byte alignment. I might have confused this with the Cortex-M interrupt handling system which (for whatever reason, possibly for M7 which has 64 bit busses) aligns the SP to 8 bytes by default (software-config option).
However, SP must be 8 byte aligned at a public interface. Ok, the standard actually is more complicated than I remembered. That's why I prefer gcc caring about this stuff.
If some spaces allocated on the stack by caller function (argument passing), stack clearance done within the caller function. And how it happens you may ask. In ARM #Olaf has completely cleared, and in x86 it is usually like this:
sub esp, 8 ; make some room
... ; move arguments on stack
call func
add esp, 8 ; clean the stack
or
push eax ; push the arguments
push ebx ; or pusha, then after call, popa
call func
add esp, 8 ; assuming registers are 4 bytes each
Also how the interaction between caller and callee in a system takes places is explained in ABI (Application Binary Interface) You may find it useful.
I'd like to know how many instructions are needed for a function call in a C program compiled with gcc for x86 platforms from start to finish.
Write some code.
Compile it.
Look at the disassembly.
Count the instructions.
The answer will vary as you vary the number and type of parameters, calling conventions etc.
That is a really tricky question that's hard to answer and it may vary.
First of all in the caller it is needed to pass the parameters, depending on the type this will vary, in most cases you will have a push instruction for each parameter.
Then, in the called procedure the first instructions will be to do the allocation for local variables. This is usually done in 3 operations:
PUSH EBP
MOV EBP, ESP
SUB ESP, xxx
You will have the assembly code of the function after that.
Following the code but before the return, the ebp and esp will be restored:
MOV ESP, EBP
POP EBP
Lastly, you will have a ret instruction that depending on the calling convention will dealocate the parameters of the stack or it will leave that to the caller. You can determine this if the RET is with a number as parameter or if the parameter is 0, respectively. In case the parameter is 0 you will have POP instructions in the caller after the CALL instruction.
I would expect at least one
CALL Function
unless it is inlined, of course.
If you use -mno-accumulate-outgoing-args and -Os (or -mpreferred-stack-boundary=2, or 3 on 64-bit), then the overhead is exactly one push per argument word-sized argument, one call, and one add to adjust the stack pointer after return.
Without -mno-accumulate-outgoing-args and with default 16-byte stack alignment, gcc generates code that's roughly the same speed but roughly five times larger for function calls, for no good reason.