How to know the count of arguments when implementing backtrace in C - c

I am implementing a backtrace function in C, which can output caller's info. like this
ebp:0x00007b28 eip:0x00100869 args:0x00000000 0x00640000 0x00007b58 0x00100082
But how can I know the count of arguments of the caller?
Thank you very much

You can deduce the numbers of arguments a function uses in 32bit x86 code under some circumstances.
If the code has been compiled to use framepointers, then a given function's stackframe extends between (highest address) EBP and (lowest address / stack top) ESP. Immediately above the stack end at EBP you find the return address, and again above that you'll have, if your code is using the C calling convention (cdecl), consecutively, arg[0...].
That means: arg[0] at [EBP + 4], arg[1] at [EBP + 8 ], and so on.
When you disassemble function, look for instructions referencing [EBP + ...] and you know they access function arguments. The highest offset value used tells you how many there are.
This is of course somewhat simplified; arguments with sizes different from 32bits, code that doesn't use cdecl but e.g. fastcall, code where the framepointer has been optimized makes the method trip, at least partially.
Another option, again for cdecl functions, is to look at the return address (location of the call into the func you're interested in), and disassemble around there; you will, in many cases, find a sequence push argN; push ...; push arg0; call yourFunc and you can deduce how many arguments were passed in this instance. That's in fact the only way (from the code alone) to test how many arguments were passed to functions like printf() in a particular instance.
Again, not perfect - these days, compilers often preallocate stackspace and then use mov to write arguments instead of pushing them (on some CPUs, this is better since sequences of push instructions have dependencies on each other due to each modifying the stackpointers).
Since all these methods are heuristic this requires quite a bit of coding to automate. If compiler-generated debugging information is available, use that - it's faster.
Edit: There's another useful heuristic that can be done; Compiler-generated code for function calling often looks like this:
...
[ code that either does "push arg" or "mov [ESP ...], arg" ]
...
call function
add ESP, ...
The add instruction is there to clean up stackspace used for arguments. From the size of the immediate operand, you know how much space the args this code gave to function has used, and by implication (assuming they're all 32bit, for example), you know how many there were.
This is particularly simple given you already have the address of said add instruction if you have working backtrace code - the instruction at the return address is this add. So you can often get away with simply trying to disassemble the (single) instruction at the return address, and see if it's an add ESP, ... (sometimes it's a sub ESP, -...) and if so, calculate the number of arguments passed from the immediate operand. The code for that is much simpler than having to pull in a full disassembly library.

You can't. The number of arguments isn't saved anywhere, as you can see in this simple disassembly:
f(5);
002B144E push 5
002B1450 call f (2B11CCh)
002B1455 add esp,4
g(1, "foo");
002B1458 push offset string "foo" (2B5740h)
002B145D push 1
002B145F call g (2B11C7h)
002B1464 add esp,8
h("bar", 'd', 8);
002B1467 push 8
002B1469 push 64h
002B146B push offset string "bar" (2B573Ch)
002B1470 call h (2B11D1h)
002B1475 add esp,0Ch
Basically, only the called function knows how many arguments it has.

As Yahia commented, there's no general way.
You'll probably need to parse the debug information placed by the debugger (assuming you have compiled with gcc -g).

Glibc implements a backtrace function. It unwind backtrace, arg by arg.
You can see how they've done it in sysdeps/$ARCH/backtrace.c. Beware that it's quite hard to read.

Related

GCC C and ARM Assembly Stack Cleanup

If I call an ARM assembly function from C, sometimes I need to pass in many arguments. If they do not fit in registers r0, r1, r2, r3 it is generally expected that 5-th, 6-th ... x-th arguments are pushed onto stack so that ARM assembly can read them from it.
So in the ARM function I receive some arguments that are on the stack. After finishing the assembly function I can either remove these arguments from stack or leave them there and expect that the C program will deal with them later.
If we are talking about GCC C and ARM assembly who is usually responsible for cleaning up the stack?
The function that made the call (A)
Or the function that was called (B)
I understand that when developing we could agree on either convention. But what is generally used as the default in this particular case (ARM assembly and GCC C)?
And how would generally a low level piece of code describe which behavior it implements? It seems that there should be some kind of standard description for this. If there isn't one it seems that you pretty much just have to try them both and look at which one does not crash.
If someone is interested in how the code could look like:
arm_function:
stmfd sp, {r4-r12, lr} # Save registers that are not the first three registers, SP->PASSED ARGUMENTS
ldmfd sp, {r4-r6} # Load 3 arguments that were passed through the stack, SP->PASSED ARGUMENTS
sub sp, sp, #40 # Adjust the stack pointer so it points to saved registers, STACK POINTER->SAVED REGISTERS->PASSED ARGUMENTS
#The main function body.
ldmfd sp!, {r4-r12, lr}, # Load saved registers STACK POINTER->PASSED ARGUMENTS
add sp, sp, #12 # Increment stack pointer to remove passed arguments, SP->NOTHING
# If the last code line would not be there, the caller would need to remove the arguments from stack.
UPDATE:
It seems that for C/C++ choice A. is pretty standard. Compilers usually use calling conventions like cdecl that work pretty similar to code in the answers below. More information can be found in this link about calling conventions. Changing C/C++ calling convention for a function does not seem to be so common/easy. With older C standard I could not manage to change it, so it looks like using A should be a decent default choice.
The current ARM procedure call standard is AAPCS.
The language-specific ABI can be found here. Relevant will be the document about C, but others should be similar (why reinvent the wheel?).
A good start for reading might be page 14 in the AAPCS.
It basically requires the caller to clean up the stack, as this is the most simple way: push additional arguments onto the stack, call the function and after return simply adjust the stack pointer by adding an offset (the number of bytes pushed on the stack; this is always a multiple of 4 (the "natural 32bit ARM word size).
But if you use gcc, you can just avoid handling the stack yourself by using inline assembler. This provides features to pass C variables (etc.) to the assembler code. This will also automatically load a parameter into a register if required. Just have a look at the gcc documentation. It is a bit hard to figure out in detail, but I prefer this to having raw assember stubs somewhere.
Ok, i added this as there might be problems understanding the principle:
caller:
...
push r5 // argument which does not fit into r0..r3 anymore
bl callee
add sp,4 // adjust SP
callee:
push r5-r7,lr // temp, variables, return address
sub sp,8 // local variables
// processing
add sp, 8 // restore previous stack frame
pop r5-r7,pc // restore temp. variables and return (replaces bx)
You can verify this by just disassmbling some sample C functions. Note that the pre- and postamble may vary if no temp registers are used or the function does not call another function (no need to stack lr for this).
Also, the caller might have to stack r0..r3 before the call. But that is a matter of compiler optimizations.
Disassembly can be done with gdb and objdump for example.
I use -mabi=aapcs for gcc invocation; not sure if gcc would otherwise use a different standard. Note that all object files have to use the same standard.
Edit:
Just had a peek in the AAPCS and that states that the SP need only 4 byte alignment. I might have confused this with the Cortex-M interrupt handling system which (for whatever reason, possibly for M7 which has 64 bit busses) aligns the SP to 8 bytes by default (software-config option).
However, SP must be 8 byte aligned at a public interface. Ok, the standard actually is more complicated than I remembered. That's why I prefer gcc caring about this stuff.
If some spaces allocated on the stack by caller function (argument passing), stack clearance done within the caller function. And how it happens you may ask. In ARM #Olaf has completely cleared, and in x86 it is usually like this:
sub esp, 8 ; make some room
... ; move arguments on stack
call func
add esp, 8 ; clean the stack
or
push eax ; push the arguments
push ebx ; or pusha, then after call, popa
call func
add esp, 8 ; assuming registers are 4 bytes each
Also how the interaction between caller and callee in a system takes places is explained in ABI (Application Binary Interface) You may find it useful.

redirecting a c function at runtime and calling the original function

My program redirects a function to another function by writing a jmp instruction to the first few bytes of the function (only i386). It works like expected but it means that I can't call the original function anymore, because it will always jump to the new one.
There are two possible workarounds I could think of:
Create a new function, which overwrites the jmp instruction of the target function and call it. Afterwards the function writes back the jmp instruction.
But I'm not sure how to pass the arguments since there can be any number of them. And I wonder if the target function can jmp somewhere else and skip writing back the jmp instruction (like throw catch?).
Create a new function which executes the code I have overwritten with the jmp instruction. But I can't be sure that the overwritten data is a complete instruction. I'd have to know how many bytes I have to copy for a complete instructions.
So, finally, my questions:
Is there another way I didn't think of?
How do I find the size of an instruction? I already looked at binutils and found this but I don't know how to interpret it.
Here is a sample:
mov, 2, 0xa0, None, 1, Cpu64, D|W|CheckRegSize|No_sSuf|No_ldSuf, { Disp64|Unspecified|Byte|Word|Dword|Qword, Acc|Byte|Word|Dword|Qword }
the 2nd column shows the number of operands (2) and the last column has information about the operands, seperated by a comma
I also found this question which is pretty much the same but I can't be sure that the 7 bytes contain a whole instruction.
Writing a Trampoline Function
Any help is appreciated! Thanks.
Sebastian, you can use the exe_load_symbols() function in hotpatch to get a list of the symbols and their location in the existing exe and then see if you can overwrite that in memory. I have not tried it yet. You may be able to do it with the LD_PRELOAD environment variable as well instead of hotpatch.
--Vikas
How about something like this:
Let's say this is the original function:
Instruction1
Instruction2
Instruction3
...
RET
convert it to this:
JMP new_stuff
old:
Instruction2
Instruction3
...
RET
...
new_stuff:
CMP call_my_function,0
JNZ my_function
Instruction1
JMP old
my_function:
...
Of course you'd have to take the size of the original instructions into account (you could find that out by disassembling with objdump, for example) so that the first JMP fits perfectly (pad with NOPs if the JMP is shorter than the original instruction(s)).

How many machine instructions are needed for a function call in C?

I'd like to know how many instructions are needed for a function call in a C program compiled with gcc for x86 platforms from start to finish.
Write some code.
Compile it.
Look at the disassembly.
Count the instructions.
The answer will vary as you vary the number and type of parameters, calling conventions etc.
That is a really tricky question that's hard to answer and it may vary.
First of all in the caller it is needed to pass the parameters, depending on the type this will vary, in most cases you will have a push instruction for each parameter.
Then, in the called procedure the first instructions will be to do the allocation for local variables. This is usually done in 3 operations:
PUSH EBP
MOV EBP, ESP
SUB ESP, xxx
You will have the assembly code of the function after that.
Following the code but before the return, the ebp and esp will be restored:
MOV ESP, EBP
POP EBP
Lastly, you will have a ret instruction that depending on the calling convention will dealocate the parameters of the stack or it will leave that to the caller. You can determine this if the RET is with a number as parameter or if the parameter is 0, respectively. In case the parameter is 0 you will have POP instructions in the caller after the CALL instruction.
I would expect at least one
CALL Function
unless it is inlined, of course.
If you use -mno-accumulate-outgoing-args and -Os (or -mpreferred-stack-boundary=2, or 3 on 64-bit), then the overhead is exactly one push per argument word-sized argument, one call, and one add to adjust the stack pointer after return.
Without -mno-accumulate-outgoing-args and with default 16-byte stack alignment, gcc generates code that's roughly the same speed but roughly five times larger for function calls, for no good reason.

Recognizing stack frames in a stack using saved EBP values

I would like to divide a stack to stack-frames by looking on the raw data on the stack. I thought to do so by finding a "linked list" of saved EBP pointers.
Can I assume that a (standard and commonly used) C compiler (e.g. gcc) will always update and save EBP on a function call in the function prologue?
pushl %ebp
movl %esp, %ebp
Or are there cases where some compilers might skip that part for functions that don't get any parameters and don't have local variables?
The x86 calling conventions and the Wiki article on function prologue don't help much with that.
Is there any better method to divide a stack to stack frames just by looking on its raw data?
Thanks!
Some versions of gcc have a -fomit-frame-pointer optimization option. If memory serves, it can be used even with parameters/local variables (they index directly off of ESP instead of using EBP). Unless I'm badly mistaken, MS VC++ can do roughly the same.
Offhand, I'm not sure of a way that's anywhere close to universally applicable. If you have code with debug info, it's usually pretty easy -- otherwise though...
Even with the framepointer optimized out, stackframes are often distinguishable by looking through stack memory for saved return addresses instead. Remember that a function call sequence in x86 always consists of:
call someFunc ; pushes return address (instr. following `call`)
...
someFunc:
push EBP ; if framepointer is used
mov EBP, ESP ; if framepointer is used
push <nonvolatile regs>
...
so your stack will always - even if the framepointers are missing - have return addresses in there.
How do you recognize a return address ?
to start with, on x86, instruction have different lengths. That means return addresses - unlike other pointers (!) - tend to be misaligned values. Statistically 3/4 of them end not at a multiple of four.
Any misaligned pointer is a good candidate for a return address.
then, remember that call instructions on x86 have specific opcode formats; read a few bytes before the return address and check if you find a call opcode there (99% most of the time, it's five bytes back for a direct call, and three bytes back for a call through a register). If so, you've found a return address.
This is also a way to distinguish C++ vtables from return addresses by the way - vtable entrypoints you'll find on the stack, but looking "back" from those addresses you don't find call instructions.
With that method, you can get candidates for the call sequence out of the stack even without having symbols, framesize debugging information or anything.
The details of how to piece the actual call sequence together from those candidates are less straightforward though, you need a disassembler and some heuristics to trace potential call flows from the lowest-found return address all the way up to the last known program location. Maybe one day I'll blog about it ;-) though at this point I'd rather say that the margin of a stackoverflow posting is too small to contain this ...

How can I create a parallel stack and run a coroutine on it?

I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can't rewind to "later" on the stack. Only come back earlier.
jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.
This is how I thought it:
Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.
That's for the initialization. Now, the following situation will indefinitely loop:
Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.
All this sounds good to me. However, there is a number of things I'm not exactly at ease with.
Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent—I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?
Thanks for reading my question textwall.
You are correct in that PUSHA wont work on x64 it will raise the exception #UD, as PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:
a:
call b
b:
pop rax
RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state that holds the following:
initarg
arg
registers (also contains the flags)
caller_registers
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func is a pointer to the coroutine function body.
This function does the following:
allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure
Now we have another function:
void* coroutine_next(coroutine_state cs, void* arg)
where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
store all current flags/registers in cs.caller_registers except for RSP, see step 3.
store the arg in cs.arg
fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, efectively resuming execution of the coroutine
Note that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame_pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.
store flags/registers in cs.registers
fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (lets just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip This essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned arg. Let's just say if arg is NULL, then the coroutine has terminated, otherwise it's an arbitrary data structure.
So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.
The coroutine's function itself is declared:
void my_coro(coroutine_state cs)
cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "thats easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know its not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
boost.coroutine (boost.context) at boost.org does all for you

Resources