I am in the process of porting an application from x86 to x64. I am using Visual Studio 2009; most of the code is C++ and some portions are plain C. The __asm keyword is not supported when compiling towards x64 and our application contains a few portions of inline assembler. I did not write this code so I don't know exactly what et is supposed to do:
int CallStackSize() {
DWORD Frame;
PDWORD pFrame;
__asm
{
mov EAX, EBP
mov Frame, EAX
}
pFrame = (PDWORD)Frame;
/*... do stuff with pFrame here*/
}
EBP is the base pointer to the stack of the current function. Is there some way to obtain the stack pointer without using inline asm? I have been looking at the intrinsics that Microsoft offers as a substitute for inline asm but I could not find anything that gave me something usefull. Any ideas?
Andreas asked what stuff is done with pFrame. Here is the complete function:
int CallStackSize(DWORD frameEBP = 0)
{
DWORD pc;
int tmpint = 0;
DWORD Frame;
PDWORD pFrame, pPrevFrame;
if(!frameEBP) // No frame supplied. Use current.
{
__asm
{
mov EAX, EBP
mov Frame, EAX
}
}
else Frame = frameEBP;
pFrame = (PDWORD)Frame;
do
{
pc = pFrame[1];
pPrevFrame = pFrame;
pFrame = (PDWORD)pFrame[0]; // precede to next higher frame on stack
if ((DWORD)pFrame & 3) // Frame pointer must be aligned on a DWORD boundary. Bail if not so.
break;
if (pFrame <= pPrevFrame)
break;
// Can two DWORDs be read from the supposed frame address?
if(IsBadWritePtr(pFrame, sizeof(PVOID)*2))
break;
tmpint++;
} while (true);
return tmpint;
}
The variable pc is not used. It looks like this function walks down the stack until it fails. It assumes that it can't read outside the applications stack so when it fails it has measured the depth of the call stack. This code does not need to compile on _EVERY_SINGLE compiler out there. Just VS2009. The application does not need to run on EVERY_SINGLE computer out there. We have complete control of deployment since we install/configure it ourselves and deliver the whole thing to our customers.
The really right thing to do would be to rewrite whatever this function does so that it does not require access to the actual frame pointer. That is definitely bad behavior.
But, to do what you are looking for you should be able to do:
int CallStackSize() {
__int64 Frame = 0; /* MUST be the very first thing in the function */
PDWORD pFrame;
Frame++; /* make sure that Frame doesn't get optimized out */
pFrame = (PDWORD)(&Frame);
/*... do stuff with pFrame here*/
}
The reason this works is that in C usually the first thing a function does is save off the location of the base pointer (ebp) before allocating local variables. By creating a local variable (Frame) and then getting the address of if, we're really getting the address of the start of this function's stack frame.
Note: Some optimizations could cause the "Frame" variable to be removed. Probably not, but be careful.
Second Note: Your original code and also this code manipulates the data pointed to by "pFrame" when "pFrame" itself is on the stack. It is possible to overwrite pFrame here by accident and then you would have a bad pointer, and could get some weird behavior. Be especially mindful of this when moving from x86 to x64, because pFrame is now 8 bytes instead of 4, so if your old "do stuff with pFrame" code was accounting for the size of Frame and pFrame before messing with memory, you'll need to account for the new, larger size.
You can use the _AddressOfReturnAddress() intrinsic to determine a location in the current frame pointer, assuming it hasn't been completely optimized away. I'm assuming that the compiler will prevent that function from optimizing away the frame pointer if you explicitly refer to it. Or, if you only use a single thread, you can use the IMAGE_NT_HEADER.OptionalHeader.SizeOfStackReserve and IMAGE_NT_HEADER.OptionalHeader.SizeOfStackCommit to determine the main thread's stack size. See this for how to access the IMAGE_NT_HEADER for the current image.
I would also recommend against using IsBadWritePtr to determine the end of the stack. At the very least you will probably cause the stack to grow until you hit the reserve, as you'll trip a guard page. If you really want to find the current size of the stack, use VirtualQuery with the address you are checking.
And if the original use is to walk the stack, you can use StackWalk64 for that.
There is no guarantee that RBP (the x64's equivalent of EBP) is actually a pointer to the current frame in the callstack. I guess Microsoft decided that despite several new general purpose registers, that they needed another one freed up, so RBP is only used as framepointer in functions that call alloca(), and in certain other cases. So even if inline assembly were supported, it would not be the way to go.
If you just want to backtrace, you need to use StackWalk64 in dbghelp.dll. It's in the dbghelp.dll that's shipped with XP, and pre-XP there was no 64-bit support, so you shouldn't need to ship the dll with your application.
For your 32-bit version, just use your current method. Your own methods will likely be smaller than the import library for dbghelp, much less the actual dll in memory, so it is a definite optimization (personal experience: I've implemented a Glibc-style backtrace and backtrace_symbols for x86 in less than one-tenth the size of the dbghelp import library).
Also, if you're using this for in-process debugging or post-release crash report generation, I would highly recommend just working with the CONTEXT structure supplied to the exception handler.
Maybe some day I'll decide to target the x64 seriously, and figure out a cheap way around using StackWalk64 that I can share, but since I'm still targeting x86 for all my projects I haven't bothered.
Microsoft provides a library (DbgHelp) which takes care of the stack walking, and you should use instead of relying on assembly tricks. For example, if the PDB files are present, it can walk optimized stack frames too (those that don't use EBP).
CodeProject has an article which explains how to use it:
http://www.codeproject.com/KB/threads/StackWalker.aspx
If you need the precise "base pointer" then inline assembly is the only way to go.
It is, surprisingly, possible to write code that munges the stack with relatively little platform-specific code, but it's hard to avoid assembly altogether (depending on what you're doing).
If all you're trying to do is avoid overflowing the stack, you can just take the address of any local variable.
.code
PUBLIC getStackFrameADDR _getStackFrameADDR
getStackFrameADDR:
mov RAX, RBP
ret 0
END
Something like that could work for you.
Compile it with ml64 or jwasm and call it using this in your code
extern "C" void getstackFrameADDR(void);
Related
My understanding of the memory structure under C is a program's memory is split with the stack and the heap each growing from either end of the block conceivably allocating all of ram but obviously abstracted to some kind of OS memory fragment manager thing.
Stack designed for handling local variables (automatic storage) and heap for memory allocation (dynamic storage).
(Editor's note: there are C implementations where automatic storage doesn't use a "call stack", but this question assumes a normal modern C implementation on a normal CPU where locals do use the callstack if they can't just live in registers.)
Say I want to implement a stack data structure for some data parsing algorithm. Its lifetime and scope is limited to one function.
I can think of 3 ways to do such a thing, yet none of them seem to me as the cleanest way to go about this given the scenario.
My first though is to construct a stack in the heap, like C++ std::vector:
Some algorithm(Some data)
{
Label *stack = new_stack(stack_size_estimate(data));
Iterator i = some_iterator(data);
while(i)
{
Label label = some_label(some_iterator_at(i));
if (label_type_a(label))
{
push_stack(stack,label);
}
else if(label_type_b(label))
{
some_process(&data,label,pop_stack(stack));
}
i = some_iterator_next(i);
}
some_stack_cleanup(&data,stack);
delete_stack(stack);
return data;
}
This method is alright but it's wasteful in that the stack size is a guess and at any moment push_stack could call some internal malloc or realloc and cause irregular slowdowns. None of which are problems for this algorithm, but this construct seems better suited for applications in which a stack has to be maintained across multiple contexts. That isn't the case here; the stack is private to this function and is deleted before exit, just like automatic storage class.
My next thought is recursion. Because recursion uses the builtin stack this seems closer to what I want.
Some algorithm(Some data)
{
Iterator i = some_iterator(data);
return some_extra(algorithm_helper(extra_from_some(data),&i);
}
Extra algorithm_helper(Extra thing, Iterator* i)
{
if(!*i)
{return thing;}
{
Label label = some_label(some_iterator_at(i));
if (label_type_a(label))
{
*i = some_iterator_next(*i);
return algorithm_helper
( extra_process( algorithm_helper(thing,i), label), i );
}
else if(label_type_b(label))
{
*i = some_iterator_next(*i);
return extra_attach(thing,label);
}
}
}
This method saves me from writing and maintaining a stack. The code, to me, seems harder to follow, not that it matters to me.
My main issue with it is this is using way more space.
With stack frames holding copies of this Extra construct (which basically contains the Some data plus the actual bits wanted to be held in the stack) and unnecessary copies of the exact same iterator pointer in every frame: because it's "safer" then referencing some static global (and I don't know how to not do it this way). This wouldn't be a problem if the compiler did some clever tail recursion like thing but I don't know if I like crossing my fingers and hope my compiler is awesome.
The third way I can think of involves some kind of dynamic array thing that can grow on the stack being the last thing there written using some kind of C thing I don't know about.
Or an extern asm block.
Thinking about this, that's what I'm looking for but I don't see my self writing an asm version unless it's dead simple and I don't see that being easier to write or maintain despite it seeming simpler in my head. And obviously it wouldn't be portable across ISAs.
I don't know if I'm overlooking some feature or if I need to find another language or if I should rethink my life choices. All could be true... I hope it's just the first one.
I'm not opposed to using some library. Is there one, and if so how does it work? I didn't find anything in my searches.
I recently learned about Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference, but also I can't imagine them working that way.
tl; dr: use std::vector or an equivalent.
(Edited)
Regarding your opening statement: The days of segments are over. These days processes have multiple stacks (one for each thread), but all share one heap.
Regarding option 1: Instead of writing and maintaining a stack, and guessing its size, you should just literally use std::vector, or a C wrapper around it, or a C clone of it - in any case, use the 'vector' data structure.
Vector's algorithm is generally quite efficient. Not perfect, but generally good for many, may real-world use cases.
Regarding option 2: You are right, at least as long as the discussion is confined to C. In C, recursion is both wasteful and non-scalable. In some other languages, notably in functional languages, recursion is the way to express these algorithms, and tail-call optimization is part of the language definition.
Regarding option 3: The closest to that C thing you're looking for is alloca(). It allows you to grow the stack frame, and if the stack doesn't have enough memory, the OS will allocate it. However, it's going to be quite difficult to build a stack object around it, since there's no realloca(), as pointed out by #Peter Cordes.
The Other drawback is that stacks are still limited. On Linux, the stack is typically limited to 8 MB. This is the same scalability limitation as with recursion.
Regarding variable length arrays: VLAs are basically syntactic sugar, a notation convenience. Beyond syntax, they have the exact same capabilities of arrays (actually, even fewer, viz. sizeof() doesn't work), let alone the dynamic power of std::vector.
In practice, if you can't set a hard upper bound on possible size of less than 1kiB or so, you should normally just dynamically allocate. If you can be sure the size is that small, you could consider using alloca as the container for your stack.
(You can't conditionally use a VLA, it has to be in-scope. Although you could have its size be zero by declaring it after an if(), and set a pointer variable to the VLA address, or to malloc. But alloca would be easier.)
In C++ you'd normally std::vector, but it's dumb because it can't / doesn't use realloc (Does std::vector *have* to move objects when growing capacity? Or, can allocators "reallocate"?). So in C++ it's a tradeoff between more efficient growth vs. reinventing the wheel, although it's still amortized O(1) time. You can mitigate most of it with a fairly large reserve() up front, because memory you alloc but never touch usually doesn't cost anything.
In C you have to write your own stack anyway, and realloc is available. (And all C types are trivially copyable, so there's nothing stopping you from using realloc). So when you do need to grow, you can realloc the storage. But if you can't set a reasonable and definitely-large-enough upper bound on function entry and might need to grow, then you should still track capacity vs. in-use size separately, like std::vector. Don't call realloc on every push/pop.
Using the callstack directly as a stack data structure is easy in pure assembly language (for ISAs and ABIs that use a callstack i.e. "normal" CPUs like x86, ARM, MIPS, etc). And yes, in asm worth doing for stack data structures that you know will be very small, and not worth the overhead of malloc / free.
Use asm push or pop instructions (or equivalent sequence for ISAs without a single-instruction push / pop). You can even check the size / see if the stack data structure is empty by comparing against a saved stack-pointer value. (Or just maintain an integer counter along side your push/pops).
A very simple example is the inefficient way some people write int->string functions. For non-power-of-2 bases like 10, you generate digits in least-significant first order by dividing by 10 to remove them one at a time, with digit = remainder. You could just store into a buffer and decrement a pointer, but some people write functions that push in the divide loop and then pop in a second loop to get them in printing order (most-significant first). e.g. Ira's answer on How do I print an integer in Assembly Level Programming without printf from the c library? (My answer on the same question shows the efficient way which is also simpler once you grok it.)
It doesn't particularly matter that the stack grows towards the heap, just that there is some space you can use. And that stack memory is already mapped, and normally hot in cache. This is why we might want to use it.
Stack above heap happens to be true under GNU/Linux for example, which normally puts the main thread's user-space stack near the top of user-space virtual address space. (e.g. 0x7fff...) Normally there's a stack-growth limit that's much smaller than the distance from stack to heap. You want an accidental infinite recursion to fault early, like after consuming 8MiB of stack space, not drive the system to swapping as it uses gigabytes of stack. Depending on the OS, you can increase the stack limit, e.g. ulimit -s. And thread stacks are normally allocated with mmap, the same as other dynamic allocation, so there's no telling where they'll be relative to other dynamic allocation.
AFAIK it's impossible from C, even with inline asm
(Not safely, anyway. An example below shows just how evil you'd have to get to write this in C the way you would in asm. It basically proves that modern C is not a portable assembly language.)
You can't just wrap push and pop in GNU C inline asm statements because there's no way to tell the compiler you're modifying the stack pointer. It might try to reference other local variables relative to the stack pointer after your inline asm statement changed it.
Possibly if you knew you could safely force the compiler to create a frame pointer for that function (which it would use for all local-variable access) you could get away with modifying the stack pointer. But if you want to make function calls, many modern ABIs require the stack pointer to be over-aligned before a call. e.g. x86-64 System V requires 16-byte stack alignment before a call, but push/pop work in units of 8 bytes. OTOH, 32-bit ARM (and some 32-bit x86 calling conventions, e.g. Windows) don't have that feature so any number of 4-byte pushes would leave the stack correctly aligned for a function call.
I wouldn't recommend it, though; if you want that level of optimization (and you know how to optimize asm for the target CPU), it's probably safer to write your whole function in asm.
Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference
VLAs aren't resizable. After you do int VLA[n]; you're stuck with that size. Nothing you can do in C will guarantee you more memory that's contiguous with that array.
Same problem with alloca(size). It's a special compiler built-in function that (on a "normal" implementation) decrements the stack pointer by size bytes (rounded to a multiple of the stack width) and returns that pointer. In practice you can make multiple alloca calls and they will very likely be contiguous, but there's zero guarantee of that so you can't use it safely without UB. Still, you might get away with this on some implementations, at least for now until future optimizations notice the UB and assume that your code can't be reachable.
(And it could break on some calling conventions like x86-64 System V where VLAs are guaranteed to be 16-byte aligned. An 8-byte alloca there probably rounds up to 16.)
But if you did want to make this work, you'd maybe use long *base_of_stack = alloca(sizeof(long)); (the highest address: stacks grow downward on most but not all ISAs / ABIs - this is another assumption you'd have to make).
Another problem is that there's no way to free alloca memory except by leaving the function scope. So your pop has to increment some top_of_stack C pointer variable, not actually moving the real architectural "stack pointer" register. And push will have to see whether the top_of_stack is above or below the high-water mark which you also maintain separately. If so you alloca some more memory.
At that point you might as well alloca in chunks larger than sizeof(long) so the normal case is that you don't need to alloc more memory, just move the C variable top-of-stack pointer. e.g. chunks of 128 bytes maybe. This also solves the problem of some ABIs keeping the stack pointer over-aligned. And it lets the stack elements be narrower than the push/pop width without wasting space on padding.
It does mean we end up needing more registers to sort of duplicate the architectural stack pointer (except that the SP never increases on pop).
Notice that this is like std::vector's push_back logic, where you have an allocation size and an in-use size. The difference is that std::vector always copies when it wants more space (because implementations fail to even try to realloc) so it has to amortize that by growing exponentially. When we know growth is O(1) by just moving the stack pointer, we can use a fixed increment. Like 128 bytes, or maybe half a page would make more sense. We're not touching memory at the bottom of the allocation immediately; I haven't tried compiling this for a target where stack probes are needed to make sure you don't move RSP by more than 1 page without touching intervening pages. MSVC might insert stack probes for this.
Hacked up alloca stack-on-the-callstack: full of UB and miscompiles in practice with gcc/clang
This mostly exists to show how evil it is, and that C is not a portable assembly language. There are things you can do in asm you can't do in C. (Also including efficiently returning multiple values from a function, in different registers, instead of a stupid struct.)
#include <alloca.h>
#include <stdlib.h>
void some_func(char);
// assumptions:
// stack grows down
// alloca is contiguous
// all the UB manages to work like portable assembly language.
// input assumptions: no mismatched { and }
// made up useless algorithm: if('}') total += distance to matching '{'
size_t brace_distance(const char *data)
{
size_t total_distance = 0;
volatile unsigned hidden_from_optimizer = 1;
void *stack_base = alloca(hidden_from_optimizer); // highest address. top == this means empty
// alloca(1) would probably be optimized to just another local var, not necessarily at the bottom of the stack frame. Like char foo[1]
static const int growth_chunk = 128;
size_t *stack_top = stack_base;
size_t *high_water = alloca(growth_chunk);
for (size_t pos = 0; data[pos] != '\0' ; pos++) {
some_func(data[pos]);
if (data[pos] == '{') {
//push_stack(stack, pos);
stack_top--;
if (stack_top < high_water) // UB: optimized away by clang; never allocs more space
high_water = alloca(growth_chunk);
// assert(high_water < stack_top && "stack growth happened somewhere else");
*stack_top = pos;
}
else if(data[pos] == '}')
{
//total_distance += pop_stack(stack);
size_t popped = *stack_top;
stack_top++;
total_distance += pos - popped;
// assert(stack_top <= stack_base)
}
}
return total_distance;
}
Amazingly, this seems to actually compile to asm that looks correct (on Godbolt), with gcc -O1 for x86-64 (but not at higher optimization levels). clang -O1 and gcc -O3 optimize away the if(top<high_water) alloca(128) pointer compare so this is unusable in practice.
< pointer comparison of pointers derived from different objects is UB, and it seems even casting to uintptr_t doesn't make it safe. Or maybe GCC is just optimizing away the alloca(128) based on the fact that high_water = alloca() is never dereferenced.
https://godbolt.org/z/ZHULrK shows gcc -O3 output where there's no alloca inside the loop. Fun fact: making volatile int growth_chunk to hide the constant value from the optimizer makes it not get optimized away. So I'm not sure it's pointer-compare UB that's causing the issue, it's more like accessing memory below the first alloca instead of dereferencing a pointer derived from the second alloca that gets compilers to optimize it away.
# gcc9.2 -O1 -Wall -Wextra
# note that -O1 doesn't include some loop and peephole optimizations, e.g. no xor-zeroing
# but it's still readable, not like -O1 spilling every var to the stack between statements.
brace_distance:
push rbp
mov rbp, rsp # make a stack frame
push r15
push r14
push r13 # save some call-preserved regs for locals
push r12 # that will survive across the function call
push rbx
sub rsp, 24
mov r12, rdi
mov DWORD PTR [rbp-52], 1
mov eax, DWORD PTR [rbp-52]
mov eax, eax
add rax, 23
shr rax, 4
sal rax, 4 # some insane alloca rounding? Why not AND?
sub rsp, rax # alloca(1) moves the stack pointer, RSP, by whatever it rounded up to
lea r13, [rsp+15]
and r13, -16 # stack_base = 16-byte aligned pointer into that allocation.
sub rsp, 144 # alloca(128) reserves 144 bytes? Ok.
lea r14, [rsp+15]
and r14, -16 # and the actual C allocation rounds to %16
movzx edi, BYTE PTR [rdi] # data[0] check before first iteration
test dil, dil
je .L7 # if (empty string) goto return 0
mov ebx, 0 # pos = 0
mov r15d, 0 # total_distance = 0
jmp .L6
.L10:
lea rax, [r13-8] # tmp_top = top-1
cmp rax, r14
jnb .L4 # if(tmp_top < high_water)
sub rsp, 144
lea r14, [rsp+15]
and r14, -16 # high_water = alloca(128) if body
.L4:
mov QWORD PTR [r13-8], rbx # push(pos) - the actual store
mov r13, rax # top = tmp_top completes the --top
# yes this is clunky, hopefully with more optimization gcc would have just done
# sub r13, 8 and used [r13] instead of this RAX tmp
.L5:
add rbx, 1 # loop condition stuff
movzx edi, BYTE PTR [r12+rbx]
test dil, dil
je .L1
.L6: # top of loop body proper, with 8-bit DIL = the non-zero character
movsx edi, dil # unofficial part of the calling convention: sign-extend narrow args
call some_func # some_func(data[pos]
movzx eax, BYTE PTR [r12+rbx] # load data[pos]
cmp al, 123 # compare against braces
je .L10
cmp al, 125
jne .L5 # goto loop condition check if nothing special
# else: it was a '}'
mov rax, QWORD PTR [r13+0]
add r13, 8 # stack_top++ (8 bytes)
add r15, rbx # total += pos
sub r15, rax # total -= popped value
jmp .L5 # goto loop condition.
.L7:
mov r15d, 0
.L1:
mov rax, r15 # return total_distance
lea rsp, [rbp-40] # restore stack pointer to point at saved regs
pop rbx # standard epilogue
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
This is like you'd do for a dynamically allocated stack data structure except:
it grows downward like the callstack
we get more memory from alloca instead of realloc. (realloc can also be efficient if there's free virtual address space after the allocation). C++ chose not to provide a realloc interface for their allocator, so std::vector always stupidly allocs + copies when more memory is required. (AFAIK no implementations optimize for the case where new hasn't been overridden and use a private realloc).
it's totally unsafe and full of UB, and fails in practice with modern optimizing compilers
the pages will never get returned to the OS: if you use a large amount of stack space, those pages stay dirty indefinitely.
If you can choose a size that's definitely large enough, you could use a VLA of that size.
I'd recommend starting at the top and going downward, to avoid touching memory far below the currently in-use region of the callstack. That way, on an OS that doesn't need "stack probes" to grow the stack by more than 1 page, you might avoid ever touching memory far below the stack pointer. So the small amount of memory you do end up using in practice might all be within an already mapped page of the callstack, and maybe even cache lines that were already hot if some recent deeper function call already used them.
If you do use the heap, you can minimize realloc costs by doing a pretty large allocation. Unless there was a block on the free-list that you could have gotten with a smaller allocation, in general over-allocating have very low cost if you never touch parts you didn't need, especially if you free or shrink it before doing any more allocations.
i.e. don't memset it to anything. If you want zeroed memory, use calloc which may be able to get zeroed pages from the OS for you.
Modern OSes use lazy virtual memory for allocations so the first time you touch a page, it typically has to page-fault and actually get wired into the HW page tables. Also a page of physical memory has to get zeroed to back this virtual page. (Unless the access was a read, then Linux will copy-on-write map the page to a shared physical page of zeros.)
A virtual page you never even touch will just be a larger size in an extent book-keeping data structure in the kernel. (And in the user-space malloc allocator). This doesn't add anything to the cost of allocating it, or to freeing it, or using the earlier pages that you do touch.
Not really a true answer, but a bit too long for a mere comment.
In fact, the image of the stack and the heap and growing each towards the other is over simplistic. It used to be true with the 8086 processor series (at least in some memory models) where the stack and the heap shared one single segment of memory, but even the old Windows 3.1 system came with some API functions allowing to allocate memory outside of the heap (search for GlobalAlloc opposited to LocalAlloc), provided the processor was at least a 80286.
But modern systems all use virtual memory. With virtual memory there is no longer a nice consecutive segment shared by the heap and the stack, and the OS can give memory wherever it want (provided of course it can find free memory somewhere). But the stack has still to be a consecutive segment. Because of that, the size of that segment is determinated at build time and is fixed, while the size of the heap is only constrained by the maximum memory the system can allocate to the process. That is the reason why many recommend to use the stack only for small data structures and always allocate large ones. Furthermore, I know no portable way for a program to know its stacksize, not speaking of its free stacksize.
So IMHO, what is important here is to wonder whether the size of your stack is small enough. If it can exceed a small limit, just go for allocated memory, because it will be both easier and more robust. Because you can (and should) test for allocation errors, but a stack overflow is always fatal.
Finally, my advice is not to try to use the system stack for your own dedicated usage, even if limited to one single function, except if you can cleanly ask for a memory array in the stack and build your own stack management over it. Using assembly language to directly use the underlying stack will add a lot of complexity (not speaking of lost portability) for a hypothetic minimal gain. Just don't. Even if you want to use assembly language instructions for a low level optimization of your stack, my advice is to use a dedicated memory segment and leave the system stack for the compiler.
My experience says that the more complexity you put into your code, the harder it will be to maintain, and the less robust.
So just follow best practices and only use low level optimizations when and where you cannot avoid them.
Linux x86_64.
gcc 5.x
I was studying the output of two codes, with -fomit-frame-pointer and without (gcc at "-O3" enables that option by default).
pushq %rbp
movq %rsp, %rbp
...
popq %rbp
My question is :
If I globally disable that option, even for, at the extreme, compiling an operating system, is there a catch ?
I know that interrupts use that information, so is that option good only for user space ?
The compilers always generate self consistent code, so disabling the frame pointer is fine as long as you don't use external/hand crafted code that makes some assumption about it (e.g. by relying on the value of rbp for example).
The interrupts don't use the frame pointer information, they may use the current stack pointer for saving a minimal context but this is dependent on the type of interrupt and OS (an hardware interrupt uses a Ring 0 stack probably).
You can look at Intel manuals for more information on this.
About the usefulness of the frame pointer:
Years ago, after compiling a couple of simple routines and looking at the generated 64 bit assembly code I had your same question.
If you don't mind reading a whole lot of notes I have written for myself back then, here they are.
Note: Asking about the usefulness of something is a little bit relative. Writing assembly code for the current main 64 bit ABIs I found my self using the stack frame less and less. However this is just my coding style and opinion.
I like using the frame pointer, writing the prologue and epilogue of a function, but I like direct uncomfortable answers too, so here's how I see it:
Yes, the frame pointer is almost useless in x86_64.
Beware it is not completely useless, especially for humans, but a compiler doesn't need it anymore.
To better understand why we have a frame pointer in the first place it is better to recall some history.
Back in the real mode (16 bit) days
When Intel CPUs supported only "16 bit mode" there were some limitation on how to access the stack, particularly this instruction was (and still is) illegal
mov ax, WORD [sp+10h]
because sp cannot be used as a base register. Only a few designated registers could be used for such purpose, for example bx or the more famous bp.
Nowadays it's not a detail everybody put their eyes on but bp has the advantage over other base register that by default it implicitly implicates the use of ss as a segment/selector register, just like implicit usages of sp (by push, pop, etc), and like esp does on later 32-bit processors.
Even if your program was scattered all across memory with each segment register pointing to a different area, bp and sp acted the same, after all that was the intent of the designers.
So a stack frame was usually necessary and consequently a frame pointer.
bp effectively partitioned the stack in four parts: the arguments area, the return address, the old bp (just a WORD) and the local variables area. Each area being identified by the offset used to access it: positive for the arguments and return address, zero for the old bp, negative for the local variables.
Extended effective addresses
As the Intel CPUs were evolving, the more extensive 32-bit addressing modes were added.
Specifically the possibility to use any 32-bit general-purpose register as a base register, this includes the use of esp.
Being instructions like this
mov eax, DWORD [esp+10h]
now valid, the use of the stack frame and the frame pointer seems doomed to an end.
Likely this was not the case, at least in the beginnings.
It is true that now it is possible to use entirely esp but the separation of the stack in the mentioned four areas is still useful, especially for humans.
Without the frame pointer a push or a pop would change an argument or local variable offset relative to esp, giving form to code that look non intuitive at first sight. Consider how to implement the following C routine with cdecl calling convention:
void my_routine(int a, int b)
{
return my_add(a, b);
}
without and with a framestack
my_routine:
push DWORD [esp+08h]
push DWORD [esp+08h]
call my_add
ret
my_routine:
push ebp
mov ebp, esp
push DWORD [ebp+0Ch]
push DWORD [ebp+08h]
call my_add
pop ebp
ret
At first sight it seems that the first version pushes the same value twice. It actually pushes the two separate arguments however, as the first push lowers esp so the same effective address calculation points the second push to a different argument.
If you add local variables (especially lots of them) then the situation quickly becomes hard to read: Does mov eax, [esp+0CAh] refer to a local variable or to an argument? With a stack frame we have fixed offsets for the arguments and local variables.
Even the compilers at first still preferred the fixed offsets given by the use of the frame base pointer. I see this behavior changing first with gcc.
In a debug build the stack frame effectively adds clarity to the code and makes it easy for the (proficient) programmer to follow what is going on and, as pointed out in the comment, lets them recover the stack frame more easily.
The modern compilers however are good at math and can easily keep count of the stack pointer movements and generate the appropriate offsets from esp, omitting the stack frame for faster execution.
When a CISC requires data alignment
Until the introduction of SSE instructions the Intel processors never asked much from the programmers compared to their RISC brothers.
In particular they never asked for data alignment, we could access 32 bit data on an address not a multiple of 4 with no major complaint (depending on the DRAM data width, this may result on increased latency).
SSE used 16 bytes operands that needed to be accessed on 16 byte boundary, as the SIMD paradigm becomes implemented efficiently in the hardware and becomes more popular the alignment on 16 byte boundary becomes important.
The main 64 bit ABIs now require it, the stack must be aligned on paragraphs (ie, 16 bytes).
Now, we are usually called such that after the prologue the stack is aligned, but suppose we are not blessed with that guarantee, we would need to do one of this
push rbp push rbp
mov rbp, rsp mov rbp, rsp
and spl, 0f0h sub rsp, xxx
sub rsp, 10h*k and spl, 0f0h
One way or another the stack is aligned after these prologues, however we can no longer use a negative offset from rbp to access local vars that need alignment, because the frame pointer itself is not aligned.
We need to use rsp, we could arrange a prologue that has rbp pointing at the top of an aligned area of local vars but then the arguments would be at unknown offsets.
We can arrange a complex stack frame (maybe with more than one pointer) but the key of the old fashioned frame base pointer was its simplicity.
So we can use the frame pointer to access the arguments on the stack and the stack pointer for the local variables, fair enough.
Alas the role of stack for arguments passing has been reduced and for a small number of arguments (currently four) it is not even used and in the future it will probably be used even less.
So we don't use the frame pointer for local variables (mostly), nor for the arguments (mostly), for what do we use it?
It saves a copy of the original rsp, so to restore the stack pointer at function exit, a mov is enough. If the stack is aligned with an and, which is not invertible, an original copy is necessary.
Actually some ABIs guarantee that after the standard prologue the stack is aligned thereby allowing us to use the frame pointer as usual.
Some variables don't need alignment and can be accessed with an unaligned frame pointer, this is usually true for hand crafted code.
Some functions require more than four parameters.
Summary
The frame pointer is a vestigial paradigm from 16 bit programs that has proven itself still useful on 32 bit machines because of its simplicity and clarity when accessing local variables and arguments.
On 64 bit machines however the strict requirements vanish most of the simplicity and clarity, the frame pointer remains used in debug mode however.
On the fact that the frame pointer can be used to make fun things: it is true I guess, I've never seen such code but I can image how it would work.
I, however, focused on the housekeeping role of the frame pointer as this is the way I always have seen it.
All the crazy things can be done with any pointer set to the same value of the frame pointer, I give the latter a more "special" role.
VS2013 for example sometimes uses rdi as a "frame pointer", but I don't consider it a real frame pointer if it doesn't use rbp/ebp/bp.
To me the use of rdi means a Frame Pointer Omission optimization :)
Prior Knowledge
I've read and understand a bit about stacks and data structures, but couldn't find the answer to this specific question. I know that any programmer worth anything will implement security exception handling beyond the default in their programs.
Situation
I would like to understand how a program could be written in C that establishes a relatively robust stack with exception handling for some arbitrary case S.
My goal is to discern from this very specific information why (from what I understand) it is always possible to exploit the SEH in a program and execute arbitrary code.
The issue is not that I do not understand the concept of overflowing a buffer - I do not understand why (in very specific, CPU architecture related reasons) the security implemented on the stack (canaries, etc) cannot sufficiently address these issues (like heap overflows cannot be stopped?).
Reference Information
"There is no sane way to alter the layout of data within a structure; structures are expected to be the same between modules, especially with shared libraries. Any data in a structure after a buffer is impossible to protect with canaries; thus, programmers must be very careful about how they organize their variables and use their structures. In C and C++, structures with buffers should either be malloc()ed or obtained with new."
- via http://en.wikipedia.org/wiki/Buffer_overflow_protection#Implementations
Also:
http://blogs.msdn.com/b/michael_howard/archive/2006/08/16/702707.aspx
If anyone knows a good resource for understanding this material or can provide a code snippet, I would be very grateful.
The security of the stack can be made secure enough to cope with stack overflows:
http://msdn.microsoft.com/en-us/library/9a89h429(VS.80).aspx
The question as to why you need registered exception handlers (safe-SEH) and why normal exception handlers won't cut it is because of the case where you get very large stack overflows.
Let's suppose I have the function which begins
try {
char buffer[N];
strcpy(&buffer, &attacker);
} __except(...) { }
This might translate into the assembly code
push ebp
mov ebp, esp
; GS if you want to here
; install the exception handler:
push lbl_Exceptionhandler
push dword ptr [fs:0]
mov dword ptr[fs:0], esp
; setup the locals inside the stack
sub esp, LOCALS
; GS if you want to here
; call strcpy
lea ecx [ebp + offset_to_buffer];
push ecx
lea edx, [ebp + offset_to_attacker]
push edx
call _strcpy
add esp, 8
; uninstall the locals
mov esp, ebp
; uninstall the exception handler
pop dword ptr [fs:0]
; return
pop ebp
; optionally check GS cookies that we might have also inserted at any point in this function.
call _checksecuritycookie
ret
Or in other words, the stack looks like this:
RET PTR
/GS1
SAVED EBP
/GS2
SAVED FS:0
/GS3
LOCAL char buffer[N]
GS1, GS2 and GS3 are locations where stack canaries might choose to write the stack cookie. Note that the cookie will only be checked at the end of the function (This is important in computer security. When you introduce a check you need to think not only whether the check will detect the overflow, but whether it will detect it before it is already too late; and that requires thinking where the check will take place. For stack cookies, the cookie is only inspected on function exits, because stack cookies are generally only there to protect the return address, not to protect local variables).
The problem with normal exception handlers is what happens if the attacker buffer is really huge. Let's suppose it's so huge it destroys the entire stack, writes onto the guard page for the thread and triggers a fault?
Well, the kernel calls back into ntdll and tells it to sort it's process out, and ntdll's first port of call is to see if there are any registered exception handlers. Now how does it find what exception handler to call? Well, it looks at fs:0, which points to the exception handler on the stack, and calls the exception handler pointer. Except that exception handler's on the stack that the attacker just destroyed.
Oops. Now the attacker has control of EIP and you lose.
Safe-SEH solves this problem by noting that the list of exception handlers that you might ever want to call is in fact a finite list entirely determined at compile time. By burning this list into the PE file itself, ntdll has a chance to double check that the exception handler that it should jump to is, in fact, a real exception handler and not the cause of some evil attacker's plot to take over your EIP.
There's a cost to Safe-SEH (hence why it is opt in), but that cost is that it becomes more expensive to catch an exception, since ntdll will now do more work before your exception handler takes over.
Despite this, my advice is that SafeSEH should always be on. Making it easier to lose your customer's credit card details because your app is critically performance dependent on the speed of throwing exceptions suggests a mentality so horrendously broken in the developers mind that they should be immediately put into a cannon and fired into the sun to avoid their awful code from damaging society.
The normal way to implement a stack in C is to use a linked list. For example:
struct stack_entry {
struct stack_entry *previous;
/* other fields for the actual data */
}
struct stack_entry *stack_top = NULL;
void push(struct stack_entry *entry) {
entry->previous = stack_top;
stack_top = entry;
}
struct stack_entry *pop(void) {
struct stack_entry *entry;
entry = stack_top;
if(entry != NULL) stack_top = entry->previous;
return entry;
}
This is as robust and difficult to exploit as any other normal code.
If you're not implementing a stack in C, but are implementing a C compiler (in any language), then..
It would be possible to create a C compiler that detects programming errors and generates safe code. For example, for every read or write the compiler could insert checks to ensure that the read or write is contained within one storage area (e.g. and you're not trying to write 4 bytes at the end of an array of char or something); where a signal is raised (e.g. "SIGSEGV") if one of these checks fail.
Due to the nature of C, these checks would involve scanning things like the heap, and it'd require more code to be insert to track the sizes of things on the stack.
The main reason this hasn't been implemented is that it'd create massive performance problems, and would therefore defeat the purpose of using C to begin with.
However, there are debugging tools (e.g. valgrind) that do this type of checking by running the application inside a virtual machine (where the virtual machine tracks the sizes of storage areas and is able to check reads/writes before they're performed).
I would like to divide a stack to stack-frames by looking on the raw data on the stack. I thought to do so by finding a "linked list" of saved EBP pointers.
Can I assume that a (standard and commonly used) C compiler (e.g. gcc) will always update and save EBP on a function call in the function prologue?
pushl %ebp
movl %esp, %ebp
Or are there cases where some compilers might skip that part for functions that don't get any parameters and don't have local variables?
The x86 calling conventions and the Wiki article on function prologue don't help much with that.
Is there any better method to divide a stack to stack frames just by looking on its raw data?
Thanks!
Some versions of gcc have a -fomit-frame-pointer optimization option. If memory serves, it can be used even with parameters/local variables (they index directly off of ESP instead of using EBP). Unless I'm badly mistaken, MS VC++ can do roughly the same.
Offhand, I'm not sure of a way that's anywhere close to universally applicable. If you have code with debug info, it's usually pretty easy -- otherwise though...
Even with the framepointer optimized out, stackframes are often distinguishable by looking through stack memory for saved return addresses instead. Remember that a function call sequence in x86 always consists of:
call someFunc ; pushes return address (instr. following `call`)
...
someFunc:
push EBP ; if framepointer is used
mov EBP, ESP ; if framepointer is used
push <nonvolatile regs>
...
so your stack will always - even if the framepointers are missing - have return addresses in there.
How do you recognize a return address ?
to start with, on x86, instruction have different lengths. That means return addresses - unlike other pointers (!) - tend to be misaligned values. Statistically 3/4 of them end not at a multiple of four.
Any misaligned pointer is a good candidate for a return address.
then, remember that call instructions on x86 have specific opcode formats; read a few bytes before the return address and check if you find a call opcode there (99% most of the time, it's five bytes back for a direct call, and three bytes back for a call through a register). If so, you've found a return address.
This is also a way to distinguish C++ vtables from return addresses by the way - vtable entrypoints you'll find on the stack, but looking "back" from those addresses you don't find call instructions.
With that method, you can get candidates for the call sequence out of the stack even without having symbols, framesize debugging information or anything.
The details of how to piece the actual call sequence together from those candidates are less straightforward though, you need a disassembler and some heuristics to trace potential call flows from the lowest-found return address all the way up to the last known program location. Maybe one day I'll blog about it ;-) though at this point I'd rather say that the margin of a stackoverflow posting is too small to contain this ...
I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can't rewind to "later" on the stack. Only come back earlier.
jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); //Â execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.
This is how I thought it:
Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.
That's for the initialization. Now, the following situation will indefinitely loop:
Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.
All this sounds good to me. However, there is a number of things I'm not exactly at ease with.
Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent—I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?
Thanks for reading my question textwall.
You are correct in that PUSHA wont work on x64 it will raise the exception #UD, as PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:
a:
call b
b:
pop rax
RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state that holds the following:
initarg
arg
registers (also contains the flags)
caller_registers
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func is a pointer to the coroutine function body.
This function does the following:
allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure
Now we have another function:
void* coroutine_next(coroutine_state cs, void* arg)
where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
store all current flags/registers in cs.caller_registers except for RSP, see step 3.
store the arg in cs.arg
fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, efectively resuming execution of the coroutine
Note that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame_pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.
store flags/registers in cs.registers
fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (lets just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip This essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned arg. Let's just say if arg is NULL, then the coroutine has terminated, otherwise it's an arbitrary data structure.
So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.
The coroutine's function itself is declared:
void my_coro(coroutine_state cs)
cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "thats easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know its not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
boost.coroutine (boost.context) at boost.org does all for you