I was watching a video about CUDA and the Barnes-Hut algorithm where it was stated that it is necessary to place a depth limit on the tree for the GPU, and then the idea popped into my head about possibly doing recursion in the heap.
Basically, I am wondering just that: Is it possible to allocate memory from the heap and use that as a temporary "stack" in which to place function calls for the recursive function in question to somewhat delay a stack overflow?
If so, how could it be implemented, would we allocate space for a pointer to the function? I assume it would involve storing function address in the heap however I'm not too sure.
[edit] I just wanted to add that this is purely a theoretical question, and i would imagine that doing this would cause the program to slow down once using the heap.
[edit] As per request, the compiler I am using is GCC 4.8.4 on Ubuntu 14.04 (64-bit)
Sure. This is called continuation-passing style. The standard library supports it with setjmp() and longjmp(), and stores the information needed to restore control to an earlier point in a structure called jmp_buf. (There are several restrictions on where you can restore from.) You would store them in a stack, which is just a LIFO queue.
A more general approach is to run the program as a state machine and store the information needed to backtrack the program state, called a continuation, in a data structure called a trampoline. A common reason to want to do this is to get the equivalent of tail-recursion in an implementation that doesn’t optimize it and might chew up lots of stack space. One real-world application where someone I know is currently writing a trampoline is a GLL parser where the grammar is represented as a directed graph, the result of the parse is a shared packed parse forest, and the parser often needs to backtrack to try a different rule.
Continuation-passing and trampolines seem to be regarded as fancy style because they come from the world of functional programming, while longjmp() is regarded as an ugly low-level hack and even the Linux man page says not to use it.
You can simulate this by implementing your own heap-based stack as an array of structures, with each structure representing a stack frame that holds the equivalent of parameters and local variables. Instead of a function calling itself recursively, the function loops and each "call" explicitly pushes a new frame onto the stack.
I did exactly this years ago while attempting to solve a simple board game. The program was originally recursive, and it took forever to run. I changed it to the above structure, and this made it simple to make the app interruptible/restartable. When interrupted the app dumped its "stack" to a state file. When restarted, the app loaded the state file and continued where it left off.
This does required some care if the stack frame structure contains embedded pointers, but it's not insurmountable.
Related
I read that the dynamic link points to the previous activation record ( aka a "stack frame"), so it makes sense in dynamic scoped programming language. But in static scoped programming language, why the access link (which points to activation record of function in one lower nesting level) isn't enough?
And specifically in C - why access link is not needed? And why dynamic link is needed?
I will use this nomenclature which is more familiar to me:
Activation record: Stack frame
Dynamic link: [saved] frame pointer
So, I interpret your question as: Why are frame pointers needed?[1]
A frame pointer is not required.
Some compilers (e.g. Green Hills C++, GCC with -O2) don’t usually generate one or can be asked not to generate it (MSVC, GCC).
That said, it does of course have its benefits:
Easy traversing of the call stack: generating a stack trace is as easy as traversing a linked list where the frame pointer forms the head. Makes implementing stack traces and debuggers much easier.
Easier code generation: stack variables can be referenced by indexing the frame pointer instead of the all-the-time-changing stack pointer. The stack pointer changes with each push/pop, the frame pointer stays constant within a function (between the prologue/epilogue)
Should things go awry, stack unwinding can be done using the frame pointer. This is how Borland’s structured exception handling (SEH) works.
Streamlines stack management: Particularly implementations of setjmp(3), alloca(3) and C99-VLA may (and usually do) depend on it.
Drawbacks:
Register usage: a x86 got only 8 general purpose registers. one of these would need to be dedicated fully for holding the frame pointer.
Overhead: the prologue/epilogue is generated for every function.
But as you noticed, a compiler can generate perfectly fine code without having to maintain a frame pointer.
[1] If that's not what's meant, please elaborate.
Your question might be related to the -fomit-frame-pointer optimizing option of GCC, then see this.
BTW, many people are naming call frames (in the call stack) what you name activation record. The notion of continuation, and also of continuation passing style and A-normal forms are closely related.
A dynamic link is really only useful for nested functions (and perhaps closures), and standard C does not have them. Some people speak of display links. Standard C don't have nested functions, so don't need any related trick (display link, trampoline, ...).
The GCC compiler provides nested functions as a C language extension, and implement them with dynamic links on activation records, very closely to what you are thinking about. Read also the wikipages on man or boy test and on trampoline.
This question was prompted by studying the C language.
I have seen in my data structure course that in many cases where recursion proves to be a quick and easy solution (e.g. quicksort, traversal of binary search tree, etc.) it has been explicitly mentioned that using a self-created stack is a better idea.
The reason given is that recursion requires many function calls, and function calls are 'slower'.
But how does using a self-created stack prove any better as any function call makes use of the stack?
There are two real reasons that self-created stacks can be more efficient than the execution stack:
The execution stack is meant to handle a generalized case of calling new functions. That means it has a lot of overhead: it has to contain pointers to the preceding function, it has to contain pointers to values on the heap, and a number of other bookkeeping items. This may be more than you need for your specific calculation if your calculation is, indeed, specific. All the additional management decreases efficiency. In situations where the function is very heavy and there are relatively few calls, this is fine. In a situation where the function itself is simpler, but there are many function calls, the cost of overhead increases disproportionately.
A generalized stack hides a lot of details from you, preventing you from taking advantage of directly referencing a different part of the stack. For instance, the root of the stack is hidden from you. Lets say you're searching for a particular value in a large tree, using recursion. At some point you're a thousand nodes deep in the tree and you find the value. Success! But then you have to climb out of the tree one function at a time: meaning at least a thousand calls just to return the value. (*) Instead, if you've written your own stack you can return immediately. Or, suppose you have an algorithm that, at certain nodes in the tree, requires you to back up n stack frames before continuing execution. Using the generalized stack frame you are required to back out of those frames until you find the one you're looking for. If you designed the stack specifically for your algorithm, you can provide a mechanism to immediately jump to the point of execution in one instruction rather than n.
Thus, it can behoove you to write your own stack when you can either take advantage of throwing out parts of the generalized stack frame mechanism that you don't need but cost time, or the algorithm being written can take advantage of moving rapidly through the stack if it knows what it is doing (where a generalized stack 'protects' you from doing this by hiding it's abstraction). Remember that function calls are just a particular generalized abstraction for handling code: if for some reason they are adding a constraint that makes your code awkward, you can probably create a stripped down version that more directly addresses your need.
You might also create your own stack if the memory allotted to your stack is small compared to the number of times you must recurse, such as if you have a very large input domain or if you're running on specialized small-footprint hardware or a similar situation. Again, though, it depends on the algorithm you're running and how the generalized stack solution helps or hinders it.
(*) Tail recursion can often help, but because tail recursion is by definition only entering a stack frame one level deeper, I'm assuming you're talking about a situation where that is not strictly possible.
The self-created stack means that you push just a few of the important variables to the self-created stack.
When you use recursion, the function header and state of all variables will be pushed to the stack.
If the depth of the recursion is high, using case 2 means that the memory will be exhausted pretty quickly.
Generally a function call has some overhead before anything inside the function is done. The code generated for a function call basically ensures that you'll find everything like you left it when you, well, return; while it gives you at the same time a clean empty environment inside the called function. In fact this convenience is one of the most crucial services C provides, next to the standard library. (In many other respects C is a mere macro assembler -- did you ever look at a C source and the generated assembler side by side?).
In particular usually a few registers must be saved, and possibly parameters must be copied on the call stack. The effort required depends on the processor, compiler and calling convention. For example, parameters and return values may be in registers, not on the stack (but then the parameters must be saved anyway for each recursive call, don't they?).
The overhead is relatively large if the function is small; that's why inlining can be powerful. Inlining recursive function calls is similar to loop unrolling. I don't know whether current compilers do that on a regular basis (they might). But it's risky to rely on the compiler, so I would avoid recursive implementations of trivial functions, like computing the factorial, if speed is important.
Maybe using a self created stack was not necessarily recommended for performance reasons. One good reason I can think of is that the "regular" stack may be of fixed size (often 1MB), so for example sorting large amounts of data would cause a stack overflow.
Is there a language where we can enable a mechanism that allocates new stack space on the heap when the original stack space is exceeded?
I remember doing a lab in my university where we fiddled with inline assembly in C to implement a heap-based extensible stack, so I know it should be possible in principle.
I understand it may be useful to get a stack overflow error when developing an app because it terminates a crazy infinite recursion quickly without making your system take lots of memory and begin to swap.
However, when you have a finished well-tested application that you want to deploy and you want it to be as robust as possible (say it's a pretty critical program running on a desktop computer), it would be nice to know it won't miserably fail on some other systems where the stack is more limited, where some objects take more space, or if the program is faced with a very particular case requiring more stack memory than in your tests.
I think it's because of these pitfalls that recursion is usually avoided in production code. But if we had a mechanism for automatic stack expansion in production code, we'd be able to write more elegant programs using recursion knowing it won't unexpectedly segfault while the system has 16 gigabytes of heap memory ready to be used...
There is precedent for it.
The runtime for GHC, a Haskell compiler, uses the heap instead of the stack. The stack is only used when you call into foreign code.
Google's Go implementation uses segmented stacks for goroutines, which enlarge the stack as necessary.
Mozilla's Rust used to use segmented stacks, although it was decided that it caused more problems than it solved (see [rust-dev] Abandoning segmented stacks in Rust).
If memory serves, some Scheme implementations put stack frames on the heap, then garbage collected the frames just like other objects.
In traditional programming styles for imperative languages, most code will avoid calling itself recursively. Stack overflows are rarely seen in the wild, and they're usually triggered by either sloppy programming or by malicious input--especially to recursive descent parsers and the like, which is why some parsers reject code when the nesting exceeds a threshold.
The traditional advice for avoiding stack overflows in production code:
Don't write recursive code. (Example: rewrite a search algorithm to use an explicit stack.)
If you do write recursive code, prove that recursion is bounded. (Example: searching a balanced tree is bounded by the logarithm of the size of the tree.)
If you can't prove that it's unbounded, add a bound to it. (Example: add a limit to the amount of nesting that a parser supports.)
I don't believe you will find a language mandating this. But a particular implementation could offer such a mechanism, and depending on the operating system it can very well be that the runtime environment enlarges the stack automatically as needed.
According to gcc's documentation, gcc can generate code which does this, if you compile with the -fsplit_stack option:
-fsplit-stack
Generate code to automatically split the stack before it overflows.
The resulting program has a discontiguous stack which can only
overflow if the program is unable to allocate any more memory.
This is most useful when running threaded programs, as it is no
longer necessary to calculate a good stack size to use for each
thread. This is currently only implemented for the i386 and
x86_64 backends running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled
without -fsplit-stack, there may not be much stack space
available for the latter code to run. If compiling all code,
including library code, with -fsplit-stack is not an option,
then the linker can fix up these calls so that the code compiled
without -fsplit-stack always has a large stack. Support for
this is implemented in the gold linker in GNU binutils release 2.21
and later.
The llvm code generation framework provides support for segmented stacks, which are used in the go language and were originally used in Mozilla's rust (although they were removed from rust on the grounds that the execution overhead is too high for a "high-performance language". (See this mailing list thread)
Despite the rust-team's objections, segmented stacks are surprisingly fast, although the stack-thrashing problem can impact particular programs. (Some Go programs suffer from this issue, too.)
Another mechanism for heap-allocating stack segments in a relatively efficient way was proposed by Henry Baker in his 1994 paper Cheney on the MTA and became the basis of the run-time for Chicken Scheme, a compiled mostly R5RS-compatible scheme implementation.
Recursion is certainly not avoided in production code -- it's just used where and when appropriate.
If you're worried about it, the right answer may not simply be to switch to a manually-maintained stack in a vector or whatever -- though you can do that -- but to reorganize the logic. For example, the compiler I was working on replaced one deep recursive process with a worklist-driven process, since there wasn't a need to maintain strict nesting in the order of processing, only to ensure that variables we had a dependency upon were computed before being used.
If you link with a thread library (e.g. pthreads in C), each thread has a separate stack. Those stacks are allocated one way or another, ultimately (in a UNIX environment) with brk or an anonymous mmap. These might or might not use the heap on the way.
I note all the above answers refer to separate stacks; none explicitly says "on the heap" (in the C sense). I am taking it the poster simply means "from dynamically allocated memory" rather than the calling processor stack.
(I think there's a high chance of this question either being a duplicate or otherwise answered here already, but searching for the answer is hard thanks to interference from "stack allocation" and related terms.)
I have a toy compiler I've been working on for a scripting language. In order to be able to pause the execution of a script while it's in progress and return to the host program, it has its own stack: a simple block of memory with a "stack pointer" variable that gets incremented using the normal C code operations for that sort of thing and so on and so forth. Not interesting so far.
At the moment I compile to C. But I'm interested in investigating compiling to machine code as well - while keeping the secondary stack and the ability to return to the host program at predefined control points.
So... I figure it's not likely to be a problem to use the conventional stack registers within my own code, I assume what happens to registers there is my own business as long as everything is restored when it's done (do correct me if I'm wrong on this point). But... if I want the script code to call out to some other library code, is it safe to leave the program using this "virtual stack", or is it essential that it be given back the original stack for this purpose?
Answers like this one and this one indicate that the stack isn't a conventional block of memory, but that it relies on special, system specific behaviour to do with page faults and whatnot.
So:
is it safe to move the stack pointers into some other area of memory? Stack memory isn't "special"? I figure threading libraries must do something like this, as they create more stacks...
assuming any area of memory is safe to manipulate using the stack registers and instructions, I can think of no reason why it would be a problem to call any functions with a known call depth (i.e. no recursion, no function pointers) as long as that amount is available on the virtual stack. Right?
stack overflow is obviously a problem in normal code anyway, but would there be any extra-disastrous consequences to an overflow in such a system?
This is obviously not actually necessary, since simply returning the pointers to the real stack would be perfectly serviceable, or for that matter not abusing them in the first place and just putting up with fewer registers, and I probably shouldn't try to do it at all (not least due to being obviously out of my depth). But I'm still curious either way. Want to know how these sorts of things work.
EDIT: Sorry of course, should have said. I'm working on x86 (32-bit for my own machine), Windows and Ubuntu. Nothing exotic.
All of these answer are based on "common processor architectures", and since it involves generating assembler code, it has to be "target specific" - if you decide to do this on processor X, which has some weird handling of stack, below is obviously not worth the screensurface it's written on [substitute for paper]. For x86 in general, the below holds unless otherwise stated.
is it safe to move the stack pointers into some other area of memory?
Stack memory isn't "special"? I figure threading libraries
must do something like this, as they create more stacks...
The memory as such is not special. This does however assume that it's not on an x86 architecture where the stack segment is used to limit the stack usage. Whilst that is possible, it's rather rare to see in an implementation. I know that some years ago Nokia had a special operating system using segments in 32-bit mode. As far as I can think of right now, that's the only one I've got any contact with that uses the stack segment for as x86-segmentation mode describes.
Assuming any area of memory is safe to manipulate using the stack
registers and instructions, I can think of no reason why it would be a
problem to call any functions with a known call depth (i.e. no
recursion, no function pointers) as long as that amount is available
on the virtual stack. Right?
Correct. Just as long as you don't expect to be able to get back to some other function without switching back to the original stack. Limited level of recursion would also be acceptable, as long as the stack is deep enough [there are certain types of problems that are definitely hard to solve without recursion - binary tree search for example].
stack overflow is obviously a problem in normal code anyway,
but would there be any extra-disastrous consequences to an overflow in
such a system?
Indeed, it would be a tough bug to crack if you are a little unlucky.
I would suggest that you use a call to VirtualProtect() (Windows) or mprotect() (Linux etc) to mark the "end of the stack" as unreadable and unwriteable so that if your code accidentally walks off the stack, it crashes properly rather than some other more subtle undefined behaviour [because it's not guaranteed that the memory just below (lower address) is unavailable, so you could overwrite some other useful things if it does go off the stack, and that would cause some very hard to debug bugs].
Adding a bit of code that occassionally checks the stack depth (you know where your stack starts and ends, so it shouldn't be hard to check if a particular stack value is "outside the range" [if you give yourself some "extra buffer space" between the top of the stack and the "we're dead" zone that you protected - a "crumble zone" as they would call it if it was a car in a crash]. You can also fill the entire stack with a recognisable pattern, and check how much of that is "untouched".
Typically, on x86, you can use the existing stack without any problems so long as:
you don't overflow it
you don't increment the stack pointer register (with pop or add esp, positive_value / sub esp, negative_value) beyond what your code starts with (if you do, interrupts or asynchronous callbacks (signals) or any other activity using the stack will trash its contents)
you don't cause any CPU exception (if you do, the exception handling code might not be able to unwind the stack to the nearest point where the exception can be handled)
The same applies to using a different block of memory for a temporary stack and pointing esp to its end.
The problem with exception handling and stack unwinding has to do with the fact that your compiled C and C++ code contains some exception-handling-related data structures like the ranges of eip with the links to their respective exception handlers (this tells where the closest exception handler is for every piece of code) and there's also some information related to identification of the calling function (i.e. where the return address is on the stack, etc), so you can bubble up exceptions. If you just plug in raw machine code into this "framework", you won't properly extend these exception-handling data structures to cover it, and if things go wrong, they'll likely go very wrong (the entire process may crash or become damaged, despite you having exception handlers around the generated code).
So, yeah, if you're careful, you can play with stacks.
You can use any region you like for the processor's stack (modulo the memory protections).
Essentially, you simply load the ESP register ("MOV ESP, ...") with a pointer to the new area, however you managed to allocate it.
You have to have enough for your program, and whatever it might call (e.g., a Windows OS API), and whatever funny behaviours the OS has. You might be able to figure out how much space your code needs; a good compiler can easily do that. Figuring how much is needed by Windows is harder; you can always allocate "way too much" which is what Windows programs tend to do.
If you decide to manage this space tightly, you'll probably have to switch stacks to call Windows functions. That won't be enough; you'll likely get burned by various Windows surprises. I describe one of them here Windows: avoid pushing full x86 context on stack. I have mediocre solutions, but not good solutions for this.
I'm trying to implement a simple mark and sweep garbage collector in C. The first step of the algorithm is finding the roots. So my question is how can I find the roots in a C program?
In the programs using malloc, I'll be using the custom allocator. This custom allocator is all that will be called from the C program, and may be a custom init().
How does garbage collector knows what all the pointers(roots) are in the program? Also, given a pointer of a custom type how does it get all pointers inside that?
For example, if there's a pointer p pointing to a class list, which has another pointer inside it.. say q. How does garbage collector knows about it, so that it can mark it?
Update: How about if I send all the pointer names and types to GC when I init it? Similarly, the structure of different types can also be sent so that GC can traverse the tree. Is this even a sane idea or am I just going crazy?
First off, garbage collectors in C, without extensive compiler and OS support, have to be conservative, because you cannot distinguish between a legitimate pointer and an integer that happens to have a value that looks like a pointer. And even conservative garbage collectors are hard to implement. Like, really hard. And often, you will need to constrain the language in order to get something acceptable: for instance, it might be impossible to correctly collect memory if pointers are hidden or obfuscated. If you allocate 100 bytes and only keep a pointer to the tenth byte of the allocation, your GC is unlikely to figure out that you still need the block since it will see no reference to the beginning. Another very important constraint to control is the memory alignment: if pointers can be on unaligned memory, your collector can be slowed down by a factor of 10x or worse.
To find roots, you need to know where your stacks start, and where your stacks end. Notice the plural form: each thread has its own stack, and you might need to account for that, depending on your objectives. To know where a stack starts, without entering into platform-specific details (that I probably wouldn't be able to provide anyways), you can use assembly code inside the main function of the current thread (just main in a non-threaded executable) to query the stack register (esp on x86, rsp on x86_64 to name those two only). Gcc and clang support a language extension that lets you assign a variable permanently to a register, which should make it easy for you:
register void* stack asm("esp"); // replace esp with the name of your stack reg
(register is a standard language keyword that is most of the time ignored by today's compilers, but coupled with asm("register_name"), it lets you do some nasty stuff.)
To ensure you don't forget important roots, you should defer the actual work of the main function to another one. (On x86 platforms, you can also query ebp/rbp, the stack frame base pointers, instead, and still do your actual work in the main function.)
int main(int argc, const char** argv, const char** envp)
{
register void* stack asm("esp");
// put stack somewhere
return do_main(argc, argv, envp);
}
Once you enter your GC to do collection, you need to query the current stack pointer for the thread you've interrupted. You will need design-specific and/or platform-specific calls for that (though if you get something to execute on the same thread, the technique above will still work).
The actual hunt for roots starts now. Good news: most ABIs will require stack frames to be aligned on a boundary greater than the size of a pointer, which means that if you trust every pointer to be on aligned memory, you can treat your whole stack as a intptr_t* and check if any pattern inside looks like any of your managed pointers.
Obviously, there are other roots. Global variables can (theoretically) be roots, and fields inside structures can be roots too. Registers can also have pointers to objects. You need to separately account for global variables that can be roots (or forbid that altogether, which isn't a bad idea in my opinion) because automatic discovery of those would be hard (at least, I wouldn't know how to do it on any platform).
These roots can lead to references on the heap, where things can go awry if you don't take care.
Since not all platforms provide malloc introspection (as far as I know), you need to implement the concept of scanned memory--that is, memory that your GC knows about. It needs to know at least the address and the size of each of such allocation. When you get a reference to one of these, you simply scan them for pointers, just like you did for the stack. (This means that you should take care that your pointers are aligned. This is normally the case if you let your compiler do its job, but you still need to be careful when you use third-party APIs).
This also means that you cannot put references to collectable memory to places where the GC can't reach it. And this is where it hurts the most and where you need to be extra-careful. Otherwise, if your platform supports malloc introspection, you can easily tell the size of each allocation you get a pointer to and make sure you don't overrun them.
This just scratches the surface of the topic. Garbage collectors are extremely complex, even when single-threaded. When you add threads to the mix, you enter a whole new world of hurt.
Apple has implemented such a conservative GC for the Objective-C language and dubbed it libauto. They have open-sourced it, along with a good part of the low-level technologies of Mac OS X, and you can find the source here.
I can only quote Hot Licks here: good luck!
Okay, before I go even further, I forgot something very important: compiler optimizations can break the GC. If your compiler is not aware of your GC, it can very well never put certain roots on the stack (only dealing with them in registers), and you're going to miss them. This is not too problematic for single-threaded programs if you can inspect registers, but again, a huge mess for multithreaded programs.
Also be very careful about the interruptibility of allocations: you must make sure that your GC cannot kick in while you're returning a new pointer because it could collect it right before it is assigned to a root, and when your program resumes it would assign that new dangling pointer to your program.
And here's an update to address the edit:
Update: How about if I send all the pointer names and types to GC when
I init it? Similarly, the structure of different types can also be
sent so that GC can traverse the tree. Is this even a sane idea or am
I just going crazy?
I guess you could allocate our memory then register it with the GC to tell it that it should be a managed resource. That would solve the interruptability problem. But then, be careful about what you send to third-party libraries, because if they keep a reference to it, your GC might not be able to detect it since they won't register their data structures with your GC.
And you likely won't be able to do that with roots on the stack.
The roots are basically all static and automatic object pointers. Static pointers would be linked inside the load modules. Automatic pointers must be found by scanning stack frames. Of course, you have no idea where in the stack frames the automatic pointers are.
Once you have the roots you need to scan objects and find all the pointers inside them. (This would include pointer arrays.) For that you need to identify the class object and somehow extract from it information about pointer locations. Of course, in C many objects are not virtual and do not have a class pointer within them.
Good luck!!
Added: One technique that could vaguely make your quest possible is "conservative" garbage collection. Since you intend to have your own allocator, you can (somehow) keep track of allocation sizes and locations, so you can pick any pointer-sized chunk out of storage and ask "Might this possibly be a pointer to one of my objects?" You can, of course, never know for sure, since random data might "look like" a pointer to one of your objects, but still you can, through this mechanism, scan a chunk of storage (like a frame in the call stack, or an individual object) and identify all the possible objects it might address.
With a conservative collector you cannot safely do object relocation/compaction (where you modify pointers to objects as you move them) since you might accidentally modify "random" data that looks like an object pointer but is in fact meaningful data to some application. But you can identify unused objects and free up the space they occupy for reuse. With proper design it's possible to have a very effective non-compacting GC.
(However, if your version of C allows unaligned pointers scanning could be very slow, since you'd have to try every variation on byte alignment.)