Avoiding stack overflows by allocating stack parts on the heap?

Avoiding stack overflows by allocating stack parts on the heap? - c

Is there a language where we can enable a mechanism that allocates new stack space on the heap when the original stack space is exceeded?
I remember doing a lab in my university where we fiddled with inline assembly in C to implement a heap-based extensible stack, so I know it should be possible in principle.
I understand it may be useful to get a stack overflow error when developing an app because it terminates a crazy infinite recursion quickly without making your system take lots of memory and begin to swap.
However, when you have a finished well-tested application that you want to deploy and you want it to be as robust as possible (say it's a pretty critical program running on a desktop computer), it would be nice to know it won't miserably fail on some other systems where the stack is more limited, where some objects take more space, or if the program is faced with a very particular case requiring more stack memory than in your tests.
I think it's because of these pitfalls that recursion is usually avoided in production code. But if we had a mechanism for automatic stack expansion in production code, we'd be able to write more elegant programs using recursion knowing it won't unexpectedly segfault while the system has 16 gigabytes of heap memory ready to be used...

There is precedent for it.
The runtime for GHC, a Haskell compiler, uses the heap instead of the stack. The stack is only used when you call into foreign code.
Google's Go implementation uses segmented stacks for goroutines, which enlarge the stack as necessary.
Mozilla's Rust used to use segmented stacks, although it was decided that it caused more problems than it solved (see [rust-dev] Abandoning segmented stacks in Rust).
If memory serves, some Scheme implementations put stack frames on the heap, then garbage collected the frames just like other objects.
In traditional programming styles for imperative languages, most code will avoid calling itself recursively. Stack overflows are rarely seen in the wild, and they're usually triggered by either sloppy programming or by malicious input--especially to recursive descent parsers and the like, which is why some parsers reject code when the nesting exceeds a threshold.
The traditional advice for avoiding stack overflows in production code:
Don't write recursive code. (Example: rewrite a search algorithm to use an explicit stack.)
If you do write recursive code, prove that recursion is bounded. (Example: searching a balanced tree is bounded by the logarithm of the size of the tree.)
If you can't prove that it's unbounded, add a bound to it. (Example: add a limit to the amount of nesting that a parser supports.)

I don't believe you will find a language mandating this. But a particular implementation could offer such a mechanism, and depending on the operating system it can very well be that the runtime environment enlarges the stack automatically as needed.

According to gcc's documentation, gcc can generate code which does this, if you compile with the -fsplit_stack option:
-fsplit-stack
Generate code to automatically split the stack before it overflows.
The resulting program has a discontiguous stack which can only
overflow if the program is unable to allocate any more memory.
This is most useful when running threaded programs, as it is no
longer necessary to calculate a good stack size to use for each
thread. This is currently only implemented for the i386 and
x86_64 backends running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled
without -fsplit-stack, there may not be much stack space
available for the latter code to run. If compiling all code,
including library code, with -fsplit-stack is not an option,
then the linker can fix up these calls so that the code compiled
without -fsplit-stack always has a large stack. Support for
this is implemented in the gold linker in GNU binutils release 2.21
and later.
The llvm code generation framework provides support for segmented stacks, which are used in the go language and were originally used in Mozilla's rust (although they were removed from rust on the grounds that the execution overhead is too high for a "high-performance language". (See this mailing list thread)
Despite the rust-team's objections, segmented stacks are surprisingly fast, although the stack-thrashing problem can impact particular programs. (Some Go programs suffer from this issue, too.)
Another mechanism for heap-allocating stack segments in a relatively efficient way was proposed by Henry Baker in his 1994 paper Cheney on the MTA and became the basis of the run-time for Chicken Scheme, a compiled mostly R5RS-compatible scheme implementation.

Recursion is certainly not avoided in production code -- it's just used where and when appropriate.
If you're worried about it, the right answer may not simply be to switch to a manually-maintained stack in a vector or whatever -- though you can do that -- but to reorganize the logic. For example, the compiler I was working on replaced one deep recursive process with a worklist-driven process, since there wasn't a need to maintain strict nesting in the order of processing, only to ensure that variables we had a dependency upon were computed before being used.

If you link with a thread library (e.g. pthreads in C), each thread has a separate stack. Those stacks are allocated one way or another, ultimately (in a UNIX environment) with brk or an anonymous mmap. These might or might not use the heap on the way.
I note all the above answers refer to separate stacks; none explicitly says "on the heap" (in the C sense). I am taking it the poster simply means "from dynamically allocated memory" rather than the calling processor stack.

Related

How come the stack cannot be increased during runtime in most operating system?

Is it to avoid fragmentation? Or some other reason? A set lifetime for a memory allocation is a pretty useful construct, compared to malloc() which has a manual lifetime.

The space used for stack is increased and decreased frequently during program execution, as functions are called and return. The maximum space allowed for the stack is commonly a fixed limit.
In older computers, memory was a very limited resource, and it may still be in small devices. In these cases, the hardware capability may impose a necessary limit on the maximum stack size.
In modern systems with plenty of memory, a great deal may be available for stack, but the maximum allowed is generally set to some lower value. That provides a means of catching “runaway” programs. Generally, controlling how much stack space a program uses is a neglected part of software engineering. Limits have been set based on practical experience, but we could do better.1
Theoretically, a program could be given a small initial limit and could, if it finds itself working on a “large” problem, tell the operating system it intends to use more. That would still catch most “runaway” programs while allowing well crafted programs to use more space. However, by and large we design programs to use stack for program control (managing function calls and returns, along with a modest amount of space for local data) and other memory for the data being operated on. So stack space is largely a function of program design (which is fixed) rather than problem size. That model has been working well, so we keep using it.
Footnote
1 For example, compilers could report, for each routine that does not use objects with run-time variable size, the maximum space used by the routine in any path through it. Linkers or other tools could report, for any call tree [hence without loops] the maximum stack space used. Additional tools could help analyze stack use in call graphs with potential recursion.

How come the stack cannot be increased during runtime in most operating system?
This is wrong for Linux. On recent Linux systems, each thread has its own call stack (see pthreads(7)), and an application could (with clever tricks) increase some call stacks using mmap(2) and mremap(2) after querying the call stacks thru /proc/ (see proc(5) and use /proc/self/maps) like e.g. pmap(1) does.
Of course, such code is architecture specific, since in some cases the call stack grows towards increasing addresses and in other cases towards decreasing addresses.
Read also Operating Systems: Three Easy Pieces and the OSDEV wiki, and study the source code of GNU libc.
BTW, Appel's book Compiling with Continuations, his old paper Garbage Collection can be faster than Stack Allocation and this paper on Compiling with Continuations and LLVM could interest you, and both are very related to your question: sometimes, there is almost "no call stack" and it makes no sense to "increase it".

Is using canaries for bss or data-sections to detect overflows/smashing useful?

In our GCC-based C embedded system we are using the -ffunction-sections and -fdata-sections options to allow the linker, when linking the final executable, to remove unused (unreferenced) sections. This works well since years.
In the same system most of the data-structures and buffers are allocated statically (often as static-variables at file-scope).
Of course we have bugs, sometimes nasty ones, where we would like to quickly exclude the possibility of buffer-overflows.
One idea we have is to place canaries in between each bss-section and data-section - each one presenting exactly one symbol (because of -fdata-sections). Like the compiler is doing for functions-stacks when Stack-Smashing and StackProtection is activated. Checking these canaries could be done from the host by reading the canary-addresses "from time to time".
It seems that modifying the linker-script (placing manually the section and adding a canary-word in between) seems feasible, but does it make sense?
Is there a project or an article in the wild? Using my keywords I couldn't find anything.

Canaries are mostly useful for the stack, since it expands and collapses beyond the programmer's direct control. The things you have on data/bss do not behave like that. Either they are static variables, or in case they are buffers, they should keep within their fixed size, which should be checked with defensive programming in-place with the algorithm, rather than unorthodox tricks.
Also, stack canaries are used specifically in RAM-based, PC-like systems that don't know any better way. In embedded systems, they aren't very meaningful. Some useful things you can do instead:
Memory map the stack so that it grows into a memory area where writes will yield a hardware exception. Like for example, if your MCU has the ability to separate executable memory from data memory and yield exceptions if you try to execute code in the data area, or write to the executable area.
Ensure that everything in your program dealing with buffers perform their error checks and not write out-of-bounds. Static analysis tools are usually decent at spotting out-of-bounds bugs. Even some compilers can do this.
Add lots of defensive programming with static asserts. Check sizes of structs, buffers etc at compile-time, it's free.
Run-time defensive programming. For example if(x==good) {...} else if(x == bad) {... } is missing an else. And switch(x) case A: { ... } is missing a default. "But it can't go there in theory!" No but in practice, when you get runaway code caused by bugs (very likely), data retention of flash (100% likely) or EMI influence on RAM (quite unlikely).
And so on.

Usage example of the nonexistent sstk() system call?

Keeping with the festivities of Stackoverflow's new logo design, I was curious what the point of sstk() was supposed to be in BSD and other UNIX-Like operating systems?
According to the Linux kernel system call interface manpages, sstk(2) was supposed to:
...[change] the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.
However, also according to the manual:
This call is not supported in 4.3BSD or 4.4BSD or glibc or Linux or any other known Unix-like system. Some systems have a routine of this name that returns ENOSYS.
Which can be noticed by viewing glibc's sstk.c source
My question is, why would one want to manually change the size of the stack? sbrk() and friends make sense, but is there any use in manually re-sizing the stack size in your program manually?

As I first expressed in comments, the fact that the call is not supported any-known-where suggests that in fact there isn't much point to it, at least not any more.
The docs on linux.die.net attribute the function's heritage to BSD (though it's apparently not supported in modern BSD any more than it is anywhere else), and BSD traces its lineage to bona fide AT&T Unix. It may have made more sense in days when RAM was precious. In those days, you also might not have been able to rely on the stack size being increased automatically. Thus you might, say, enlarge the stack dynamically within a deeply recursive algorithm, and then shrink it back afterward.

Another plausible reason for explicitly growing the stack with a syscall would be so you can get a clean error indication if the request is too big, as opposed to the normal method of handling stack allocation failures (i.e. don't even try, if any allocation fails, just let the process crash).
It will be hard to know exactly how much stack space you need to perform some recursive operation, but you could make a reasonable guess and sstk(guess*10) just to be sure.

Catching stack overflow

What's the best way to catch stack overflow in C?
More specifically:
A C program contains an interpreter for a scripting language.
Scripts are not trusted, and may contain infinite recursion bugs. The interpreter has to be able to catch these and smoothly continue. (Obviously this can partly be handled by using a software stack, but performance is greatly improved if substantial chunks of library code can be written in C; at a minimum, this entails C functions running over recursive data structures created by scripts.)
The preferred form of catching a stack overflow would involve longjmp back to the main loop. (It's perfectly okay to discard all data that was held in stack frames below the main loop.)
The fallback portable solution is to use addresses of local variables to monitor the current stack depth, and for every recursive function to contain a call to a stack checking function that uses this method. Of course, this incurs some runtime overhead in the normal case; it also means if I forget to put the stack check call in one place, the interpreter will have a latent bug.
Is there a better way of doing it? Specifically, I'm not expecting a better portable solution, but if I had a system specific solution for Linux and another one for Windows, that would be okay.
I've seen references to something called structured exception handling on Windows, though the references I've seen have been about translating this into the C++ exception handling mechanism; can it be accessed from C, and if so is it useful for this scenario?
I understand Linux lets you catch a segmentation fault signal; is it possible to reliably turn this into a longjmp back to your main loop?
Java seems to support catching stack overflow exceptions on all platforms; how does it implement this?

Off the top of my head, one way to catch excessive stack growth is to check the relative difference in addresses of stack frames:
#define MAX_ROOM (64*1024*1024UL) // 64 MB
static char * first_stack = NULL;
void foo(...args...)
{
char stack;
// Compare addresses of stack frames
if (first_stack == NULL)
first_stack = &stack;
if (first_stack > &stack && first_stack - &stack > MAX_ROOM ||
&stack > first_stack && &stack - first_stack > MAX_ROOM)
printf("Stack is larger than %lu\n", (unsigned long)MAX_ROOM);
...code that recursively calls foo()...
}
This compares the address of the first stack frame for foo() to the current stack frame address, and if the difference exceeds MAX_ROOM it writes a message.
This assumes that you're on an architecture that uses a linear always-grows-down or always-grows-up stack, of course.
You don't have to do this check in every function, but often enough that excessively large stack growth is caught before you hit the limit you've chosen.

AFAIK, all mechanisms for detecting stack overflow will incur some runtime cost. You could let the CPU detect seg-faults, but that's already too late; you've probably already scribbled all over something important.
You say that you want your interpreter to call precompiled library code as much as possible. That's fine, but to maintain the notion of a sandbox, your interpreter engine should always be responsible for e.g. stack transitions and memory allocation (from the interpreted language's point of view); your library routines should probably be implemented as callbacks. The reason being that you need to be handling this sort of thing at a single point, for reasons that you've already pointed out (latent bugs).
Things like Java deal with this by generating machine code, so it's simply a case of generating code to check this at every stack transition.

(I won't bother those methods depending on particular platforms for "better" solutions. They make troubles, by limiting the language design and usability, with little gain. For answers "just work" on Linux and Windows, see above.)
First of all, in the sense of C, you can't do it in a portable way. In fact, ISO C mandates no "stack" at all. Pedantically, it even seems when allocation of automatic objects failed, the behavior is literally undefined, as per Clause 4p2 - there is simply no guarantee what would happen when the calls nested too deep. You have to rely on some additional assumptions of implementation (of ISA or OS ABI) to do that, so you end up with C + something else, not only C. Runtime machine code generation is also not portable in C level.
(BTW, ISO C++ has a notion of stack unwinding, but only in the context of exception handling. And there is still no guarantee of portable behavior on stack overflow; though it seems to be unspecified, not undefined.)
Besides to limit the call depth, all ways have some extra runtime cost. The cost would be quite easily observable unless there are some hardware-assisted means to amortize it down (like page table walking). Sadly, this is not the case now.
The only portable way I find is to not rely on the native stack of underlying machine architecture. This in general means you have to allocate the activation record frames as part of the free store (on the heap), rather than the native stack provided by ISA. This does not only work for interpreted language implementations, but also for compiled ones, e.g. SML/NJ. Such software stack approach does not always incur worse performance because they allow providing higher level abstraction in the object language so the programs may have more opportunities to be optimized, though it is not likely in a naive interpreter.
You have several options to achieve this. One way is to write a virtual machine. You can allocate memory and build the stack in it.
Another way is to write sophisticated asynchronous style code (e.g. trampolines, or CPS transformation) in your implementation instead, relying on less native call frames as possible. It is generally difficult to get right, but it works. Additional capabilities enabled by such way are easier tail call optimization and easier first-class continuation capture.

Mechanism of the Boehm Weiser Garbage Collector

I was reading the paper "Garbage Collector in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes a need to collect all addresses from the processor (in addition to the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as linux or mac.
SetJmp

Since Boehm and Weiser actually implemented their GC, then a basic source of information is the source code of that implementation (it is opensource).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers in a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque) and setjmp() may be specially handled by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard as it is). A piece of inline assembly seems much easier and safer.
The first hard part in the GC implementation seems to be able to reliably detect the start and end of stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.

I think on x86_86 they use the flushrs assembly instruction to put the registers on the stack. I am sure someone on stack overflow will correct me if this is wrong.

It is not hard to implement a naive collector: it's just an algorithm after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse: that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and lost from the program temporarily (happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector and more or less standard C++ with few hacks (using jmpbuf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if you have a busy CPU the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads but the support is very flaky.
And note also the Intel i64 processor has two stacks per thread .. a bit hard to account for this kind of thing generically.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight