Call C function with different stack pointer (gcc)

I'm looking for a way to call a C function in a different stack, i.e. save the current stack pointer, set the stack pointer to a different location, call the function and restore the old stack pointer when it returns.
The purpose of this is a lightweight threading system for a programming language. Threads will operate on very small stacks, check when more stack is needed and dynamically resize it. This is so that thousands of threads can be allocated without wasting a lot of memory. When calling in to C code it is not safe to use a tiny stack, since the C code does not know about checking and resizing, so I want to use a big pthread stack which is used only for calling C (shared between lightweight threads on the same pthread).
Now I could write assembly code stubs which will work fine, but I wondered if there is a better way to do this, such as a gcc extension or a library which already implements it. If not, then I guess I'll have my head buried in ABI and assembly language manuals ;-) I only ask this out of laziness and not wanting to reinvent the wheel.

Assuming you're using POSIX threads on a POSIX system, you can achieve this with signals. Set up an alternate signal-handling stack (sigaltstack) and designate one special real-time signal to have its handler run on the alternate stack. Then raise the signal to switch stacks, and have the signal handler read the function to call, and the argument to pass it, from thread-local data.
Note that this approach is fairly expensive (multiple system calls to change stacks), but should be 100% portable to POSIX systems. Since it's slow, you might want to make arch-specific call-on-alt-stack functions written in assembly, and only use my general solution as a fallback for archs where you haven't written an assembly version.

Related

Safe usage of `setjmp` and `longjmp`

I know people always say don't use longjmp, it's evil, it's dangerous.
But I think it can be useful for exiting deep recursions/nested function calls.
Is a single longjmp faster than a lot of repeated checks and returns like if(returnVal != SUCCESS) return returnVal;?
As for safety, as long as dynamic memory and other resources are released properly, there shouldn't be a problem, right?
So far it seems using longjmp isn't difficult and it even makes my code terser. I'm tempted to use it a lot.
(IMHO in many cases there is no dynamic memory/resources allocated within a deep recursion in the first place. Deep function call seems more common for data parsing/manipulation/validation. Dynamic allocation often happens at a higher level, before invoking the function where setjmp appears.)
setjmp and longjmp can be seen as a poor man's exception mechanism. BTW, OCaml exceptions are as quick as setjmp but have much clearer semantics.
Of course a longjmp is much faster than repeatedly returning error codes in intermediate functions, since it pops up a perhaps significant call stack portion.
(I am implicitly focusing on Linux)
They are valid and useful as long as no resources are allocated between them, including:
heap memory (malloc)
fopen-ing FILE* handles
opening operating system file descriptors (e.g. for sockets)
other operating system resources, such as timers or signal handlers
getting some external resource managed by some server, e.g. X11 windows (hence using any widget toolkit like GTK), or database handle or connection...
etc...
The main issue is that this no-resource-leak property is a global, whole-program property (or at least global to all functions possibly called between setjmp and longjmp), so it prohibits modular software development: any colleague who later improves code in any function called between setjmp and longjmp has to be aware of that limitation and follow that discipline.
Hence, if you use setjmp, document that very clearly.
BTW, if you only care about malloc, systematically using Boehm's conservative garbage collector would help a lot: you'd use GC_malloc instead of malloc everywhere and never call free, and in practice that is enough; then you can use setjmp without fear (since you could call GC_malloc between setjmp and longjmp).
(notice that the concepts and the terminology around garbage collector are quite related to exception handling and setjmp, but many people don't know them enough. Reading the Garbage Collection Handbook should be worthwhile)
Read also about RAII and learn about C++11 exceptions (and their relation to destructors). Learn a bit about continuations and CPS.
Read setjmp(3) and longjmp(3) (and also about sigsetjmp, siglongjmp, and setcontext(3)), and be aware that the compiler has to know about setjmp.
You should note that calling setjmp in some contexts is not guaranteed to be safe (for example, you can't portably store the return value of setjmp).
Also, any local variable of the function containing setjmp that may have been modified between the setjmp and the longjmp must be declared volatile if you want to read it after the jump; otherwise its value is indeterminate.
Using setjmp and longjmp is also useful because if the recursion causes a stack overflow, you can recover with a longjmp from the signal handler (don't forget to set up an alternate signal stack) and return an error instead. If you do that, consider using sigsetjmp and siglongjmp to preserve the signal mask.

How to get the size and the starting address of the stack per thread in POSIX C?

How can I get the size and the starting address of the stack per thread in POSIX C? Or if there's no standard POSIX way to do this, at least on Linux with gcc.
Some programs such as the Boehm GC must do this somehow, but I got quite confused reading their code. Can you give me some function names?
The "clean" but non-portable way to do this is to use the pthread_getattr_np (Linux/glibc, etc.) or similar function to obtain an attributes object for the thread in question, then pthread_attr_getstack to obtain the stack base/size. There is no portable way to do this, however, and there's essentially nothing portable you could do with the results anyway.
For the single-threaded case, just take the address of a local variable in the original and current frames.
Any address that lies between the current function's frame and main's frame must be in the stack.
Note that this only works for variables that actually live on the stack, so you may have to disable inlining for a handful of functions to guarantee their frames really exist.

Avoiding stack overflows by allocating stack parts on the heap?

Is there a language where we can enable a mechanism that allocates new stack space on the heap when the original stack space is exceeded?
I remember doing a lab in my university where we fiddled with inline assembly in C to implement a heap-based extensible stack, so I know it should be possible in principle.
I understand it may be useful to get a stack overflow error when developing an app because it terminates a crazy infinite recursion quickly without making your system take lots of memory and begin to swap.
However, when you have a finished well-tested application that you want to deploy and you want it to be as robust as possible (say it's a pretty critical program running on a desktop computer), it would be nice to know it won't miserably fail on some other systems where the stack is more limited, where some objects take more space, or if the program is faced with a very particular case requiring more stack memory than in your tests.
I think it's because of these pitfalls that recursion is usually avoided in production code. But if we had a mechanism for automatic stack expansion in production code, we'd be able to write more elegant programs using recursion knowing it won't unexpectedly segfault while the system has 16 gigabytes of heap memory ready to be used...
There is precedent for it.
The runtime for GHC, a Haskell compiler, uses the heap instead of the stack. The stack is only used when you call into foreign code.
Google's Go implementation uses segmented stacks for goroutines, which enlarge the stack as necessary.
Mozilla's Rust used to use segmented stacks, although it was decided that it caused more problems than it solved (see [rust-dev] Abandoning segmented stacks in Rust).
If memory serves, some Scheme implementations put stack frames on the heap, then garbage collected the frames just like other objects.
In traditional programming styles for imperative languages, most code will avoid calling itself recursively. Stack overflows are rarely seen in the wild, and they're usually triggered by either sloppy programming or by malicious input--especially to recursive descent parsers and the like, which is why some parsers reject code when the nesting exceeds a threshold.
The traditional advice for avoiding stack overflows in production code:
Don't write recursive code. (Example: rewrite a search algorithm to use an explicit stack.)
If you do write recursive code, prove that recursion is bounded. (Example: searching a balanced tree is bounded by the logarithm of the size of the tree.)
If you can't prove that it's bounded, add a bound to it. (Example: add a limit to the amount of nesting that a parser supports.)
I don't believe you will find a language mandating this. But a particular implementation could offer such a mechanism, and depending on the operating system it can very well be that the runtime environment enlarges the stack automatically as needed.
According to gcc's documentation, gcc can generate code which does this, if you compile with the -fsplit-stack option:
-fsplit-stack
Generate code to automatically split the stack before it overflows.
The resulting program has a discontiguous stack which can only
overflow if the program is unable to allocate any more memory.
This is most useful when running threaded programs, as it is no
longer necessary to calculate a good stack size to use for each
thread. This is currently only implemented for the i386 and
x86_64 backends running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled
without -fsplit-stack, there may not be much stack space
available for the latter code to run. If compiling all code,
including library code, with -fsplit-stack is not an option,
then the linker can fix up these calls so that the code compiled
without -fsplit-stack always has a large stack. Support for
this is implemented in the gold linker in GNU binutils release 2.21
and later.
The llvm code generation framework provides support for segmented stacks, which are used in the Go language and were originally used in Mozilla's Rust (although they were removed from Rust on the grounds that the execution overhead is too high for a "high-performance language"; see this mailing list thread).
Despite the Rust team's objections, segmented stacks are surprisingly fast, although the stack-thrashing problem can impact particular programs. (Some Go programs suffer from this issue, too.)
Another mechanism for heap-allocating stack segments in a relatively efficient way was proposed by Henry Baker in his 1994 paper Cheney on the MTA and became the basis of the runtime for Chicken Scheme, a compiled, mostly R5RS-compatible Scheme implementation.
Recursion is certainly not avoided in production code -- it's just used where and when appropriate.
If you're worried about it, the right answer may not simply be to switch to a manually-maintained stack in a vector or whatever -- though you can do that -- but to reorganize the logic. For example, the compiler I was working on replaced one deep recursive process with a worklist-driven process, since there wasn't a need to maintain strict nesting in the order of processing, only to ensure that variables we had a dependency upon were computed before being used.
If you link with a thread library (e.g. pthreads in C), each thread has a separate stack. Those stacks are allocated one way or another, ultimately (in a UNIX environment) with brk or an anonymous mmap. These might or might not use the heap on the way.
I note all the above answers refer to separate stacks; none explicitly says "on the heap" (in the C sense). I am taking it the poster simply means "from dynamically allocated memory" rather than the calling processor stack.

Which Unix don't have a thread-safe malloc?

I want my C program to be portable even on very old Unix OS but the problem is that I'm using pthreads and dynamic allocation (malloc). All Unix I know of have a thread-safe malloc (Linux, *BSD, Irix, Solaris) however this is not guaranteed by the C standard, and I'm sure there are very old versions where this is not true.
So, is there some list of platforms that I'd need to wrap malloc() calls with a mutex lock? I plan to write a ./configure test that checks if current platform is in that list.
The other alternative would be to test malloc() for thread-safety, but I know of no deterministic way to do this. Any ideas on this one too?
The only C standard that has threads (and is thus relevant to your question) is C11, which states:
For purposes of determining the existence of a data race, memory
allocation functions behave as though they accessed only memory
locations accessible through their arguments and not other static
duration storage.
Or in other words, as long as two threads don't pass the same address to realloc or free, all calls to the memory-allocation functions are thread-safe.
For POSIX, which covers all Unixes you will find nowadays, you have:
Each function defined in the System Interfaces volume of IEEE Std 1003.1-2001 is thread-safe unless explicitly stated otherwise.
I don't know where you get the assertion that malloc wouldn't be thread-safe on older Unixes; a system with threads whose malloc isn't thread-safe would be pretty much useless. What might be a problem on such an older system is performance, but it should always be functional.

Mechanism of the Boehm Weiser Garbage Collector

I was reading the paper "Garbage Collection in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes a need to collect all addresses from the processor registers (in addition to the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as Linux or macOS.
SetJmp
Since Boehm and Weiser actually implemented their GC, a basic source of information is the source code of that implementation (it is open source).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers in a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque) and setjmp() may be specially handled by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard as it is). A piece of inline assembly seems much easier and safer.
The first hard part in the GC implementation seems to be reliably detecting the start and end of the stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of the OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.
I think on IA-64 (not x86_64) they use the flushrs assembly instruction to flush the register stack out to its backing store in memory. I am sure someone on Stack Overflow will correct me if this is wrong.
It is not hard to implement a naive collector: it's just an algorithm after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse: that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and lost from the program temporarily (happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector and more or less standard C++ with a few hacks (using jmp_buf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if you have a busy CPU, the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads, but the support is very flaky.
And note also that the Intel IA-64 (Itanium) processor has two stacks per thread (a memory stack and a register backing store) ... a bit hard to account for this kind of thing generically.