Mechanism of the Boehm-Weiser Garbage Collector

I was reading the paper "Garbage Collection in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes the need to collect all pointer values held in processor registers (in addition to those on the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as Linux or macOS.
SetJmp

Since Boehm and Weiser actually implemented their GC, a basic source of information is the source code of that implementation (it is open source).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers into a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque), and setjmp() may be specially handled by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard to use correctly as it is). A piece of inline assembly seems much easier and safer.
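That said, the classic conservative-GC trick is to call setjmp() with a jmp_buf that lives on the stack: any callee-saved register holding a pointer is then spilled into stack memory and gets scanned along with the rest of the stack. A minimal sketch, assuming the collector recorded the stack base at startup and provides a hypothetical scan_range() that tests each word against the heap:

```c
#include <setjmp.h>

/* Hypothetical: scans each word in [lo, hi) and marks any that look
 * like pointers into the collector's heap. */
extern void scan_range(void *lo, void *hi);

/* Spill the callee-saved registers into a stack-resident jmp_buf,
 * then scan from it up to the recorded stack base (the stack grows
 * downward on x86_64). */
void scan_stack_and_registers(void *stack_base)
{
    jmp_buf regs;              /* lives on this function's stack frame */
    setjmp(regs);              /* copies preserved registers into regs */
    scan_range((void *)&regs, stack_base);
}
```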
The first hard part of the GC implementation is reliably detecting the start and end of the stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of the OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
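On Linux with glibc, one non-portable way to get the current thread's stack bounds is to query its pthread attributes (a sketch; pthread_getattr_np() is a GNU extension, not POSIX):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

/* Print the calling thread's stack range (glibc-specific). */
static void print_stack_bounds(void)
{
    pthread_attr_t attr;
    void *stack_addr;
    size_t stack_size;

    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        pthread_attr_getstack(&attr, &stack_addr, &stack_size);
        printf("stack: [%p, %p)\n",
               stack_addr, (void *)((char *)stack_addr + stack_size));
        pthread_attr_destroy(&attr);
    }
}
```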
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.
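For the generational/incremental case, the usual page-protection approach on POSIX systems is mprotect() plus a SIGSEGV handler that records dirty pages. A minimal sketch, assuming a hypothetical mark_page_dirty() for the collector's bookkeeping (a real barrier would also check that the faulting address lies inside the GC heap and forward unrelated faults):

```c
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

extern void mark_page_dirty(void *page);   /* hypothetical bookkeeping */

static long g_pagesz;

static void write_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(g_pagesz - 1));
    mark_page_dirty(page);
    /* Re-enable writes so the faulting store is retried and succeeds. */
    mprotect(page, (size_t)g_pagesz, PROT_READ | PROT_WRITE);
}

void install_write_barrier(void *heap, size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    g_pagesz = sysconf(_SC_PAGESIZE);
    mprotect(heap, len, PROT_READ);   /* writes to the heap now trap */
}
```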

I think on IA-64 (Itanium) they use the flushrs instruction to flush the register stack to its backing store in memory. I am sure someone on Stack Overflow will correct me if this is wrong.

It is not hard to implement a naive collector: it's just an algorithm after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse: that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and lost from the program temporarily (happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector: more or less standard C++ with a few hacks (using jmp_buf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if one CPU is busy, the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads, but the support is very flaky.
And note also that the Intel IA-64 (Itanium) processor has two stacks per thread (the memory stack and the register backing store)... a bit hard to account for this kind of thing generically.

Related

How come the stack cannot be increased during runtime in most operating systems?

Is it to avoid fragmentation? Or some other reason? A set lifetime for a memory allocation is a pretty useful construct, compared to malloc() which has a manual lifetime.
The space used for stack is increased and decreased frequently during program execution, as functions are called and return. The maximum space allowed for the stack is commonly a fixed limit.
In older computers, memory was a very limited resource, and it may still be in small devices. In these cases, the hardware capability may impose a necessary limit on the maximum stack size.
In modern systems with plenty of memory, a great deal may be available for stack, but the maximum allowed is generally set to some lower value. That provides a means of catching “runaway” programs. Generally, controlling how much stack space a program uses is a neglected part of software engineering. Limits have been set based on practical experience, but we could do better.1
Theoretically, a program could be given a small initial limit and could, if it finds itself working on a “large” problem, tell the operating system it intends to use more. That would still catch most “runaway” programs while allowing well crafted programs to use more space. However, by and large we design programs to use stack for program control (managing function calls and returns, along with a modest amount of space for local data) and other memory for the data being operated on. So stack space is largely a function of program design (which is fixed) rather than problem size. That model has been working well, so we keep using it.
Footnote
1 For example, compilers could report, for each routine that does not use objects with run-time variable size, the maximum space used by the routine in any path through it. Linkers or other tools could report, for any call tree [hence without loops] the maximum stack space used. Additional tools could help analyze stack use in call graphs with potential recursion.
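As a sketch of the idea above of telling the operating system you intend to use more stack: on POSIX systems a process can raise its own soft stack limit (up to the hard limit) with setrlimit(). Whether an already-running main stack can then actually grow to the new limit is OS-dependent, though on Linux it generally can:

```c
#include <sys/resource.h>

/* Raise the soft stack limit toward `wanted`, clamped to the hard limit.
 * Returns 0 on success, -1 on failure. */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (wanted > rl.rlim_max)
        wanted = rl.rlim_max;      /* cannot exceed the hard limit */
    rl.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rl);
}
```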
How come the stack cannot be increased during runtime in most operating systems?
This is wrong for Linux. On recent Linux systems, each thread has its own call stack (see pthreads(7)), and an application could (with clever tricks) increase some call stacks using mmap(2) and mremap(2) after querying the call stacks through /proc/ (see proc(5) and use /proc/self/maps), like pmap(1) does.
Of course, such code is architecture specific, since in some cases the call stack grows towards increasing addresses and in other cases towards decreasing addresses.
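For instance, here is a minimal (Linux-specific) sketch of finding the main thread's current stack range by scanning /proc/self/maps for the "[stack]" entry:

```c
#include <stdio.h>
#include <string.h>

/* Find the main thread's stack range on Linux; returns 0 on success. */
int find_main_stack(unsigned long *lo, unsigned long *hi)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]")) {   /* e.g. "7ffd...-7ffd... rw-p ... [stack]" */
            sscanf(line, "%lx-%lx", lo, hi);
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```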
Read also Operating Systems: Three Easy Pieces and the OSDEV wiki, and study the source code of GNU libc.
BTW, Appel's book Compiling with Continuations, his old paper Garbage Collection Can Be Faster than Stack Allocation, and this paper on Compiling with Continuations and LLVM could interest you, and all of them are very related to your question: sometimes there is almost "no call stack" and it makes no sense to "increase it".

Usage example of the nonexistent sstk() system call?

Keeping with the festivities of Stack Overflow's new logo design, I was curious what the point of sstk() was supposed to be in BSD and other UNIX-like operating systems?
According to the Linux kernel system call interface manpages, sstk(2) was supposed to:
...[change] the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.
However, also according to the manual:
This call is not supported in 4.3BSD or 4.4BSD or glibc or Linux or any other known Unix-like system. Some systems have a routine of this name that returns ENOSYS.
This can be seen by viewing glibc's sstk.c source.
My question is: why would one want to manually change the size of the stack? sbrk() and friends make sense, but is there any use in manually resizing the stack in your program?
As I first expressed in comments, the fact that the call is not supported any-known-where suggests that in fact there isn't much point to it, at least not any more.
The docs on linux.die.net attribute the function's heritage to BSD (though it's apparently not supported in modern BSD any more than it is anywhere else), and BSD traces its lineage to bona fide AT&T Unix. It may have made more sense in days when RAM was precious. In those days, you also might not have been able to rely on the stack size being increased automatically. Thus you might, say, enlarge the stack dynamically within a deeply recursive algorithm, and then shrink it back afterward.
Another plausible reason for explicitly growing the stack with a syscall would be so you can get a clean error indication if the request is too big, as opposed to the normal method of handling stack allocation failures (i.e. don't even try, if any allocation fails, just let the process crash).
It will be hard to know exactly how much stack space you need to perform some recursive operation, but you could make a reasonable guess and call sstk(guess * 10) just to be sure.

Avoiding stack overflows by allocating stack parts on the heap?

Is there a language where we can enable a mechanism that allocates new stack space on the heap when the original stack space is exceeded?
I remember doing a lab in my university where we fiddled with inline assembly in C to implement a heap-based extensible stack, so I know it should be possible in principle.
I understand it may be useful to get a stack overflow error when developing an app because it terminates a crazy infinite recursion quickly without making your system take lots of memory and begin to swap.
However, when you have a finished well-tested application that you want to deploy and you want it to be as robust as possible (say it's a pretty critical program running on a desktop computer), it would be nice to know it won't miserably fail on some other systems where the stack is more limited, where some objects take more space, or if the program is faced with a very particular case requiring more stack memory than in your tests.
I think it's because of these pitfalls that recursion is usually avoided in production code. But if we had a mechanism for automatic stack expansion in production code, we'd be able to write more elegant programs using recursion knowing it won't unexpectedly segfault while the system has 16 gigabytes of heap memory ready to be used...
There is precedent for it.
The runtime for GHC, a Haskell compiler, uses the heap instead of the stack. The stack is only used when you call into foreign code.
Google's Go implementation uses segmented stacks for goroutines, which enlarge the stack as necessary.
Mozilla's Rust used to use segmented stacks, although it was decided that it caused more problems than it solved (see [rust-dev] Abandoning segmented stacks in Rust).
If memory serves, some Scheme implementations put stack frames on the heap, then garbage collected the frames just like other objects.
In traditional programming styles for imperative languages, most code will avoid calling itself recursively. Stack overflows are rarely seen in the wild, and they're usually triggered by either sloppy programming or by malicious input--especially to recursive descent parsers and the like, which is why some parsers reject code when the nesting exceeds a threshold.
The traditional advice for avoiding stack overflows in production code:
Don't write recursive code. (Example: rewrite a search algorithm to use an explicit stack; see the sketch after this list.)
If you do write recursive code, prove that recursion is bounded. (Example: searching a balanced tree is bounded by the logarithm of the size of the tree.)
If you can't prove that it's bounded, add a bound to it. (Example: add a limit to the amount of nesting that a parser supports.)
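As an illustration of the first point, here is a minimal sketch of a tree search rewritten with an explicit, heap-allocated stack instead of recursion (the node layout is hypothetical):

```c
#include <stdlib.h>

struct node { int value; struct node *left, *right; };

/* Depth-first search using an explicit stack on the heap, so the
 * depth of the tree cannot overflow the call stack. */
int contains(struct node *root, int target)
{
    size_t cap = 64, top = 0;
    struct node **stack = malloc(cap * sizeof *stack);
    int found = 0;

    if (!stack)
        return 0;
    if (root)
        stack[top++] = root;
    while (top > 0 && !found) {
        struct node *n = stack[--top];
        if (n->value == target)
            found = 1;
        if (top + 2 > cap) {                       /* grow the explicit stack */
            struct node **tmp = realloc(stack, 2 * cap * sizeof *stack);
            if (!tmp) { free(stack); return 0; }
            stack = tmp;
            cap *= 2;
        }
        if (n->left)  stack[top++] = n->left;
        if (n->right) stack[top++] = n->right;
    }
    free(stack);
    return found;
}
```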
I don't believe you will find a language mandating this. But a particular implementation could offer such a mechanism, and depending on the operating system it can very well be that the runtime environment enlarges the stack automatically as needed.
According to gcc's documentation, gcc can generate code which does this, if you compile with the -fsplit-stack option:
-fsplit-stack
Generate code to automatically split the stack before it overflows.
The resulting program has a discontiguous stack which can only
overflow if the program is unable to allocate any more memory.
This is most useful when running threaded programs, as it is no
longer necessary to calculate a good stack size to use for each
thread. This is currently only implemented for the i386 and
x86_64 backends running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled
without -fsplit-stack, there may not be much stack space
available for the latter code to run. If compiling all code,
including library code, with -fsplit-stack is not an option,
then the linker can fix up these calls so that the code compiled
without -fsplit-stack always has a large stack. Support for
this is implemented in the gold linker in GNU binutils release 2.21
and later.
The LLVM code generation framework provides support for segmented stacks, which are used in the Go language and were originally used in Mozilla's Rust (although they were removed from Rust on the grounds that the execution overhead is too high for a "high-performance language"; see this mailing list thread).
Despite the rust-team's objections, segmented stacks are surprisingly fast, although the stack-thrashing problem can impact particular programs. (Some Go programs suffer from this issue, too.)
Another mechanism for heap-allocating stack segments in a relatively efficient way was proposed by Henry Baker in his 1994 paper Cheney on the MTA and became the basis of the run-time for Chicken Scheme, a compiled mostly R5RS-compatible scheme implementation.
Recursion is certainly not avoided in production code -- it's just used where and when appropriate.
If you're worried about it, the right answer may not simply be to switch to a manually-maintained stack in a vector or whatever -- though you can do that -- but to reorganize the logic. For example, the compiler I was working on replaced one deep recursive process with a worklist-driven process, since there wasn't a need to maintain strict nesting in the order of processing, only to ensure that variables we had a dependency upon were computed before being used.
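A minimal sketch of that worklist idea, with hypothetical item, deps_ready(), compute(), and enqueue_deps() helpers (a real pass would also deduplicate queue entries and bound the queue's capacity):

```c
#include <stddef.h>

struct item;
extern int  deps_ready(struct item *it);    /* have all inputs been computed?  */
extern void compute(struct item *it);       /* do the actual work for the item */
extern void enqueue_deps(struct item *it,   /* append missing inputs to queue  */
                         struct item **queue, size_t *count);

/* Process items in dependency order without recursion: items whose
 * inputs are not ready are pushed back after their dependencies. */
void process_worklist(struct item **queue, size_t count)
{
    size_t head = 0;

    while (head < count) {
        struct item *it = queue[head++];
        if (deps_ready(it)) {
            compute(it);
        } else {
            enqueue_deps(it, queue, &count);  /* dependencies run first...     */
            queue[count++] = it;              /* ...then this item is retried  */
        }
    }
}
```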
If you link with a thread library (e.g. pthreads in C), each thread has a separate stack. Those stacks are allocated one way or another, ultimately (in a UNIX environment) with brk or an anonymous mmap. These might or might not use the heap on the way.
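For example, with pthreads you can explicitly hand a thread a stack carved out of dynamically allocated memory (a sketch; worker() is a hypothetical thread function, and the stack must stay allocated until the thread has exited):

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

extern void *worker(void *arg);    /* hypothetical thread function */

/* Create a thread whose stack lives in dynamically allocated memory. */
int spawn_with_heap_stack(pthread_t *tid)
{
    enum { STACK_SIZE = 1 << 20 };      /* 1 MiB, above PTHREAD_STACK_MIN */
    long pagesz = sysconf(_SC_PAGESIZE);
    void *stack = NULL;
    pthread_attr_t attr;
    int rc;

    if (posix_memalign(&stack, (size_t)pagesz, STACK_SIZE) != 0)
        return -1;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, STACK_SIZE);
    rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```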
I note all the above answers refer to separate stacks; none explicitly says "on the heap" (in the C sense). I take it the poster simply means "from dynamically allocated memory" rather than the calling thread's processor stack.

Why does some part of an OS have to be written in assembly? [duplicate]

This question already has answers here:
Is assembly strictly required to make the "lowest" part of an operating system?
The scheduler of my mini OS is written in assembly, and I wonder why. I found out that the eret instruction can't be generated by the C compiler; is this something that can be generalized to platforms other than Nios, such as the x86 and/or MIPS architectures? I believe that some part of an OS is always written in assembly, and I'm trying to understand why a systems programmer must know assembly to write an operating system. Is it the case that there are built-in limitations of the C compiler, so that it can't generate certain assembly instructions, like the eret that returns the program to what it was doing after an interrupt?
The generic answer is for one of three reasons:
Because that particular type of code can't be written in C. I think eret is a "return from exception" instruction, so there is no C equivalent to this (because hardware exceptions such as page faults, divide by zero or similar are not C/C++ style exceptions). Another example may be saving the registers onto the stack when task-switching, and saving the stack pointer into the task-control block. The C code can't do that, because there is no direct access to the stack pointer.
Because the compiler won't produce as good code as someone clever writing assembler. Some specialized operations can be hard to write in C - the compiler may not generate very good code, or the code gets very convoluted to achieve something that is simple in assembler.
The startup of C code needs to be written in assembler, because a C program needs certain things set up before you can run actual C code. For example configuring the stack-pointer, and some other registers.
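As a small illustration of the first and third points, here is a sketch (GCC/Clang inline assembly, x86_64) of the kind of register access that plain C cannot express; in a real kernel such fragments would be part of the context-switch and startup code:

```c
/* Read the stack pointer; standard C has no way to name %rsp. */
static inline void *read_stack_pointer(void)
{
    void *sp;
    __asm__ volatile ("mov %%rsp, %0" : "=r"(sp));
    return sp;
}

/* Halt until the next interrupt; a privileged instruction with no C
 * equivalent (just as eret/iretq have no C equivalent). */
static inline void wait_for_interrupt(void)
{
    __asm__ volatile ("hlt");
}
```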
Yep, that is it. There are instructions that you cannot generate using the C language, and there are usually one or more such instructions required for an OS, so some assembly is required. This is true for pretty much any instruction set: x86, ARM, MIPS, and so on. C compilers let you use inline assembly to jam instructions in, but the language itself can't really handle the nuances of each instruction set and try to account for them. Some compilers add compiler-specific extensions, for example to return from a function using an interrupt flavor of return. It is so much easier to just write assembly where needed than to customize the language or the compilers, so there is really no demand for that.
The C language expresses the things it is specified to express: fundamental arithmetic operations, assignment of values to variables, and branches and function calls. Objects may be allocated using static, automatic (local), or dynamic (malloc) storage duration. If you want something outside this conceptual scope, you need something other than pure C.
The C language can be extended arbitrarily, and many platforms define syntax for things like defining a function or variable at a particular address.
But the hardware of the CPU cares about a lot of details, such as the values of flag registers. The part of the scheduler which switches threads needs to be able to save all the registers to memory before doing anything, because overwriting any register would lose essential data in the interrupted thread.
The only way to be able to write such a thing in C, would be for the compiler to provide a C function which generates the finely-tuned assembly. And then you're essentially back at square 1, because the important details are still at the level of the assembly code.
Vendors with multiple product lines of microcontrollers sometimes go out of their way to allow C source compatibility even at the lowest levels, to allow their customers to port code (or conversely, to prevent them from going to another vendor when they need to switch platforms). But the distinction between C and assembly blurs at a certain point, when you're calling pseudo-functions that generate specific instructions (known as intrinsics).
Some things that cannot be done in C or that, if they can be done, are better done in assembly because they are more straightforward and/or maintainable that way include:
Execute return-from-exception and return-from-interrupt instructions.
Read from and write to special processor registers (which control processor state, memory mapping, cache configuration, exception management, and more).
Perform atomic reads and writes to special addresses that are connections to hardware devices rather than memory.
Perform load and store instructions of particular sizes or characteristics to special addresses as described above. (E.g., writing to a certain device might require using only a store-16-bits instruction and not a regular store-32-bits instruction; see the sketch after this answer.)
Execute instructions for memory barriers or ordering, cache control, and flushing of memory maps.
Generally, C is mostly designed to do computations (read inputs, calculate things, write outputs) and not to control a machine (interact with all the controls and devices in the machine).
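A minimal sketch of the memory-mapped I/O case, with a hypothetical device register address; note that whether the compiler really emits a single 16-bit store for the volatile access is ultimately a compiler/ABI guarantee rather than a language one, which is why such code is often dropped to assembly:

```c
#include <stdint.h>

/* Hypothetical 16-bit device register mapped at a fixed address. */
#define DEVICE_REG ((volatile uint16_t *)0x4000A000u)

/* Compiler-level barrier (GCC/Clang): keeps the store from being
 * reordered or merged with surrounding memory accesses. */
static inline void io_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

static void write_device(uint16_t value)
{
    *DEVICE_REG = value;   /* volatile uint16_t access: a 16-bit store */
    io_barrier();
}
```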

Switching stacks efficiently

For certain reasons, I switch stacks when calling some functions in my application. I use makecontext()/getcontext()/swapcontext() for that purpose. However, I find it to be too slow.
I tried custom code for that purpose, which saves the stack pointer and other registers and then assigns the stack pointer the address of the new memory region which I want to use as a stack. However, I keep getting a "stack smashing detected" error.
Are there special permissions set for the stack by the OS, or what else is the matter here? How can I circumvent the problem?
The excellent GNU Pth library makes heavy use of these techniques. It's very well documented, and determines the most efficient context switching mechanism at compile time. edit: at configure time actually.
The author's paper: rse-pmt.ps gives a technical account of user-space context switching and related issues - alternative signal stacks, etc.
You could look at other software doing the same dirty tricks as you do. In particular Chicken Scheme. You might perhaps consider using longjmp after manually doing dirty things on the target jmp_buf. Of course, none of this is portable.
But please explain your overall goal in more detail. Your questions are generally too mysterious... (and that is off-putting to some).
The makecontext()/getcontext()/setcontext()/swapcontext() functions are quite efficient, as they merely save the current values of the processor registers. However, on Linux/glibc at least, setcontext() and getcontext() invoke the rt_sigprocmask() system call to save/restore the signal mask of the calling thread. This may be the reason why you are facing performance issues, as it triggers a switch into the kernel.
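For reference, a minimal, self-contained example of the swapcontext() mechanism being discussed (Linux/glibc; note that each getcontext()/swapcontext() pair also saves and restores the signal mask, which is where the rt_sigprocmask() cost comes from):

```c
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task(void)
{
    puts("running on the alternate stack");
    /* Returning transfers control to uc_link, i.e. back to main_ctx. */
}

int main(void)
{
    enum { STACK_SIZE = 64 * 1024 };

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
    task_ctx.uc_stack.ss_size = STACK_SIZE;
    task_ctx.uc_link = &main_ctx;          /* resume here when task() returns */
    makecontext(&task_ctx, task, 0);

    swapcontext(&main_ctx, &task_ctx);     /* switch onto the new stack */
    puts("back on the original stack");
    free(task_ctx.uc_stack.ss_sp);
    return 0;
}
```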

Resources