Usage example of the nonexistent sstk() system call? - c

Keeping with the festivities of Stackoverflow's new logo design, I was curious what the point of sstk() was supposed to be in BSD and other UNIX-Like operating systems?
According to the Linux kernel system call interface manpages, sstk(2) was supposed to:
...[change] the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.
However, also according to the manual:
This call is not supported in 4.3BSD or 4.4BSD or glibc or Linux or any other known Unix-like system. Some systems have a routine of this name that returns ENOSYS.
Which can be noticed by viewing glibc's sstk.c source.
My question is, why would one want to change the size of the stack manually? sbrk() and friends make sense, but is there any use in re-sizing the stack yourself from within a program?

As I first expressed in comments, the fact that the call is not supported any-known-where suggests that in fact there isn't much point to it, at least not any more.
The docs on linux.die.net attribute the function's heritage to BSD (though it's apparently not supported in modern BSD any more than it is anywhere else), and BSD traces its lineage to bona fide AT&T Unix. It may have made more sense in days when RAM was precious. In those days, you also might not have been able to rely on the stack size being increased automatically. Thus you might, say, enlarge the stack dynamically within a deeply recursive algorithm, and then shrink it back afterward.

Another plausible reason for explicitly growing the stack with a syscall would be so you can get a clean error indication if the request is too big, as opposed to the normal method of handling stack allocation failures (i.e. don't even try, if any allocation fails, just let the process crash).
It will be hard to know exactly how much stack space you need to perform some recursive operation, but you could make a reasonable guess and sstk(guess*10) just to be sure.

Related

How come the stack cannot be increased during runtime in most operating system?

Is it to avoid fragmentation? Or some other reason? A set lifetime for a memory allocation is a pretty useful construct, compared to malloc() which has a manual lifetime.
The space used for stack is increased and decreased frequently during program execution, as functions are called and return. The maximum space allowed for the stack is commonly a fixed limit.
In older computers, memory was a very limited resource, and it may still be in small devices. In these cases, the hardware capability may impose a necessary limit on the maximum stack size.
In modern systems with plenty of memory, a great deal may be available for stack, but the maximum allowed is generally set to some lower value. That provides a means of catching “runaway” programs. Generally, controlling how much stack space a program uses is a neglected part of software engineering. Limits have been set based on practical experience, but we could do better.1
Theoretically, a program could be given a small initial limit and could, if it finds itself working on a “large” problem, tell the operating system it intends to use more. That would still catch most “runaway” programs while allowing well crafted programs to use more space. However, by and large we design programs to use stack for program control (managing function calls and returns, along with a modest amount of space for local data) and other memory for the data being operated on. So stack space is largely a function of program design (which is fixed) rather than problem size. That model has been working well, so we keep using it.
Footnote
1 For example, compilers could report, for each routine that does not use objects with run-time variable size, the maximum space used by the routine in any path through it. Linkers or other tools could report, for any call tree [hence without loops] the maximum stack space used. Additional tools could help analyze stack use in call graphs with potential recursion.
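For what it's worth, the "tell the operating system it intends to use more" idea can be approximated on POSIX systems with the stack resource limit. This is only a sketch under assumptions: request_stack_limit is a made-up helper name, it uses getrlimit/setrlimit with RLIMIT_STACK (not sstk), and how a raised soft limit affects an already-running main thread is system-specific.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: make sure the soft stack limit is at least `wanted`
 * bytes. Returns 0 if the limit is already large enough or was raised,
 * and -1 if the request exceeds the hard limit or a system call fails.
 * The point is that an oversized request fails cleanly here instead of
 * crashing the process later. */
static int request_stack_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    /* Nothing to do if the current soft limit already covers the request. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot raise the soft limit past the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        return -1;

    rl.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

A program that knows it is about to start a deep recursion could call something like request_stack_limit(64UL * 1024 * 1024) first and fall back to a non-recursive path if it returns -1.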
How come the stack cannot be increased during runtime in most operating system?
This is wrong for Linux. On recent Linux systems, each thread has its own call stack (see pthreads(7)), and an application could (with clever tricks) increase some call stacks using mmap(2) and mremap(2) after querying the call stacks through /proc/ (see proc(5) and use /proc/self/maps), as e.g. pmap(1) does.
Of course, such code is architecture specific, since in some cases the call stack grows towards increasing addresses and in other cases towards decreasing addresses.
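Querying the stack through /proc/self/maps could look roughly like this. It is a Linux-only sketch: find_main_stack is a made-up name, and it assumes the kernel labels the initial thread's stack mapping with "[stack]" as described in proc(5).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (Linux only): look for the "[stack]" line in
 * /proc/self/maps and report the address range of that mapping.
 * Returns 0 on success, -1 if the file or the line cannot be found. */
static int find_main_stack(unsigned long *lo, unsigned long *hi)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = -1;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        /* Each line starts with "start-end" in hexadecimal. */
        if (strstr(line, "[stack]") != NULL &&
            sscanf(line, "%lx-%lx", lo, hi) == 2) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

When called from the initial thread, the address of a local variable should normally fall between the two values reported. Actually growing the mapping with mremap(2) is a separate and much more delicate step that is not shown here.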
Read also Operating Systems: Three Easy Pieces and the OSDEV wiki, and study the source code of GNU libc.
BTW, Appel's book Compiling with Continuations, his old paper Garbage Collection can be faster than Stack Allocation and this paper on Compiling with Continuations and LLVM could interest you; all of them are closely related to your question: sometimes, there is almost "no call stack" and it makes no sense to "increase it".

When is it more appropriate to use valloc() as opposed to malloc()?

C (and C++) include a family of dynamic memory allocation functions, most of which are intuitively named and easy to explain to a programmer with a basic understanding of memory. malloc() simply allocates memory, while calloc() allocates some memory and clears it eagerly. There are also realloc() and free(), which are pretty self-explanatory.
The manpage for malloc() also mentions valloc(), which allocates (size) bytes aligned to the page border.
Unfortunately, my background isn't thorough enough in low-level intricacies; what are the implications of allocating and using page border-aligned memory, and when is this appropriate as opposed to regular malloc() or calloc()?
The manpage for valloc contains an important note:
The function valloc() appeared in 3.0BSD. It is documented as being obsolete in 4.3BSD, and as legacy in SUSv2. It does not appear in POSIX.1-2001.
valloc is obsolete and nonstandard - to answer your question, it would never be appropriate to use in new code.
While there are some reasons to want to allocate aligned memory - this question lists a few good ones - it is usually better to let the memory allocator figure out which bit of memory to give you. If you are certain that you need your freshly-allocated memory aligned to something, use aligned_alloc (C11) or posix_memalign (POSIX) instead.
Allocations with page alignment usually are not done for speed - they're because you want to take advantage of some feature of your processor's MMU, which typically works with page granularity.
One example is if you want to use mprotect(2) to change the access rights on that memory. Suppose, for instance, that you want to store some data in a chunk of memory, and then make it read only, so that any buggy part of your program that tries to write there will trigger a segfault. Since mprotect(2) can only change permissions page by page (since this is what the underlying CPU hardware can enforce), the block where you store your data had better be page aligned, and its size had better be a multiple of the page size. Otherwise the area you set read-only might include other, unrelated data that still needs to be written.
Or, perhaps you are going to generate some executable code in memory and then want to execute it later. Memory you allocate by default probably isn't set to allow code execution, so you'll have to use mprotect to give it execute permission. Again, this has to be done with page granularity.
Another example is if you want to allocate memory now, but might want to mmap something on top of it later.
So in general, a need for page-aligned memory would relate to some fairly low-level application, often involving something system-specific. If you needed it, you'd know. (And as mentioned, you should allocate it not with valloc, but using posix_memalign, or perhaps an anonymous mmap.)
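To make the mprotect case concrete, here is a minimal sketch for a POSIX system. The helper names make_readonly_copy and release_readonly_copy are made up for illustration. It also assumes that mprotect works on memory obtained from posix_memalign, which is true on Linux but is not something POSIX promises (it only specifies mprotect for mmap'd memory).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes into a page-aligned block whose
 * size is a whole number of pages, then make the block read-only.
 * Returns NULL on failure; *mapped_len receives the protected size. */
static void *make_readonly_copy(const void *src, size_t len, size_t *mapped_len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t pg, total;
    void *p = NULL;

    if (page <= 0)
        return NULL;
    pg = (size_t)page;

    /* Round up so that no unrelated data shares the protected pages. */
    total = (len + pg - 1) / pg * pg;
    if (total == 0)
        total = pg;

    if (posix_memalign(&p, pg, total) != 0)
        return NULL;

    memcpy(p, src, len);

    if (mprotect(p, total, PROT_READ) != 0) {
        free(p);
        return NULL;
    }
    *mapped_len = total;
    return p;
}

/* Restore write access before handing the block back to the allocator. */
static int release_readonly_copy(void *p, size_t mapped_len)
{
    if (mprotect(p, mapped_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    free(p);
    return 0;
}
```

Any stray write into the returned block should then fault immediately instead of silently corrupting the data. If you would rather not make assumptions about the allocator, an anonymous mmap gives you page-aligned memory directly.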
First of all, valloc is obsolete, and memalign should be used instead.
Second thing it's not part of the C (C++) standard at all.
It's a special allocation which is aligned to _SC_PAGESIZE boundary.
When is it useful to use it? I guess never, unless you have some specific low-level requirement. If you needed it, you would know that you need it, since it's rarely useful (maybe just when trying some micro-optimizations or creating shared memory between processes).
The self-evident answer is that it is appropriate to use valloc when malloc is unsuitable (less efficient) for the application (virtual) memory usage pattern and valloc is better suited (more efficient). This will depend on the OS and libraries and architecture and application...
malloc traditionally allocated real memory from freed memory if available and by increasing the brk point if not, in which case it is cleared by the OS for security reasons.
calloc in a dumb implementation does a malloc and then (re)clears the memory, while a smart implementation would avoid reclearing newly allocated memory that is automatically cleared by the operating system.
valloc relates to virtual memory. In a virtual memory system using the file system, you can allocate a large amount of memory or filespace/swapspace, even more than physical memory, and it will be swapped in by pages so alignment is a factor. In Unix, creating a file of a specified size and adding/deleting pages is done using inodes to define the file, but actual disk blocks are not dealt with until needed, at which point they are created cleared. So I would expect a valloc system to increase the size of the data segment swap without actually allocating physical or swap pages, or running a for loop to clear it all - as the file and paging system does that as needed. Thus valloc should be a heck of a lot faster than malloc. But as with calloc, how particular idiotsyncratic *x/C flavours do it is up to them, and the valloc man page is totally unhelpful about these expectations.
Traditionally this was implemented with brk/sbrk. Of course in a virtual memory system, whether a paged or a segmented system, there is no real need for any of this brk/sbrk stuff and it is enough to simply write the last location in a file or address space to extend up to that point.
Re the allocation to page boundaries, that is not usually something the user wants or needs, but rather is usually something the system wants or needs.
A (probably more expensive) way to simulate valloc is to determine the page boundary and then call aligned_alloc or posix_memalign with this alignment spec.
The fact that valloc is deprecated or has been removed or is not required in some OS' doesn't mean that it isn't still useful and required for best efficiency in others. If it has been deprecated or removed, one would hope that there are replacements that are as efficient (but I wouldn't bet on it, and might, indeed have, written my own malloc replacement).
Over the last 40 years the tradeoffs of real and (once invented) virtual memory have changed periodically, and mainstream OS has tended to go for frills rather than efficiency, with programmers who don't have (time or space) efficiency as a major imperative. In the embedded systems, efficiency is more critical, but even there efficiency is often not well supported by the standard OS and/or tools. But when in doubt, you can roll your own malloc replacement for your application that does what you need, rather than depend on what someone else woke up and decided to do/implement, or to undo/deprecate.
So the real answer is you don't necessarily want to use valloc or malloc or calloc or any of the replacements your current subversion of an OS provides.

Avoiding stack overflows by allocating stack parts on the heap?

Is there a language where we can enable a mechanism that allocates new stack space on the heap when the original stack space is exceeded?
I remember doing a lab in my university where we fiddled with inline assembly in C to implement a heap-based extensible stack, so I know it should be possible in principle.
I understand it may be useful to get a stack overflow error when developing an app because it terminates a crazy infinite recursion quickly without making your system take lots of memory and begin to swap.
However, when you have a finished well-tested application that you want to deploy and you want it to be as robust as possible (say it's a pretty critical program running on a desktop computer), it would be nice to know it won't miserably fail on some other systems where the stack is more limited, where some objects take more space, or if the program is faced with a very particular case requiring more stack memory than in your tests.
I think it's because of these pitfalls that recursion is usually avoided in production code. But if we had a mechanism for automatic stack expansion in production code, we'd be able to write more elegant programs using recursion knowing it won't unexpectedly segfault while the system has 16 gigabytes of heap memory ready to be used...
There is precedent for it.
The runtime for GHC, a Haskell compiler, uses the heap instead of the stack. The stack is only used when you call into foreign code.
Google's Go implementation uses segmented stacks for goroutines, which enlarge the stack as necessary.
Mozilla's Rust used to use segmented stacks, although it was decided that it caused more problems than it solved (see [rust-dev] Abandoning segmented stacks in Rust).
If memory serves, some Scheme implementations put stack frames on the heap, then garbage collected the frames just like other objects.
In traditional programming styles for imperative languages, most code will avoid calling itself recursively. Stack overflows are rarely seen in the wild, and they're usually triggered by either sloppy programming or by malicious input--especially to recursive descent parsers and the like, which is why some parsers reject code when the nesting exceeds a threshold.
The traditional advice for avoiding stack overflows in production code:
Don't write recursive code. (Example: rewrite a search algorithm to use an explicit stack.)
If you do write recursive code, prove that recursion is bounded. (Example: searching a balanced tree is bounded by the logarithm of the size of the tree.)
If you can't prove that it's bounded, add a bound to it. (Example: add a limit to the amount of nesting that a parser supports.)
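As an illustration of the last point, here is a small sketch of a recursive parser that refuses input nested more deeply than a chosen limit. The grammar (parenthesised groups containing arbitrary other characters) and the names parse_seq and check_nesting are invented for this example.

```c
#include <assert.h>

/* Parse a sequence of items, where an item is either '(' sequence ')'
 * or any other single character. `depth` is the current nesting level.
 * Returns 0 on success, -1 if the input is unbalanced or too deep. */
static int parse_seq(const char **p, int depth, int max_depth)
{
    while (**p != '\0' && **p != ')') {
        if (**p == '(') {
            /* The bound: reject the input before recursing any deeper. */
            if (depth + 1 > max_depth)
                return -1;
            ++*p;
            if (parse_seq(p, depth + 1, max_depth) != 0)
                return -1;
            if (**p != ')')
                return -1;          /* missing closing parenthesis */
            ++*p;
        } else {
            ++*p;                   /* ordinary character */
        }
    }
    return 0;
}

/* Returns 0 if `s` is balanced and nested at most `max_depth` deep. */
static int check_nesting(const char *s, int max_depth)
{
    const char *p = s;

    if (parse_seq(&p, 0, max_depth) != 0)
        return -1;
    return *p == '\0' ? 0 : -1;     /* a stray ')' is left over otherwise */
}
```

With a limit like this, the maximum stack use of the parser is fixed by max_depth and no longer by whatever a hostile input file contains.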
I don't believe you will find a language mandating this. But a particular implementation could offer such a mechanism, and depending on the operating system it can very well be that the runtime environment enlarges the stack automatically as needed.
According to gcc's documentation, gcc can generate code which does this, if you compile with the -fsplit-stack option:
-fsplit-stack
Generate code to automatically split the stack before it overflows.
The resulting program has a discontiguous stack which can only
overflow if the program is unable to allocate any more memory.
This is most useful when running threaded programs, as it is no
longer necessary to calculate a good stack size to use for each
thread. This is currently only implemented for the i386 and
x86_64 backends running GNU/Linux.
When code compiled with -fsplit-stack calls code compiled
without -fsplit-stack, there may not be much stack space
available for the latter code to run. If compiling all code,
including library code, with -fsplit-stack is not an option,
then the linker can fix up these calls so that the code compiled
without -fsplit-stack always has a large stack. Support for
this is implemented in the gold linker in GNU binutils release 2.21
and later.
The LLVM code generation framework provides support for segmented stacks, which are used in the Go language and were originally used in Mozilla's Rust (although they were removed from Rust on the grounds that the execution overhead is too high for a "high-performance language"; see this mailing list thread).
Despite the rust-team's objections, segmented stacks are surprisingly fast, although the stack-thrashing problem can impact particular programs. (Some Go programs suffer from this issue, too.)
Another mechanism for heap-allocating stack segments in a relatively efficient way was proposed by Henry Baker in his 1994 paper Cheney on the MTA and became the basis of the run-time for Chicken Scheme, a compiled mostly R5RS-compatible scheme implementation.
Recursion is certainly not avoided in production code -- it's just used where and when appropriate.
If you're worried about it, the right answer may not simply be to switch to a manually-maintained stack in a vector or whatever -- though you can do that -- but to reorganize the logic. For example, the compiler I was working on replaced one deep recursive process with a worklist-driven process, since there wasn't a need to maintain strict nesting in the order of processing, only to ensure that variables we had a dependency upon were computed before being used.
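For comparison, this is roughly what the "manually-maintained stack" variant looks like for a simple tree walk. The node type and the tree_sum name are made up for the example. The pending nodes live in a heap array that grows with realloc, so the depth of the tree no longer affects the call stack.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Sum all values in the tree without recursion. Nodes still to be
 * visited are kept in a heap-allocated array used as a stack.
 * Returns 0 on success, -1 if memory for the array runs out. */
static int tree_sum(const struct node *root, long *sum_out)
{
    size_t cap = 16, len = 0;
    const struct node **stack = malloc(cap * sizeof *stack);
    long sum = 0;

    if (stack == NULL)
        return -1;
    if (root != NULL)
        stack[len++] = root;

    while (len > 0) {
        const struct node *n = stack[--len];
        sum += n->value;

        /* Make sure there is room for both children before pushing. */
        if (len + 2 > cap) {
            size_t new_cap = cap * 2;
            const struct node **grown = realloc(stack, new_cap * sizeof *grown);
            if (grown == NULL) {
                free(stack);
                return -1;
            }
            stack = grown;
            cap = new_cap;
        }
        if (n->left != NULL)
            stack[len++] = n->left;
        if (n->right != NULL)
            stack[len++] = n->right;
    }

    free(stack);
    *sum_out = sum;
    return 0;
}
```

Unlike a stack overflow, running out of memory here shows up as an ordinary error return that the caller can handle.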
If you link with a thread library (e.g. pthreads in C), each thread has a separate stack. Those stacks are allocated one way or another, ultimately (in a UNIX environment) with brk or an anonymous mmap. These might or might not use the heap on the way.
I note all the above answers refer to separate stacks; none explicitly says "on the heap" (in the C sense). I am taking it the poster simply means "from dynamically allocated memory" rather than the calling processor stack.

Allocating a new call stack

(I think there's a high chance of this question either being a duplicate or otherwise answered here already, but searching for the answer is hard thanks to interference from "stack allocation" and related terms.)
I have a toy compiler I've been working on for a scripting language. In order to be able to pause the execution of a script while it's in progress and return to the host program, it has its own stack: a simple block of memory with a "stack pointer" variable that gets incremented using the normal C code operations for that sort of thing and so on and so forth. Not interesting so far.
At the moment I compile to C. But I'm interested in investigating compiling to machine code as well - while keeping the secondary stack and the ability to return to the host program at predefined control points.
So... I figure it's not likely to be a problem to use the conventional stack registers within my own code, I assume what happens to registers there is my own business as long as everything is restored when it's done (do correct me if I'm wrong on this point). But... if I want the script code to call out to some other library code, is it safe to leave the program using this "virtual stack", or is it essential that it be given back the original stack for this purpose?
Answers like this one and this one indicate that the stack isn't a conventional block of memory, but that it relies on special, system specific behaviour to do with page faults and whatnot.
So:
is it safe to move the stack pointers into some other area of memory? Stack memory isn't "special"? I figure threading libraries must do something like this, as they create more stacks...
assuming any area of memory is safe to manipulate using the stack registers and instructions, I can think of no reason why it would be a problem to call any functions with a known call depth (i.e. no recursion, no function pointers) as long as that amount is available on the virtual stack. Right?
stack overflow is obviously a problem in normal code anyway, but would there be any extra-disastrous consequences to an overflow in such a system?
This is obviously not actually necessary, since simply returning the pointers to the real stack would be perfectly serviceable, or for that matter not abusing them in the first place and just putting up with fewer registers, and I probably shouldn't try to do it at all (not least due to being obviously out of my depth). But I'm still curious either way. Want to know how these sorts of things work.
EDIT: Sorry of course, should have said. I'm working on x86 (32-bit for my own machine), Windows and Ubuntu. Nothing exotic.
All of these answers are based on "common processor architectures", and since it involves generating assembler code, it has to be "target specific" - if you decide to do this on processor X, which has some weird handling of stack, the below is obviously not worth the screensurface it's written on [substitute for paper]. For x86 in general, the below holds unless otherwise stated.
is it safe to move the stack pointers into some other area of memory?
Stack memory isn't "special"? I figure threading libraries
must do something like this, as they create more stacks...
The memory as such is not special. This does however assume that it's not on an x86 architecture where the stack segment is used to limit the stack usage. Whilst that is possible, it's rather rare to see in an implementation. I know that some years ago Nokia had a special operating system using segments in 32-bit mode. As far as I can think of right now, that's the only one I've had any contact with that uses the stack segment the way the x86 segmentation model describes.
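One way to convince yourself of this without writing any assembler is to run a function on a heap-allocated block and look at where its locals end up. The sketch below uses the (obsolescent, but available in glibc) ucontext functions instead of loading the stack pointer by hand. The names on_alt_stack and runs_on_heap_stack are made up, and this assumes a system that still ships makecontext/swapcontext.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, alt_ctx;
static uintptr_t observed_addr;    /* address of a local seen on the other stack */

/* Runs on the alternate stack; just records where its local lives. */
static void on_alt_stack(void)
{
    volatile int marker = 0;
    observed_addr = (uintptr_t)&marker;
    /* Returning from here resumes main_ctx through uc_link. */
}

/* Run on_alt_stack on a malloc'd block used as its stack.
 * Returns 1 if its local was inside that block, 0 if not, -1 on error. */
static int runs_on_heap_stack(size_t stack_size)
{
    unsigned char *stack = malloc(stack_size);
    uintptr_t lo, hi;
    int inside;

    if (stack == NULL)
        return -1;
    if (getcontext(&alt_ctx) != 0) {
        free(stack);
        return -1;
    }
    alt_ctx.uc_stack.ss_sp = stack;
    alt_ctx.uc_stack.ss_size = stack_size;
    alt_ctx.uc_link = &main_ctx;
    makecontext(&alt_ctx, on_alt_stack, 0);

    /* Switch to the new stack; control comes back when on_alt_stack returns. */
    if (swapcontext(&main_ctx, &alt_ctx) != 0) {
        free(stack);
        return -1;
    }

    lo = (uintptr_t)stack;
    hi = lo + stack_size;
    inside = observed_addr >= lo && observed_addr < hi;
    free(stack);
    return inside;
}
```

On a typical Linux/glibc system, runs_on_heap_stack(64 * 1024) is expected to report that the local was inside the malloc'd block, which is essentially what a threading library arranges when it sets up a new thread.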
Assuming any area of memory is safe to manipulate using the stack
registers and instructions, I can think of no reason why it would be a
problem to call any functions with a known call depth (i.e. no
recursion, no function pointers) as long as that amount is available
on the virtual stack. Right?
Correct. Just as long as you don't expect to be able to get back to some other function without switching back to the original stack. Limited level of recursion would also be acceptable, as long as the stack is deep enough [there are certain types of problems that are definitely hard to solve without recursion - binary tree search for example].
stack overflow is obviously a problem in normal code anyway,
but would there be any extra-disastrous consequences to an overflow in
such a system?
Indeed, it would be a tough bug to crack if you are a little unlucky.
I would suggest that you use a call to VirtualProtect() (Windows) or mprotect() (Linux etc) to mark the "end of the stack" as unreadable and unwriteable so that if your code accidentally walks off the stack, it crashes properly rather than some other more subtle undefined behaviour [because it's not guaranteed that the memory just below (lower address) is unavailable, so you could overwrite some other useful things if it does go off the stack, and that would cause some very hard to debug bugs].
Adding a bit of code that occasionally checks the stack depth is also worthwhile. You know where your stack starts and ends, so it shouldn't be hard to check whether a particular stack value is "outside the range", especially if you give yourself some "extra buffer space" between the top of the stack and the "we're dead" zone that you protected (a "crumple zone", as they would call it if it was a car in a crash). You can also fill the entire stack with a recognisable pattern, and check how much of that is "untouched".
Typically, on x86, you can use the existing stack without any problems so long as:
you don't overflow it
you don't increment the stack pointer register (with pop or add esp, positive_value / sub esp, negative_value) beyond what your code starts with (if you do, interrupts or asynchronous callbacks (signals) or any other activity using the stack will trash its contents)
you don't cause any CPU exception (if you do, the exception handling code might not be able to unwind the stack to the nearest point where the exception can be handled)
The same applies to using a different block of memory for a temporary stack and pointing esp to its end.
The problem with exception handling and stack unwinding has to do with the fact that your compiled C and C++ code contains some exception-handling-related data structures like the ranges of eip with the links to their respective exception handlers (this tells where the closest exception handler is for every piece of code) and there's also some information related to identification of the calling function (i.e. where the return address is on the stack, etc), so you can bubble up exceptions. If you just plug in raw machine code into this "framework", you won't properly extend these exception-handling data structures to cover it, and if things go wrong, they'll likely go very wrong (the entire process may crash or become damaged, despite you having exception handlers around the generated code).
So, yeah, if you're careful, you can play with stacks.
You can use any region you like for the processor's stack (modulo the memory protections).
Essentially, you simply load the ESP register ("MOV ESP, ...") with a pointer to the new area, however you managed to allocate it.
You have to have enough for your program, and whatever it might call (e.g., a Windows OS API), and whatever funny behaviours the OS has. You might be able to figure out how much space your code needs; a good compiler can easily do that. Figuring how much is needed by Windows is harder; you can always allocate "way too much" which is what Windows programs tend to do.
If you decide to manage this space tightly, you'll probably have to switch stacks to call Windows functions. That won't be enough; you'll likely get burned by various Windows surprises. I describe one of them here Windows: avoid pushing full x86 context on stack. I have mediocre solutions, but not good solutions for this.

Why is malloc really non-deterministic? (Linux/Unix)

malloc is not guaranteed to return 0'ed memory. The conventional wisdom is not only that, but that the contents of the memory malloc returns are actually non-deterministic, e.g. openssl used them for extra randomness.
However, as far as I know, malloc is built on top of brk/sbrk, which do "return" 0'ed memory. I can see why the contents of what malloc returns may be non-0, e.g. from previously free'd memory, but why would they be non-deterministic in "normal" single-threaded software?
Is the conventional wisdom really true (assuming the same binary and libraries)
If so, Why?
Edit Several people answered explaining why the memory can be non-0, which I already explained in the question above. What I'm asking is why the program using the contents of what malloc returns may be non-deterministic, i.e. why it could have different behavior every time it's run (assuming the same binary and libraries). Non-deterministic behavior is not implied by non-0's. To put it differently: why it could have different contents every time the binary is run.
Malloc does not guarantee unpredictability... it just doesn't guarantee predictability.
E.g. Consider that
return 0;
Is a valid implementation of malloc.
The initial values of memory returned by malloc are unspecified, which means that the specifications of the C and C++ languages put no restrictions on what values can be handed back. This makes the language easier to implement on a variety of platforms. While it might be true that in Linux malloc is implemented with brk and sbrk and the memory should be zeroed (I'm not even sure that this is necessarily true, by the way), on other platforms, perhaps an embedded platform, there's no reason that this would have to be the case. For example, an embedded device might not want to zero the memory, since doing so costs CPU cycles and thus power and time. Also, in the interest of efficiency, for example, the memory allocator could recycle blocks that had previously been freed without zeroing them out first. This means that even if the memory from the OS is initially zeroed out, the memory from malloc needn't be.
The conventional wisdom that the values are nondeterministic is probably a good one because it forces you to realize that any memory you get back might have garbage data in it that could crash your program. That said, you should not assume that the values are truly random. You should, however, realize that the values handed back are not magically going to be what you want. You are responsible for setting them up correctly. Assuming the values are truly random is a Really Bad Idea, since there is nothing at all to suggest that they would be.
If you want memory that is guaranteed to be zeroed out, use calloc instead.
Hope this helps!
malloc is defined on many systems that can be programmed in C/C++, including many non-UNIX systems, and many systems that lack operating system altogether. Requiring malloc to zero out the memory goes against C's philosophy of saving CPU as much as possible.
The standard provides a zeroing call, calloc, that can be used if you need to zero out the memory. But in cases when you are planning to initialize the memory yourself as soon as you get it, the CPU cycles spent making sure the block is zeroed out are a waste; the C standard aims to avoid this waste as much as possible, often at the expense of predictability.
Memory returned by malloc is not zeroed (or rather, is not guaranteed to be zeroed) because it does not need to be. There is no security risk in reusing uninitialized memory pulled from your own process' address space or page pool. You already know it's there, and you already know the contents. There is also no issue with the contents in a practical sense, because you're going to overwrite it anyway.
Incidentally, the memory returned by malloc is zeroed upon first allocation, because an operating system kernel cannot afford the risk of giving one process data that another process owned previously. Therefore, when the OS faults in a new page, it only ever provides one that has been zeroed. However, this is totally unrelated to malloc.
(Slightly off-topic: The Debian security thing you mentioned had a few more implications than using uninitialized memory for randomness. A packager who was not familiar with the inner workings of the code and did not know the precise implications patched out a couple of places that Valgrind had reported, presumably with good intent but to disastrous effect. Among these was the "random from uninitialized memory", but it was by far not the most severe one.)
I think that the assumption that it is non-deterministic is plain wrong, particularly as you ask for a non-threaded context. (In a threaded context, scheduling randomness could introduce some non-determinism.)
Just try it out. Create a sequential, deterministic application that
does a whole bunch of allocations
fills the memory with some pattern, eg fill it with the value of a counter
free every second of these allocations
newly allocate the same amount
run through these new allocations and register the value of the first byte in a file (as textual numbers one per line)
run this program twice and register the result in two different files. My idea is that these files will be identical.
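The allocation part of that experiment might look like the sketch below. The block count, block size and the name sample_reused_blocks are arbitrary choices for illustration, and writing the samples to a file is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { N_BLOCKS = 64, BLOCK_SIZE = 256 };

/* Allocate N_BLOCKS blocks, fill each with a counter value, free every
 * second one, allocate the same number again and record the first byte
 * of each new block WITHOUT initialising it first.
 * Returns the number of samples stored, or -1 on failure. */
static int sample_reused_blocks(unsigned char *samples, size_t max_samples)
{
    unsigned char *blocks[N_BLOCKS] = {0};
    unsigned char *fresh[N_BLOCKS / 2] = {0};
    int result = -1;
    int i;

    if (max_samples < (size_t)(N_BLOCKS / 2))
        return -1;

    for (i = 0; i < N_BLOCKS; i++) {
        blocks[i] = malloc(BLOCK_SIZE);
        if (blocks[i] == NULL)
            goto cleanup;
        memset(blocks[i], i + 1, BLOCK_SIZE);   /* counter pattern */
    }

    for (i = 1; i < N_BLOCKS; i += 2) {         /* free every second block */
        free(blocks[i]);
        blocks[i] = NULL;
    }

    for (i = 0; i < N_BLOCKS / 2; i++) {
        fresh[i] = malloc(BLOCK_SIZE);
        if (fresh[i] == NULL)
            goto cleanup;
        samples[i] = fresh[i][0];   /* deliberately read before any write */
    }
    result = N_BLOCKS / 2;

cleanup:
    for (i = 0; i < N_BLOCKS; i++)
        free(blocks[i]);
    for (i = 0; i < N_BLOCKS / 2; i++)
        free(fresh[i]);
    return result;
}
```

To finish the experiment, print the samples one per line, run the program twice and compare the two files. Whether they match will depend on the allocator and on address space layout randomization: an allocator that keeps bookkeeping pointers inside freed blocks may well hand back first bytes that differ from run to run.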
Even in "normal" single-threaded programs, memory is freed and reallocated many times. Malloc will return to you memory that you had used before.
Even single-threaded code may do malloc then free then malloc and get back previously used, non-zero memory.
There is no guarantee that brk/sbrk return 0ed-out data; this is an implementation detail. It is generally a good idea for an OS to do that to reduce the possibility that sensitive information from one process finds its way into another process, but nothing in the specification says that it will be the case.
Also, the fact that malloc is implemented on top of brk/sbrk is also implementation-dependent, and can even vary based on the size of the allocation; for example, large allocations on Linux have traditionally used mmap on /dev/zero instead.
Basically, you can neither rely on malloc()ed regions containing garbage nor on it being all-0, and no program should assume one way or the other about it.
The simplest way I can think of putting the answer is like this:
If I am looking for wall space to paint a mural, I don't care whether it is white or covered with old graffiti, since I'm going to prime it and paint over it. I only care whether I have enough square footage to accommodate the picture, and I care that I'm not painting over an area that belongs to someone else.
That is how malloc thinks. Zeroing memory every time a process ends would be wasted computational effort. It would be like re-priming the wall every time you finish painting.
There is a whole ecosystem of programs living inside a computer's memory, and you cannot control the order in which mallocs and frees happen.
Imagine that the first time you run your application and malloc() something, it gives you an address with some garbage. Then your program shuts down, your OS marks that area as free. Another program takes it with another malloc(), writes a lot of stuff and then leaves. You run your program again, it might happen that malloc() gives you the same address, but now there's different garbage there, that the previous program might have written.
I don't actually know the implementation of malloc() in any system and I don't know if it implements any kind of security measure (like randomizing the returned address), but I don't think so.
It is very deterministic.
