Write a faster coroutine in c without setjmp - c

I'm writing a simple coroutine using setjmp and longjmp according to the wiki.
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
jmp_buf orig;
void foo()
{
printf("Hello from a coroutine!\n");
longjmp(orig, 1);
}
int main()
{
char *coroutine_stack = malloc(8192);
coroutine_stack += 8192; //move to the bottom of this allocated space, i.e., top of the coroutine's stack
coroutine_stack -= 16; //make it 16-byte aligned
if(setjmp(orig) == 0) //save the stack frame
{
asm(
"movq %0, %%rsp "
:
: "rm" (coroutine_stack)
:
); //change the stack pointer to the already malloc'd region
foo();
//this should never be touched
return 1;
}
printf("Return from a coroutine\n");
return 0;
}
However, the wiki says there could be a faster solution if I don't use setjmp and longjmp:
Minimalist implementations, which do not piggyback off the setjmp and longjmp functions, may achieve the same result via a small block of inline assembly which swaps merely the stack pointer and program counter, and clobbers all other registers. This can be significantly faster, as setjmp and longjmp must conservatively store all registers which may be in use according to the ABI, whereas the clobber method allows the compiler to store (by spilling to the stack) only what it knows is actually in use.
I'm a little confused about this because I already know that setjmp preserves callee-saved registers, but I don't know a way to instruct the compiler to preserve the registers that are actually in use.

Related

Inline and stack frame control

The following are artificial examples. Compiler optimizations will clearly change the final outcome dramatically. However, and I cannot stress this enough: by temporarily disabling optimizations I intend to get an upper bound on stack usage, and I expect that further compiler optimization can only improve the situation.
The discussion is centered around GCC only. I would like to have fine control over how automatic variables get released from the stack. Scoping with blocks does not ensure that memory will be released when automatic variables go out of scope. Functions, as far as I know, do ensure that.
However, when inlining, what is the case? For example:
inline __attribute__((always_inline)) void foo()
{
uint8_t buffer1[100];
// Stack Size Measurement A
// Do something
}
void bar()
{
foo();
uint8_t buffer2[100];
// Stack Size Measurement B
// Do something else
}
Can I always expect that at measurement point B, the stack will contain only buffer2 and that buffer1 has been released?
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
I would like to have fine control over how automatic variables get released from the stack.
Lots of confusion here. The optimizing compiler could store some automatic variables only in registers, without using any slot in the call frame. The C language specification (n1570) does not require any call stack.
And a given register, or slot in the call frame, can be reused for different purposes (e.g. different automatic variables in different parts of the function). Register allocation is a significant role of compilers.
Can I always expect that at measurement point B, the stack will contain only buffer2 and that buffer1 has been released?
Certainly not. The compiler could prove that, from some point in your code onward, the space for buffer1 is no longer needed, and so reuse that space for other purposes.
is there any way I can have fine control over stack deallocations?
No, there is not. The call stack is an implementation detail, and might not be used (or be "abused" in your point of view) by the compiler and the generated code.
As a silly example, if buffer1 is not used in foo, the compiler might not allocate space for it. And some clever compilers might allocate just 8 bytes for it, if they can prove that only the first 8 bytes of buffer1 are used.
More seriously, in some cases, GCC is able to do tail-call optimizations.
You should be interested in invoking GCC with -fstack-reuse=all, -Os,
-Wstack-usage=256, -fstack-usage, and other options.
Of course, the concrete stack usage depends upon the optimization levels. You might also inspect the generated assembler code, e.g. with -S -O2 -fverbose-asm
For example, the following code e.c:
int f(int x, int y) {
int t[100];
t[0] = x;
t[1] = y;
return t[0]+t[1];
}
when compiled with GCC 8.1 on Linux/Debian/x86-64 using gcc -S -fverbose-asm -O2 e.c gives in e.s
.text
.p2align 4,,15
.globl f
.type f, @function
f:
.LFB0:
.cfi_startproc
# e.c:5: return t[0]+t[1];
leal (%rdi,%rsi), %eax #, tmp90
# e.c:6: }
ret
.cfi_endproc
.LFE0:
.size f, .-f
and you see that the stack frame is not grown by 100*4 bytes. And this is still the case with:
int f(int x, int y, int n) {
int t[n];
t[0] = x;
t[1] = y;
return t[0]+t[1];
}
which actually generates the same machine code as above. And if instead of the + above I'm calling some inline int add(int u, int v) { return u+v; } the generated code is not changing.
Be aware of the as-if rule, and of the tricky notion of undefined behavior (if n were 1 above, accessing t[1] would be UB).
Can I always expect that at measurement point B, the stack will contain only buffer2 and that buffer1 has been released?
No. It's going to depend on GCC version, target, optimization level, options.
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
Your requirement is so specific I guess you will likely have to write yourself the code in assembler.
mov BYTE PTR [rbp-20], 1 and mov BYTE PTR [rbp-10], 2 only show the offsets relative to rbp within the stack frame. At run time, both variants have the same peak stack usage.
There are two differences between the inlined and the non-inlined version:
1) With a function call, buffer1 is released when foo() returns. When foo() is inlined, buffer1 may not be released until bar() returns, which means the peak stack usage lasts longer.
2) A function call adds some overhead, such as saving stack frame information, compared with inlining.

How to intercept a static library call in C language?

Here's my question:
There is a static library (xxx.lib) and some C files that call the function foo() in xxx.lib. I'm hoping to get a notification message every time foo() is called. But I'm not allowed to change any source code written by others.
I've spent several days searching on the Internet and found several similar Q&As but none of these suggestions could really solve my problem. I list some of them:
Use gcc -wrap: Override a function call in C
Unfortunately, I'm using the Microsoft C compiler and linker, and I can't find an option equivalent to -wrap.
Microsoft Detours:
Detours intercepts C calls in runtime and re-direct the call to a trampoline function. But Detours is only free for IA32 version, and it's not open source.
I'm thinking about injecting a jmp instruction at the start of function foo() to redirect it to my own function. However, that is not feasible when foo() is empty, like
void foo() ---> will be compiled into 0xC3 (ret)
{ but it'll need at least 8 bytes to inject a jmp
}
I found a technology named Hotpatch on MSDN. It says the linker will add several bytes of padding at the beginning of each function. That's great, because I can replace the padding bytes with a jmp instruction to realize the interception at runtime! But when I use the /FUNCTIONPADMIN option with the linker, it gives me a warning:
LINK : warning LNK4044: unrecognized option '/FUNCTIONPADMIN'; ignored
Could anybody tell me how to make a "hotpatchable" image correctly? Is it a workable solution for my problem?
Do I still have any hope of achieving this?
If you have the source, you can instrument the code with GCC without changing the source by adding -finstrument-functions for the build of the files containing the functions you are interested in. You'll then have to write __cyg_profile_func_enter/exit functions to print your tracing. An example from here:
#include <stdio.h>
#include <time.h>
static FILE *fp_trace;
void
__attribute__ ((constructor))
trace_begin (void)
{
fp_trace = fopen("trace.out", "w");
}
void
__attribute__ ((destructor))
trace_end (void)
{
if(fp_trace != NULL) {
fclose(fp_trace);
}
}
void
__cyg_profile_func_enter (void *func, void *caller)
{
if(fp_trace != NULL) {
fprintf(fp_trace, "e %p %p %lu\n", func, caller, (unsigned long)time(NULL));
}
}
void
__cyg_profile_func_exit (void *func, void *caller)
{
if(fp_trace != NULL) {
fprintf(fp_trace, "x %p %p %lu\n", func, caller, (unsigned long)time(NULL));
}
}
Another way to go, if you have the source, is to recompile the library as a shared library. From there it is possible to do runtime insertions of your own .so/.dll using any number of debugging systems. (ltrace on unix, something or other on windows [somebody on windows -- please edit]).
If you don't have source, then I would think your option 3 should still work. Folks writing viruses have been doing it for years. You may have to do some manual inspection (because x86 instructions aren't all the same length), but the trick is to pull out a full instruction and replace it with a jump to somewhere safe. Do what you have to do, get the registers back into the same state as if the instruction you removed had run, then jump to just after the jump instruction you inserted.
The VC compiler provides 2 options /Gh & /GH for hooking functions.
The /Gh flag causes a call to the _penter function at the start of every method or function, and the /GH flag causes a call to the _pexit function at the end of every method or function.
So, if I write some code in _penter to find out the address of the caller function, then I'll be able to intercept any function selectively by comparing the function address.
I made a sample:
#include <stdio.h>
#include <intrin.h> // declares _ReturnAddress
void foo()
{
}
void bar()
{
}
void main() {
bar();
foo();
printf ("I'm main()!");
}
void __declspec(naked) _cdecl _penter( void )
{
__asm {
push ebp; // standard prolog
mov ebp, esp;
sub esp, __LOCAL_SIZE
pushad; // save registers
}
unsigned int addr;
// _ReturnAddress always returns the address directly after the call, but that is not the start of the function!
// subtract 5 bytes as instruction for call _penter
// is 5 bytes long on 32-bit machines, e.g. E8 <00 00 00 00>
addr = (unsigned int)_ReturnAddress() - 5;
if (addr == (unsigned int)foo) printf ("foo() is called.\n");
if (addr == (unsigned int)bar) printf ("bar() is called.\n");
_asm {
popad; // restore regs
mov esp, ebp; // standard epilog
pop ebp;
ret;
}
}
Build it with cl.exe source.c /Gh and run it:
bar() is called.
foo() is called.
I'm main()!
It's perfect!
More examples about how to use _penter and _pexit can be found here A Simple Profiler and tracing with penter pexit and A Simple C++ Profiler on x64.
I've solved my problem using this method, and I hope it can help you also.
:)
I don't think there is any way to do this without changing any code.
The easiest way I can think of is to write a wrapper for your void foo() function and Find/Replace the calls with your wrapper.
void myFoo(){
foo();
}
Instead of calling foo() call myFoo().
Hope this will help you.
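If editing the call sites is acceptable, the wrapper is also a natural place for the notification the question asks for. The following is only a minimal sketch: foo() is a local stand-in for the function that would really come from xxx.lib, and the counter and the foo_call_count() helper are hypothetical names introduced for illustration.

```c
#include <stdio.h>

/* Stand-in for the library function; in the real program foo() lives in xxx.lib. */
void foo(void)
{
}

/* Hypothetical counter so the program can check how often foo() was reached. */
static unsigned long foo_calls = 0;

unsigned long foo_call_count(void)
{
    return foo_calls;
}

/* Wrapper: emit the notification first, then forward to the real function. */
void myFoo(void)
{
    foo_calls++;
    fprintf(stderr, "foo() is called (%lu)\n", foo_calls);
    foo();
}
```

Every call site that is switched from foo() to myFoo() then produces one message on stderr and increments the counter; call sites that are left alone stay silent.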

Print out value of stack pointer

How can I print out the current value at the stack pointer in C in Linux (Debian and Ubuntu)?
I tried google but found no results.
One trick, which is not portable or really even guaranteed to work, is to simply print out the address of a local as a pointer.
void print_stack_pointer() {
void* p = NULL;
printf("%p", (void*)&p);
}
This will essentially print out the address of p, which is a good approximation of the current stack pointer.
There is no portable way to do that.
In GNU C, this may work for target ISAs that have a register named SP, including x86 where gcc recognizes "SP" as short for ESP or RSP.
// broken with clang, but usually works with GCC
register void *sp asm ("sp");
printf("%p", sp);
This usage of local register variables is now deprecated by GCC:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm
Defining a register variable does not reserve the register. Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc. ...
It's also broken in practice with clang where sp is treated like any other uninitialized variable.
In addition to duedl0r's answer with specifically GCC you could use __builtin_frame_address(0) which is GCC specific (but not x86 specific).
This should also work on Clang (but there are some bugs about it).
Taking the address of a local (as JaredPar answered) is also a solution.
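As a rough illustration of the builtin mentioned above, here is a minimal sketch for GCC or Clang. The function names are made up for this example, and the value is the frame address of the helper function, which is close to but not identical to the stack pointer.

```c
#include <stdint.h>
#include <stdio.h>

/* GCC/Clang builtin: returns the frame address of the current function.
   noinline keeps the helper from being merged into its caller, so the
   value describes the helper's own frame. */
__attribute__((noinline)) uintptr_t current_frame_address(void)
{
    return (uintptr_t)__builtin_frame_address(0);
}

/* Print the approximate stack position at the point of the call. */
void print_frame_address(void)
{
    printf("frame address: %p\n", (void *)current_frame_address());
}
```

Calling print_frame_address() from functions at different call depths gives a feel for how far the stack has grown, without any target-specific assembly.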
Notice that AFAIK the C standard does not require any call stack in theory.
Remember Appel's paper: garbage collection can be faster than stack allocation; A very weird C implementation could use such a technique! But AFAIK it has never been used for C.
One could dream of other techniques. And you could have split stacks (at least on recent GCC), in which case the very notion of a stack pointer makes much less sense (because then the stack is not contiguous, and could be made of many segments of a few call frames each).
On Linux you can use the proc pseudo-filesystem to print the stack pointer.
Have a look here, at the /proc/your-pid/stat pseudo-file, at the fields 28, 29.
startstack %lu
The address of the start (i.e., bottom) of the
stack.
kstkesp %lu
The current value of ESP (stack pointer), as found
in the kernel stack page for the process.
You just have to parse these two values!
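Parsing needs a little care, because field 2 (the command name) is parenthesised and may itself contain spaces. The following is a sketch under that assumption; the function names are invented for this example, and some kernels report kstkesp as 0 for a running process, so treat the second value with caution.

```c
#include <stdio.h>
#include <string.h>

/* Parse fields 28 (startstack) and 29 (kstkesp) from one /proc/<pid>/stat line.
   Returns 0 on success, -1 if the line is malformed or too short. */
int parse_stat_stack_fields(const char *line,
                            unsigned long *startstack,
                            unsigned long *kstkesp)
{
    /* Field 2 (comm) ends at the last ')' on the line. */
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    p++;

    /* The next token is field 3; skip fields 3 through 27. */
    for (int field = 3; field < 28; field++) {
        while (*p == ' ')
            p++;
        if (*p == '\0')
            return -1;
        while (*p != ' ' && *p != '\0')
            p++;
    }

    /* p now sits just before field 28. */
    if (sscanf(p, "%lu %lu", startstack, kstkesp) != 2)
        return -1;
    return 0;
}

/* Read the two fields for the calling process itself. */
int read_own_stack_fields(unsigned long *startstack, unsigned long *kstkesp)
{
    char buf[1024];
    FILE *f = fopen("/proc/self/stat", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok != NULL ? parse_stat_stack_fields(buf, startstack, kstkesp) : -1;
}
```

Splitting the parser from the file access keeps the field counting easy to check against a hand-written line.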
You can also use an extended assembler instruction, for example:
#include <stdint.h>
uint64_t getsp( void )
{
uint64_t sp;
asm( "mov %%rsp, %0" : "=rm" ( sp ));
return sp;
}
For a 32 bit system, 64 has to be replaced with 32, and rsp with esp.
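That information is also in the file /proc/<your-process-id>/maps, on the line where the string [stack] appears (so it is independent of the compiler or machine). Note that this line gives the address range of the stack mapping rather than the current stack pointer. A process can read its own file through /proc/self/maps; reading the file of another user's process generally requires elevated privileges such as root.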
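A minimal sketch of reading the [stack] line, assuming the usual maps line layout of "start-end perms offset dev inode path" with hexadecimal addresses. The function names are hypothetical, and the result is the range of the stack mapping, not the stack pointer itself.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/maps.
   Returns 0 and fills lo/hi if the line describes the [stack] mapping, else -1. */
int parse_stack_mapping(const char *line, unsigned long *lo, unsigned long *hi)
{
    if (strstr(line, "[stack]") == NULL)
        return -1;
    if (sscanf(line, "%lx-%lx", lo, hi) != 2)
        return -1;
    return 0;
}

/* Scan the calling process's own maps file for the [stack] line. */
int find_own_stack_mapping(unsigned long *lo, unsigned long *hi)
{
    char line[512];
    int rc = -1;
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;
    while (rc != 0 && fgets(line, sizeof line, f) != NULL)
        rc = parse_stack_mapping(line, lo, hi);
    fclose(f);
    return rc;
}
```

Any stack pointer value obtained by one of the other methods should fall between lo and hi for the main thread.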
Try lldb or gdb. For example we can set backtrace format in lldb.
settings set frame-format "frame #${frame.index}: ${ansi.fg.yellow}${frame.pc}: {pc:${frame.pc},fp:${frame.fp},sp:${frame.sp}} ${ansi.normal}{ ${module.file.basename}{\`${function.name-with-args}{${frame.no-debug}${function.pc-offset}}}}{ at ${ansi.fg.cyan}${line.file.basename}${ansi.normal}:${ansi.fg.yellow}${line.number}${ansi.normal}{:${ansi.fg.yellow}${line.column}${ansi.normal}}}{${function.is-optimized} [opt]}{${frame.is-artificial} [artificial]}\n"
So we can print the bp , sp in debug such as
frame #10: 0x208895c4: pc:0x208895c4,fp:0x01f7d458,sp:0x01f7d414 UIKit`-[UIApplication _handleDelegateCallbacksWithOptions:isSuspended:restoreState:] + 376
Look more at https://lldb.llvm.org/use/formatting.html
You can use setjmp. The exact details are implementation dependent, look in the header file.
#include <setjmp.h>
jmp_buf jmp;
setjmp(jmp);
printf("%08x\n", jmp[0].j_esp);
This is also handy when executing unknown code. You can check the sp before and after and do a longjmp to clean up.
If you are using MSVC you can use the provided function _AddressOfReturnAddress()
It'll return the address of the return address, which is guaranteed to be the value of RSP at a function's entry. Once you return from that function, the RSP value will be increased by 8 since the return address is popped off.
Using that information, you can write a simple function that return the current address of the stack pointer like this:
uintptr_t GetStackPointer() {
return (uintptr_t)_AddressOfReturnAddress() + 0x8;
}
int main(int argc, const char *argv[]) {
uintptr_t rsp = GetStackPointer();
printf("Stack pointer: %p\n", (void *)rsp);
}
On ARM Cortex-M with the CMSIS headers, you may use the following:
uint32_t msp_value = __get_MSP(); // Read Main Stack pointer
In the same way, if you want to get the PSP value:
uint32_t psp_value = __get_PSP(); // Read Process Stack pointer
If you want to use assembly language, you can read MSP and PSP with the MRS instruction:
MRS R0, MSP // Read Main Stack pointer to R0
MRS R0, PSP // Read Process Stack pointer to R0

why addresses of elements in the stack are reversed in ubuntu64?

I wrote a simple program to print out the addresses of the elements in the stack
#include <stdio.h>
#include <stdlib.h>
void f(int i,int j,int k)
{
int *pi = (int*)malloc(sizeof(int));
int a =20;
printf("%p,%p,%p,%p,%p\n",&i,&j,&k,&a,pi);
}
int main()
{
f(1,2,3);
return 0;
}
output:(in ubuntu64, unexpected)
0x7fff4e3ca5dc,0x7fff4e3ca5d8,0x7fff4e3ca5d4,0x7fff4e3ca5e4,0x2052010
output:(in ubuntu32 , as expected)
0xbf9525f0,0xbf9525f4,0xbf9525f8,0xbf9525d8,0x931f008
environment for ubuntu64:
$uname -a
Linux 3.8.0-26-generic #38-Ubuntu SMP Mon Jun 17 21:43:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$gcc -v
Target: x86_64-linux-gnu
gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~13.04)
According to the diagram above, the earlier an element is pushed onto the stack, the higher its address,
and with the cdecl calling convention, the rightmost parameter is pushed onto the stack first.
Local variables should be pushed onto the stack after the parameters.
But on ubuntu64 the output is reversed, which is not what I expected:
the address of k is :0x7fff4e3ca5d4 //<---should have been pushed to the stack first
the address of j is :0x7fff4e3ca5d8
the address of i is :0x7fff4e3ca5dc
the address of a is :0x7fff4e3ca5e4 //<---should have been pushed to the stack after i,j,k
Any ideas about it?
Even though a clear ABI has been defined for both architectures, compilers do not guarantee that this is respected. You might wonder why; the reason is usually performance. Passing variables on the stack is slower than passing them in registers, since the application needs to access memory to retrieve them. Another example of this habit is how compilers use the EBP/RBP register. EBP/RBP is meant to hold the frame pointer, that is, the base address of the current stack frame, which makes local variables easy to access. However, the frame-pointer register is often used as a general-purpose register to improve performance. This avoids the instructions that save, set up, and restore frame pointers; it also makes an extra register available in many functions, which is particularly important on the x86_32 architecture, where programs are usually hungry for registers. The main drawback is that it makes debugging impossible on some machines. For more information, check the -fomit-frame-pointer option of gcc.
The calling conventions of x86_32 and x86_64 are rather different. The most relevant difference is that x86_64 tries to pass function arguments in registers and uses the stack only if no register is available or the argument is too large to be passed in registers.
We start from the x86_32 ABI. I have slightly changed your example:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__i386__)
#define STACK_POINTER "ESP"
#define FRAME_POINTER "EBP"
#elif defined(__x86_64__)
#define STACK_POINTER "RSP"
#define FRAME_POINTER "RBP"
#else
#error Architecture not supported yet!!
#endif
void foo(int i,int j,int k)
{
int a =20;
uint64_t stack=0, frame_pointer=0;
// Retrieve stack
asm volatile(
#if defined (__i386__)
"mov %%esp, %0\n"
"mov %%ebp, %1\n"
#else
"mov %%rsp, %0\n"
"mov %%rbp, %1\n"
#endif
: "=m"(stack), "=m"(frame_pointer)
:
: "memory");
// retrieve parameters x86_64
#if defined (__x86_64__)
int i_reg=-1, j_reg=-1, k_reg=-1;
asm volatile ( "mov %%edi, %0\n" // 32-bit moves: the destinations are 4-byte ints
"mov %%esi, %1\n"
"mov %%edx, %2\n"
: "=m"(i_reg), "=m"(j_reg), "=m"(k_reg)
:
: "memory");
#endif
printf("%s=%p %s=%p\n", STACK_POINTER, (void*)stack, FRAME_POINTER, (void*)frame_pointer);
printf("%d, %d, %d\n", i, j, k);
printf("%p\n%p\n%p\n%p\n",&i,&j,&k,&a);
#if defined (__i386__)
// Calling convention c
// EBP --> Saved EBP
char * EBP=(char*)frame_pointer;
printf("Function return address : 0x%x \n", *(unsigned int*)(EBP +4));
printf("- i=%d &i=%p \n",*(int*)(EBP+8) , EBP+8 );
printf("- j=%d &j=%p \n",*(int*)(EBP+ 12), EBP+12);
printf("- k=%d &k=%p \n",*(int*)(EBP+ 16), EBP+16);
#else
printf("- i=%d &i=%p \n",i_reg, &i );
printf("- j=%d &j=%p \n",j_reg, &j );
printf("- k=%d &k=%p \n",k_reg ,&k );
#endif
}
int main()
{
foo(1,2,3);
return 0;
}
The ESP register is used by foo to point to the top of the stack. The EBP register acts as a "base pointer". All arguments have been pushed onto the stack in reverse order. The arguments passed by main to foo and the local variables in foo can all be referenced as offsets from the base pointer. After foo is called, the stack holds, from higher to lower addresses: k, j, i, the return address, the saved EBP (where EBP points), and then foo's local variables.
Assuming that the compiler is using the frame pointer, we can access the function arguments by adding offsets in steps of 4 bytes to the EBP register. Note that the first argument is located at offset 8, because the saved EBP occupies offset 0 and the call instruction pushes the return address into the caller onto the stack, at offset 4.
printf("Function return address : 0x%x \n", *(unsigned int*)(EBP +4));
printf("- i=%d &i=%p \n",*(int*)(EBP+8) , EBP+8 );
printf("- j=%d &j=%p \n",*(int*)(EBP+ 12), EBP+12);
printf("- k=%d &k=%p \n",*(int*)(EBP+ 16), EBP+16);
This is more or less how arguments are passed to a function in x86_32.
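The offset arithmetic can be checked without touching real registers. The sketch below simulates such a frame as a plain array of 4-byte words; the helper name cdecl_arg and the sample values in the usage note are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Read the n-th (0-based) 4-byte argument from a simulated cdecl frame.
   `ebp` points at the saved-EBP slot: offset +4 holds the return address,
   offset +8 the first argument, and each further argument is 4 bytes higher. */
int32_t cdecl_arg(const unsigned char *ebp, unsigned n)
{
    int32_t value;
    memcpy(&value, ebp + 8 + 4 * n, sizeof value);
    return value;
}
```

With a simulated frame such as `uint32_t frame[5] = { saved_ebp, return_address, 1, 2, 3 }`, the calls cdecl_arg((const unsigned char *)frame, 0), 1 and 2 read back i, j and k, mirroring EBP+8, EBP+12 and EBP+16 in the code above.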
In x86_64 there are more registers available, it makes sense to use them to pass the parameter of a function. The x86_64 ABI can be found here : http://www.uclibc.org/docs/psABI-x86_64.pdf. The calling convention starts at page 14.
First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function. Some of the most relevant are:
INTEGER: This class consists of integral types that fit into one of the general purpose registers, for example int, long, bool.
SSE: This class consists of types that fit into an SSE register (float, double).
SSEUP: This class consists of types that fit into an SSE register and can be passed and returned in the most significant half of it (__float128, __m128, __m256).
NO_CLASS: This class is used as initializer in the algorithms. It will be used for padding and empty structures and unions.
MEMORY: This class consists of types that will be passed and returned in memory via the stack (structure types).
Once a parameter is assigned to a class, it is passed to the function according to these rules:
MEMORY, pass the argument on the stack.
INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
SSE, the next available SSE register is used, the registers are taken in the order from %xmm0 to %xmm7.
SSEUP, the eightbyte is passed in the upper half of the last used SSE register.
If there are no registers available for any eightbyte of an argument, the whole
argument is passed on the stack. If registers have already been assigned for some
eightbytes of such an argument, the assignments get reverted. Once registers are assigned, the arguments passed in memory are pushed on the stack in reversed order.
Since you are passing int variables, the arguments will be inserted into the general purpose registers.
%rdi --> i
%rsi --> j
%rdx --> k
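The register assignment described by the rules above can be modelled in a few lines. This is a deliberately simplified sketch, not the ABI algorithm itself: it assumes every argument occupies a single eightbyte, it ignores SSEUP and the revert rule for multi-eightbyte arguments, and all type and function names are invented for the example.

```c
#include <stddef.h>

/* Simplified argument classes (one eightbyte per argument). */
enum arg_class { ARG_INTEGER, ARG_SSE, ARG_MEMORY };

/* Tracks how many registers of each kind have been handed out so far. */
struct reg_state {
    size_t next_int;
    size_t next_sse;
};

/* Return where the next argument of class `cls` would be passed:
   the next free register of the matching sequence, or "stack". */
const char *assign_location(struct reg_state *st, enum arg_class cls)
{
    static const char *const int_regs[] = {
        "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"
    };
    static const char *const sse_regs[] = {
        "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7"
    };

    if (cls == ARG_INTEGER && st->next_int < sizeof int_regs / sizeof int_regs[0])
        return int_regs[st->next_int++];
    if (cls == ARG_SSE && st->next_sse < sizeof sse_regs / sizeof sse_regs[0])
        return sse_regs[st->next_sse++];
    return "stack"; /* MEMORY class, or no register left */
}
```

Starting from a zeroed reg_state, three ARG_INTEGER arguments come out as %rdi, %rsi and %rdx, matching the mapping for i, j and k; a seventh integer argument would fall through to "stack".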
So you can retrieve them with the following code:
#if defined (__x86_64__)
int i_reg=-1, j_reg=-1, k_reg=-1;
asm volatile ( "mov %%edi, %0\n" // 32-bit moves: the destinations are 4-byte ints
"mov %%esi, %1\n"
"mov %%edx, %2\n"
: "=m"(i_reg), "=m"(j_reg), "=m"(k_reg)
:
: "memory");
#endif
I hope I have been clear.
In conclusion,
why addresses of elements in the stack are reversed in ubuntu64?
Because they are not passed on the stack. The arguments arrive in registers, and the addresses you retrieved in that manner are the addresses of copies that the called function spilled into its own stack frame so that their addresses could be taken.
There is absolutely no restriction on how arguments are passed to a function, nor where they go on the stack (or in a register, or in shared memory for that matter). It is up to the compiler to instrument passing the variables in such a manner that the caller and callee agree upon. Unless you force a specific calling convention (for linking code that was compiled with different compilers), or unless there is a hardware dictated ABI - there is no guarantee.

Creating and using a new stack in memory

For some special reasons (please don't ask me why), for some functions, I want to use a separate stack. So for example, say I want the function malloc to use a different stack for its processing, I need to switch to my newly created stack before it is called and get back to the original stack used by the program after it finishes. So the algorithm would be something like this.
switch_to_new_stack
call malloc
switch back to the original stack
What is the easiest and most efficient way of doing this? Any idea?
It probably doesn't fit your definition of easy or efficient, but the following could be one way to do it:
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>
/* utility functions */
static void getctx(ucontext_t* ucp)
{
if (getcontext(ucp) == -1) {
perror("getcontext");
exit(EXIT_FAILURE);
}
}
static void print_sp()
{
#if defined(__x86_64)
unsigned long long x; asm ("mov %%rsp, %0" : "=m" (x));
printf("sp: %p\n",(void*)x);
#elif defined(__i386)
unsigned long x; asm ("mov %%esp, %0" : "=m" (x));
printf("sp: %p\n",(void*)x);
#elif defined(__powerpc__) && defined(__PPC64__)
unsigned long long x; asm ("addi %0, 1, 0" : "=r" (x));
printf("sp: %p\n",(void*)x);
#elif defined(__powerpc__)
unsigned long x; asm ("addi %0, 1, 0" : "=r" (x));
printf("sp: %p\n",(void*)x);
#else
printf("unknown architecture\n");
#endif
}
/* stack for 'my_alloc', size arbitrarily chosen */
static int malloc_stack[1024];
static ucontext_t malloc_context; /* context malloc will run in */
static ucontext_t current_context; /* context to return to */
static void my_malloc(size_t sz)
{
printf("in my_malloc(%zu) ", sz);
print_sp();
}
void call_my_malloc(size_t sz)
{
/* prepare context for malloc */
getctx(&malloc_context);
malloc_context.uc_stack.ss_sp = malloc_stack;
malloc_context.uc_stack.ss_size = sizeof(malloc_stack);
malloc_context.uc_link = &current_context;
makecontext(&malloc_context, (void(*)())my_malloc, 1, sz);
if (swapcontext(&current_context, &malloc_context) == -1) {
perror("swapcontext");
exit(EXIT_FAILURE);
}
}
int main()
{
printf("malloc_stack = %p\n", (void*)malloc_stack);
printf("in main ");
print_sp();
call_my_malloc(42);
printf("in main ");
print_sp();
return 0;
}
This should work on all platforms where makecontext(3) is supported. Quoting from the manpage (where I also got the inspiration for the example code):
The interpretation of ucp->uc_stack is just as in sigaltstack(2), namely, this struct contains the start and length of a memory area to be used as the stack, regardless of the direction of growth of the stack. Thus, it is not necessary for the user program to worry about this direction.
Sample output from PPC64:
$ gcc -o stack stack.c -Wall -Wextra -W -ggdb -std=gnu99 -pedantic -Werror -m64 && ./stack
malloc_stack = 0x10010fe0
in main sp: 0xfffffe44420
in my_malloc(42) sp: 0x10011e20
in main sp: 0xfffffe44420
GCC supports split stacks, which work a bit like what you described.
http://gcc.gnu.org/wiki/SplitStacks
The goal of the project is different, but implementation will do what you ask.
The goal of split stacks is to permit a discontiguous stack which is grown automatically as needed. This means that you can run multiple threads, each starting with a small stack, and have the stack grow and shrink as required by the program. It is then no longer necessary to think about stack requirements when writing a multi-threaded program. The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.

Resources