The SysV ABI defines the C-level and assembly calling conventions for Linux.
I would like to write a generic thunk that verifies that a function satisfied the ABI restrictions on callee preserved registers and (perhaps) tried to return a value.
So given a target function like int foo(int, int) it's pretty easy3 to write such a thunk in assembly, something like1:
foo_thunk:
push rbp
push rbx
push r12
push r13
push r14
push r15
call foo
cmp rbp, [rsp + 40]
jne bad_rbp
cmp rbx, [rsp + 32]
jne bad_rbx
cmp r12, [rsp + 24]
jne bad_r12
cmp r13, [rsp + 16]
jne bad_r13
cmp r14, [rsp + 8]
jne bad_r14
cmp r15, [rsp]
jne bad_r15
ret
Now of course I don't actually wan to write a separate foo_thunk method for each call, I just want one generic one. This one should take a pointer to the underlying function (let's say in rax), and would use an indirect call call [rax] than call foo but would otherwise be the same.
What I can't figure out is how to to implement the transparent use of the thunk at the C level (or in C++, where there seems to be more meta-programming options - but let's stick to C here). I want to take something like:
foo(1, 2);
and translate it to a call to the thunk, but still passing the same arguments in the same places (that's needed for the thunk to work).
It is expected that I modify the source, perhaps with macro or template magic, so the call above could be changed to:
CHECK_THUNK(foo, (1, 2));
Giving the macro the name of the underlying function. In principle it could translate this to2:
check_thunk(&foo, 1, 2);
How can I declare check_thunk though? The first argument is "some type" of function pointer. We could try:
check_thunk(void (*ptr)(void), ...);
So a "generic" function pointer (all pointers can validly be cast to this, and we'll only actually call it assembly, outside the claws of the language standard), plus varargs.
This doesn't work though: the ... has totally different promotion rules than a properly prototyped function. It will work for the foo(1, 2) example, but if you call foo(1.0, 2) instead, the varargs version will just leave the 1.0 as a double and you'll be calling foo with a totally wrong value (a double value punned as an integer.
The above also has the disadvantage of passing the function pointer as the first argument, which means the thunk no longer works as-is: it has to save the function pointer in rdi somewhere and then shift all the values over by one (i.e., mov rdi, rsi). If there are non-register args, things get really messy.
Is there any way to make this work smoothly?
Note: this type of thunk is basically incompatible with any passing of parameters on the stack, which is an acceptable limitation of this approach (it should simply not be used for functions with that many arguments or with MEMORY class arguments).
1 This is checks the callee preserved registers, but the other checks are similarly straightforward.
2 In fact, you don't even really need the macro for that - but it's also there so you can turn off the thunk in release builds and just do a direct call.
3 Well by "easy" I guess I mean one that doesn't work in all cases. The shown thunk doesn't correctly align the stack (easy to fix), and breaks if foo has any stack-passed arguments (significantly harder to fix).
One way to do this, in a gcc-specific way, is to take advantage of typeof and nested functions to create a function pointer that embeds the call to the underlying function, but itself doesn't have any arguments.
This pointer can be passed to the thunk method, which calls it and verifies ABI compliance.
Here's an example of transforming a call to int add3(int, int, int) using this method:
The original call looks like:
int res = add3(a, b, c);
Then you wrap the call in a macro, like this2:
CALL_THUNKED(int res, add3, (a,b,c));
... which expands into something like:
typedef typeof(add3 (a,b,c)) ret_type;
ret_type closure() {
return add3 (a,b,c);
}
typedef ret_type (*typed_closure)(void);
typedef ret_type (*thunk_t)(typed_closure);
thunk_t thunk = (thunk_t)closure_thunk;
int res = thunk(&closure);
We create the closure() function on the stack, which calls directly into add3 with the original arguments. We can take the address of this closure and pass it an asm function without difficulty: calling it will have the ultimate effect of calling add3 with the arguments1.
The rest of the typedefs is basically dealing with the return type. We have only a single closure_thunk method, declared like this void* closure_thunk(void (*)(void)); and implemented in assembly. It takes a function pointer (any function pointer is convertible to any other), but the return type is "wrong". We cast it to thunk_t which is a dynamically generated typedef for a function that has the "right" return type.
Of course, that's certainly not legal for C functions, but we are implementing the function in asm, so we kind of sidestep the issue (if you wanted to be a bit more compliant, you could perhaps ask the asm code for a function pointer of the right type, which can "generate" it each time, outside of the reach of the standard: of course it's just returning the same pointer each time).
The closure_thunk function in asm is implemented along the lines of:
GLOBAL closure_thunk:function
closure_thunk:
push rsi
push_callee_saved
call rdi
; set up the function name
mov rdi, [rsp + 48]
; now check whether any regs were clobbered
cmp rbp, [rsp + 40]
jne bad_rbp
cmp rbx, [rsp + 32]
jne bad_rbx
cmp r12, [rsp + 24]
jne bad_r12
cmp r13, [rsp + 16]
jne bad_r13
cmp r14, [rsp + 8]
jne bad_r14
cmp r15, [rsp]
jne bad_r15
add rsp, 7 * 8
ret
That is, push all the registers we want to check on the stack (along with the function name), call the function in rdi and then do your checks. The bad_* methods aren't shown, but they basically spit out an error message like "Function add3 overwrote rbp... naughty!" and abort() the process.
This breaks if any arguments are passed on the stack, but it does work for return values passed on the stack (because the ABI for that case passes a pointer to the location for the return value in `rax).
1 How this is accomplished is kind of magic: gcc actually writes a few bytes of executable code onto the stack, and the closure function pointer points there. The few bytes basically loads a register with a pointer to the region that contains the captured variables (a, b, c in this case), and then calls the actual (read-only) closure() code which then can access the captured variables though that pointer (and pass them to add3).
2 As it turns out, we could probably use gcc's statement expression syntax to write the macro in a more usual function like syntax, something like int res = CALL_THUNKED(add3, (a,b,c)).
At the C source level (without modifying gcc or the linker to insert the thunk for you), you could define different prototypes for each thunk but still share the same implementation.
You could put multiple labels on the definition in the asm source, so check_thunk_foo has the same address as check_thunk_bar, but you can use a different C prototype for each.
Or you could make weak aliases like this:
int check_thunk_foo(void*, int, int)
__attribute__ ((weak, alias ("check_thunk_generic")));
// or maybe this should be ((weakref ("check_thunk_generic")))
#define foo(...) check_thunk_foo((void*)&foo, __VA_ARGS__)
// or to put the args in their original slots,
// but then you'd need different thunks for different numbers of integer args.
#define foo(x, y) check_thunk_foo((x), (y), (void*)&foo)
The major downside to this is that you need to copy+modify the original prototype for every function. You could hack this up with CPP macros so there's a single point of definition for the arg list, and the real prototype (and the thunk if enabled) both use it. Possibly by re-including the same .h twice, with a wrapper macro defined differently. Once for the real prototypes, again for the thunks.
BTW, passing the function pointer as an extra arg to a generic thunk is potentially problematic. I think it's not possible to reliably remove the first arg and forward the rest in the x86-64 SysV ABI. You don't know how many stack args there are, for functions that take more than 6 integer args. And you don't know if there are FP stack args before the first integer stack arg.
This should work fine for functions that pass all their register-possible args in registers. (i.e. if there are any stack args, they're large structs by value or other things that couldn't go in an integer register.)
To solve this problem, the thunk could dispatch based on return address instead of an extra hidden arg, if you had something like debug info to map call site return addresses to call targets. Or you could maybe get gcc to pass a hidden arg in rax or r11. Running call from inline asm sucks a lot, so you'd maybe need to customize gcc with support for some special attribute that passed a function pointer in an extra register.
but if you call foo(1.0, 2) instead, the varargs version will just leave the 1.0 as a double and you'll be calling foo with a totally wrong value (a double value punned as an integer.
Not that it matters, but no, you'd be calling foo(2, garbage) with xmm0=(double)1.0. Variadic functions still use register args the same as non-variadic functions (or with the option of passing FP args on the stack before you run out of registers, and setting al= less than 8).
Related
It is said that returning an oversized struct by value (as opposed to returning a pointer to the struct) from a function incurs unnecessary copy on the stack. By "oversized", I mean a struct that cannot fit in the return registers.
However, to quote Wikipedia
When an oversized struct return is needed, another pointer to a caller-provided space is prepended as the first argument, shifting all other arguments to the right by one place.
and
When returning struct/class, the calling code allocates space and passes a pointer to this space via a hidden parameter on the stack. The called function writes the return value to this address.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
If the function inlines, the copying through the return-value object can be fully optimized away. Otherwise, maybe not, and arg copying definitely can't be.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
It depends what the caller does with the return value,; if it's assigned to a provably private object (escape analysis), that object can be the return-value object, passed as the hidden pointer.
But if the caller actually wants to assign the return value to other memory, then it does need a temporary.
struct large retval = some_func(); // no extra copying at all
*p = some_func() // caller will make space for a local return-value object & copy.
(Unless the compiler knows that p is just pointing to a local struct large tmp;, and escape analysis can prove that there's no way some global variable could have a pointer to that same tmp var.)
long version, same thing with more details:
In the C abstract machine, there's a "return value object", and return foo copies the named variable foo to that object, even if it's a large struct. Or return (struct lg){1,2}; copies an anonymous struct. The return-value object itself is anonymous; nothing can take its address. (You can't int *p = &foo(123);). This makes it easier to optimize away.
In the caller, that anonymous return-value object can be assigned to whatever you want, which would be another copy if compilers didn't optimize anything. (All of this applies for any type, even int). Of course, compilers that aren't total garbage will avoid some, ideally all, of that copying, when doing so can't possibly change the observable results. And that depends on the design of the calling convention. As you say, most conventions, including all the mainstream x86 and x86-64 conventions, pass a "hidden pointer" arg for return values they choose not to return in register(s) for whatever reason (size, C++ having a non-trivial constructor).
struct large retval = foo(...);
For such calling conventions, the above code is effectively transformed to
struct large retval;
foo(&retval, ...);
So it's C return-value object actually is a local in the stack-frame of its caller. foo() is allowed to store into that return-value object whenever it wants during execution, including before reading some other objects. This allows optimization within the callee (foo) as well, so a struct large tmp = ... / return tmp can be optimized away to just store into the return-value object.
So there's zero extra copying when the caller does just want to assign the function return value to a newly declared local var. (Or to a local var which it can prove is still private, via escape analysis. i.e. not pointed-to by any global vars).
But what if the caller wants to store the return value somewhere else?
void caller2(struct large *lgp) {
*lgp = foo();
}
Can *lgp be the return-value object, or do we need to introduce a local temporary?
void caller2(struct large *lgp) {
// foo_asm(lgp); // nope, possibly unsafe
struct large retval; foo(&retval); *lgp = retval; // safe
}
If you want functions to be able to write large structs to arbitrary locations, you have to "sign off" on it by making that effect visible in your source.
What prevents the usage of a function argument as hidden pointer? for more details about why *lgp can't be the return-value object / hidden pointer, and another example. "A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else". Also details of whether struct large *restrict lgp would make it safe: probably yes if the function doesn't longjmp (otherwise stores to the supposedly anonymous retval object might end up as visible side effects without return having been reached), but GCC doesn't look for that optimization.
Why is tailcall optimization not performed for types of class MEMORY? - return bar() where bar returns the same struct should be possible as an optimized tailcall, causing extra copying. This can even introduce extra copying of the whole struct, as well as failing to optimize call bar / ret into jmp bar.
how c compiler treats a struct return value from a function, in ASM - thresholds for returning in registers. e.g. i386 System V always returns structs in memory, even struct {int x;};.
Is it possible within a function to get the memory address of the variable initialized by the return value?
C/C++ returning struct by value under the hood an actual example (but unfortunately using debug-mode compiler-generated asm, so it contains copying that isn't necessary).
How do objects work in x86 at the assembly level? example at the bottom of how x86-64 System V packs the bytes of a struct into RDX:RAX, or just RAX if less than 8 bytes.
An example showing early stores to the return-value object (instead of copying)
(all source + asm on the Godbolt compiler explorer)
// more or less extra size will get compilers to copy it around with SSE2 or not
struct large { int first, second; char pad[0];};
int *global_ptr;
extern int a;
NOINLINE // __attribute__((noinline))
struct large foo() {
struct large tmp = {1,2};
if (a)
tmp.second = *global_ptr;
return tmp;
}
(targeting GNU/Linux) clang -m32 -O3 -mregparm=1 creates an implementation that writes its return-value object before it's done reading everything else, exactly the case that would make it unsafe for the caller to pass a pointer to some globally-reachable memory.
The asm makes it clear that tmp is fully optimized away, or is the retval object.
# clang -O3 -m32 -mregparm=1
foo:
mov dword ptr [eax + 4], 2
mov dword ptr [eax], 1 # store tmp into the retval object
cmp dword ptr [a], 0
je .LBB0_2 # if (a == 0) goto ret
mov ecx, dword ptr [global_ptr] # load the global
mov ecx, dword ptr [ecx] # deref it
mov dword ptr [eax + 4], ecx # and store to the retval object
.LBB0_2:
ret
(-mregparm=1 means pass the first arg in EAX, less noisy and easier to quickly visually distinguish from stack space than passing on the stack. Fun fact: i386 Linux compiles the kernel with -mregparm=3. But fun fact #2: if a hidden pointer is passed on the stack (i.e. no regparm), that arg is callee pops, unlike the rest. The function will use ret 4 to do ESP+=4 after popping the return address into EIP.)
In a simple caller, the compiler just reserves some stack space, passes a pointer to it, and then can load member variables from that space.
int caller() {
struct large lg = {4, 5}; // initializer is dead, foo can't read its retval object
lg = foo();
return lg.second;
}
caller:
sub esp, 12
mov eax, esp
call foo
mov eax, dword ptr [esp + 4]
add esp, 12
ret
But with a less trivial caller:
int caller() {
struct large lg = {4, 5};
global_ptr = &lg.first;
// unknown(&lg); // or this: as a side effect, might set global_ptr = &tmp->first;
lg = foo(); // (except by inlining) the compiler can't know if foo() looks at global_ptr
return lg.second;
}
caller:
sub esp, 28 # reserve space for 2 structs, and alignment
mov dword ptr [esp + 12], 5
mov dword ptr [esp + 8], 4 # materialize lg
lea eax, [esp + 8]
mov dword ptr [global_ptr], eax # point global_ptr at it
lea eax, [esp + 16] # hidden first arg *not* pointing to lg
call foo
mov eax, dword ptr [esp + 20] # reload from the retval object
add esp, 28
ret
Extra copying with *lgp = foo();
int caller2(struct large *lgp) {
global_ptr = &lgp->first;
*lgp = foo();
return lgp->second;
}
# with GCC11.1 this time, SSE2 8-byte copying unlike clang
caller2: # incoming arg: struct large *lgp in EAX
push ebx #
mov ebx, eax # lgp, tmp89 # lgp needed after foo returns
sub esp, 24 # reserve space for a retval object (and waste 16 bytes)
mov DWORD PTR global_ptr, eax # global_ptr, lgp
lea eax, [esp+8] # hidden pointer to the retval object
call foo #
movq xmm0, QWORD PTR [esp+8] # 8-byte copy of both halves
movq QWORD PTR [ebx], xmm0 # *lgp_2(D), tmp86
mov eax, DWORD PTR [ebx+4] # lgp_2(D)->second, lgp_2(D)->second # reload int return value
add esp, 24
pop ebx
ret
The copy to *lgp needs to happen, but it's somewhat of a missed optimization to reload from there, instead of from [esp+12]. (Saves a byte of code size at the cost of more latency.)
Clang does the copy with two 4-byte integer register mov loads/stores, but one of them is into EAX so it already has the return value ready.
You might also want to look at the result of assigning to memory freshly allocated with malloc. Compilers know that nothing else can (legally) be pointing to the newly allocated memory: that would be use-after-free undefined behaviour. So they may allow passing on a pointer from malloc as the return-value object if it hasn't been passed to anything else yet.
Related fun fact: passing large structs by value always requires a copy (if the function doesn't inline). But as discussed in comments, the details depend on the calling convention. Windows differs from i386 / x86-64 System V calling conventions (all non-Windows OSes) on this:
SysV calling conventions copy the whole struct to the stack. (if they're too large to fit in a pair of registers for x86-64)
Windows x64 makes a copy and passes (like a normal arg) a pointer to that copy. The callee "owns" the arg and can modify it, so a tmp copy is still needed. (And no, const struct large foo has no effect.)
https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows.
This really depends on your compiler, but in general the way this works is that the caller allocates the memory for the struct return value, but the callee also allocates stack space for any intermediate value of that structure. This intermediate allocation is used when the function is running, and then the struct is copied onto the caller's memory when the function returns.
For reference as to why your solution won't always work, consider a program which has two of the same struct and returns one based on some condition:
large_t returntype(int condition) {
large_t var1 = {5};
large_t var2 = {6};
// More intermediate code here
if(condition) return var1;
else return var2;
}
In this case, both may be required by the intermediate code, but the return value is not known at compile time, so the compiler doesn't know which to initialize on the caller's stack space. It's easier to just keep it local and copy on return.
EDIT: Your solution may be the case in simple functions, but it really depends on the optimizations performed by each individual compiler. If you're really interested in this, check out https://godbolt.org/
So I was working on a project and was a little bored and thought about how to break C really hard:
Is it be possible, to trick the compiler in using jumps (goto) for a function call? - Maybe, I answered to myself. So after a bit of working and doing I realised, that some pointer stuff wasn't working correctly, but in an (at least for me) unexpected way: the goto wouldn't work as intended. After a little bit of experimenting, I came up with this stuff (comments removed, since I sometimes keep unused code in them, when testing):
//author: me, The Array :)
#include <stdio.h>
void * func_return();
void (*break_ptr)(void) = (void *)func_return;
void * func_return(){
printf("ok2\n");
break_ptr = &&test2;
return NULL;
if(0 == 1){
test2:
printf("sh*t\n");
}
}
void scoping(){
printf("beginning of scoping\n");
break_ptr();
printf("after func call #1\n");
break_ptr();
printf("!!!YOU WILL NOT SEE THIS!!!!\n");
}
int main(){
printf("beginning of programm\n");
scoping();
printf("ending programm\n");
}
I used gcc to compile this as I don't know any other compiler, that supports the use of that &&!
My platform is windows 64 bit and I used that most basic way to compile this:
gcc.exe "however_you_want_to_call_it.c" -o "however_you_want_to_call_it.exe"
When looking over that code I expected and wanted it to print "sh*t\n" to the console window (of course the \n will be invisible). But it turns out gcc is somewhat too smart for me? I guess this comes, when trying to break something..
Infact, as the title says, it returns twice:
beginning of programm
beginning of scoping
ok2
after func call #1
ok2
ending programm
It does not return twice, like the fork function and propably prints the following stuff twice or sth., no it returns out of the function AND the function that called it. So after the second call it does not print "!!!YOU WILL NOT SEE THIS!!!!\n" to the console, but rather "ending programm", as it returned twice. (I am trying to amplify the fact, that the "ending programm" is printed, as the programm does not crash)
So the reason, why I posted that here, is the following: my questions..
Why does it not go to/ jump to/ call to the actual test2 label and instead goes to the beginning of that function?
How would I achieve the thing of my first question?
Why does it return twice? I figured it is propably a compiler thing instead of a runtime thing, but I guess I'll wait for someones answer
Can the same thing (the returning twice) be achieved the first time the function "break_ptr" is called, instead of the second time?
I do not know and do not care if this also works in c++.
Now I can see many ways this can be usefull, some malicious and some actually good. For example could you code an enterprise function, which returns your function. Enterprise solutions to problems tend to be weird, so why not make a function which returns your code, idk..
Yet it can be malicious, for example, when some code is returning unexpectatly or even without return values.. I can imagine this existing in a dll file and a header file which simply reads "extern void *break_ptr();" or sth.. did not test it. (Yet there are way crueler ways to mess with someone..)
I could not find this documented anywhere on the internet. Please send me some links or references about this, if you find some, I want to learn more about it.
If this is "just" a bug and someone of the gnu/gcc guys is reading this: Please do NOT remove it, as it is too much fun working with these things.
Thank you in advance for your answers and your time and I am sorry for making this so long. I wanted to make sure everything collected about this is in one place. (Yet still I am sorry if I missed something..)
From gcc documentation on labels of values:
You may not use this mechanism to jump to code in a different function. If you do that, totally unpredictable things happen.
The behavior you are seeing is properly documented. Inspect the generated assembly to really know what code does the compiler generate.
The assembly from godbolt on gcc10.2 with no optimizations:
break_ptr:
.quad func_return
.LC0:
.string "ok2"
func_return:
push rbp
mov rbp, rsp
.L2:
mov edi, OFFSET FLAT:.LC0
call puts
mov eax, OFFSET FLAT:.L2
mov QWORD PTR break_ptr[rip], rax
mov eax, 0
pop rbp
ret
.LC1:
.string "beginning of scoping"
.LC2:
.string "after func call #1"
.LC3:
.string "!!!YOU WILL NOT SEE THIS!!!!"
scoping:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC1
call puts
mov rax, QWORD PTR break_ptr[rip]
call rax
mov edi, OFFSET FLAT:.LC2
call puts
mov rax, QWORD PTR break_ptr[rip]
call rax
mov edi, OFFSET FLAT:.LC3
call puts
nop
pop rbp
ret
.LC4:
.string "beginning of programm"
.LC5:
.string "ending programm"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC4
call puts
mov eax, 0
call scoping
mov edi, OFFSET FLAT:.LC5
call puts
mov eax, 0
pop rbp
ret
shows that .L2 label was placed on top of function and the if (0 == 1) { /* this */ } was optimized out by the compiler. When you jump on .L2 you jump to beginning of the function, except that stack is incorrectly setup, because push rbp is omitted.
Why does it not go to/ jump to/ call to the actual test2 label and instead goes to the beginning of that function?
Because the documentation says that if you jump to another function "totally unpredictable things happen"
How would I achieve the thing of my first question?
Hard to say, since "jumping into a function" is not really something you should do.
Why does it return twice? I figured it is propably a compiler thing instead of a runtime thing, but I guess I'll wait for someones answer
Because returning twice is an element of the set of "unpredictable things"
Can the same thing (the returning twice) be achieved the first time the function "break_ptr" is called, instead of the second time?
See above. What you're doing will cause unpredictable things.
And just to point it out, your code has other flaws that may or may not be a part of this. func_return is a function taking an unspecified number of arguments returning a void pointer. break_ptr is a function taking NO arguments and returning void. The proper pointer would be
void * func_return();
void *(*break_ptr)() = func_return;
Notice three things. Apart from removing the cast, I removed void from the parenthesis and added an asterisk. But a better alternative would be
void * func_return(void);
void *(*break_ptr)(void) = func_return;
The main thing here is, do NOT cast to silence the compiler. Fix the problem instead. Read more about casting here
Your cast invokes undefined behavior, which essentially is the same thing as "unpredictable things happen".
Also, you're missing a return statement in that function.
void * func_return(){
printf("ok2\n");
break_ptr = &&test2;
return NULL;
if(0 == 1){
test2:
printf("sh*t\n");
}
// What happens here?
}
Omitting the return statement can only safely be done in a function returning void but this function returns void*. Omitting it will cause undefined behavior which, again, means that unpredictable things happen.
My programming language compiles to C, I want to implement tail recursion optimization. The question here is how to pass control to another function without "returning" from the current function.
It is quite easy if the control is passed to the same function:
void f() {
__begin:
do something here...
goto __begin; // "call" itself
}
As you can see there is no return value and no parameters, those are passed in a separate stack adressed by a global variable.
Another option is to use inline assembly:
#ifdef __clang__
#define tail_call(func_name) asm("jmp " func_name " + 8");
#else
#define tail_call(func_name) asm("jmp " func_name " + 4");
#endif
void f() {
__begin:
do something here...
tail_call(f); // "call" itself
}
This is similar to goto but as goto passes control to the first statement in a function, skipping the "entry code" generated by a compiler, jmp is different, it's argument is a function pointer, and you need to add 4 or 8 bytes to skip the entry code.
The both above will work but only if the callee and the caller use the same amount of stack for local variables which is allocated by the entry code of the callee.
I was thinking to do leave manually with inline assembly, then replace the return address on the stack, then do a legal function call like f(). But my attempts all crashed. You need to modify BP and SP somehow.
So again, how to implement this for x64? (Again, assuming functions have no arguments and return void). Portable way without inline assembly is better, but assembly is accepted. Maybe longjump can be used?
Maybe you can even push the callee address on the stack, replacing the original return address and just ret?
Do not try to do this yourself. A good C compiler can perform tail-call elimination in many cases and will do so. In contrast, a hack using inline assembly has a good chance of going wrong in a way that is difficult to debug.
For example, see this snippet on godbolt.org. To duplicate it here:
The C code I used was:
int foo(int n, int o)
{
if (n == 0) return o;
puts("***\n");
return foo(n - 1, o + 1);
}
This compiles to:
.LC0:
.string "***\n"
foo:
test edi, edi
je .L4
push r12
mov r12d, edi
push rbp
mov ebp, esi
push rbx
mov ebx, edi
.L3:
mov edi, OFFSET FLAT:.LC0
call puts
sub ebx, 1
jne .L3
lea eax, [r12+rbp]
pop rbx
pop rbp
pop r12
ret
.L4:
mov eax, esi
ret
Notice that the tail call has been eliminated. The only call is to puts.
Since you don't need arguments and return values, how about combining all function into one and use labels instead of function names?
f:
__begin:
...
CALL(h); // a macro implementing traditional call
...
if (condition_ret)
RETURN; // a macro implementing traditional return
...
goto g; // tail recurse to g
The tricky part here is RETURN and CALL macros. To return you should keep yet another stack, a stack of setjump buffers, so when you return you call longjump(ret_stack.pop()), and when you call you do ret_stack.push(setjump(f)). This is poetical rendition ofc, you'll need to fill out the details.
gcc can offer some optimization here with computed goto, they are more lightweight than longjump. Also people who write vms have similar problems, and seemingly have asm-based solutions for those even on MSVC, see example here.
And finally such approach even if it saves memory, may be confusing to compiler, so can cause performance anomalies. You probably better off generating for some portable assembler-like language, llvm maybe? Not sure, should be something that has computed goto.
The venerable approach to this problem is to use trampolines. Essentially, every compiled function returns a function pointer (and maybe an arg count). The top level is a tight loop that, starting with your main, simply calls the returned function pointer ad infinitum. You could use a function that longjmps to escape the loop, i.e., to terminate the progam.
See this SO Q&A. Or Google "recursion tco trampoline."
For another approach, see Cheney on the MTA, where the stack just grows until it's full, which triggers a GC. This works once the program is converted to continuation passing style (CPS) since in that style, functions never return; so, after the GC, the stack is all garbage, and can be reused.
I will suggest a hack. The x86 call instruction, which is used by the compiler to translate your function calls, pushes the return address on the stack and then performs a jump.
What you can do is a bit of a stack manipulation, using some inline assembly and possibly some macros to save yourself a bit of headache. You basically have to overwrite the return address on the stack, which you can do immediately in the function called. You can have a wrapper function which overwrites the return address and calls your function - the control flow will then return to the wrapper which then moves to wherever you pointed it to.
Today, I played around with incrementing function pointers in assembly code to create alternate entry points to a function:
.386
.MODEL FLAT, C
.DATA
INCLUDELIB MSVCRT
EXTRN puts:PROC
HLO DB "Hello!", 0
WLD DB "World!", 0
.CODE
dentry PROC
push offset HLO
call puts
add esp, 4
push offset WLD
call puts
add esp, 4
ret
dentry ENDP
main PROC
lea edx, offset dentry
call edx
lea edx, offset dentry
add edx, 13
call edx
ret
main ENDP
END
(I know, technically this code is invalid since it calls puts without the CRT being initialized, but it works without any assembly or runtime errors, at least on MSVC 2010 SP1.)
Note that in the second call to dentry I took the address of the function in the edx register, as before, but this time I incremented it by 13 bytes before calling the function.
The output of this program is therefore:
C:\Temp>dblentry
Hello!
World!
World!
C:\Temp>
The first output of "Hello!\nWorld!" is from the call to the very beginning of the function, whereas the second output is from the call starting at the "push offset WLD" instruction.
I'm wondering if this kind of thing exists in languages that are meant to be a step up from assembler like C, Pascal or FORTRAN. I know C doesn't let you increment function pointers but is there some other way to achieve this kind of thing?
AFAIK you can only write functions with multiple entry-points in asm.
You can put labels on all the entry points, so you can use normal direct calls instead of hard-coding the offsets from the first function-name.
This makes it easy to call them normally from C or any other language.
The earlier entry points work like functions that fall-through into the body of another function, if you're worried about confusing tools (or humans) that don't allow function bodies to overlap.
You might do this if the early entry-points do a tiny bit of extra stuff, and then fall through into the main function. It's mainly going to be a code-size saving technique (which might improve I-cache / uop-cache hit rate).
Compilers tend to duplicate code between functions instead of sharing large chunks of common implementation between slightly different functions.
However, you can probably accomplish it with only one extra jmp with something like:
int foo(int a) { return bigfunc(a + 1); }
int bar(int a) { return bigfunc(a + 2); }
int bigfunc(int x) { /* a lot of code */ }
See a real example on the Godbolt compiler explorer
foo and bar tailcall bigfunc, which is slightly worse than having bar fall-through into bigfunc. (Having foo jump over bar into bigfunc is still good, esp. if bar isn't that trivial.)
Jumping into the middle of a function isn't in general safe, because non-trivial functions usually need to save/restore some regs. So the prologue pushes them, and the epilogue pops them. If you jump into the middle, then the pops in the prologue will unbalance the stack. (i.e. pop off the return address into a register, and return to a garbage address).
See also Does a function with instructions before the entry-point label cause problems for anything (linking)?
You can use the longjmp function: http://www.cplusplus.com/reference/csetjmp/longjmp/
It's a fairly horrible function, but it'll do what you seek.
I'm using an AMD 64bit (I don't think it matters what exact architecture) on Linux, also 64bit. Compiling with gcc to elf64.
I've seen from the C ABI that integer arguments are passed to a function via general purpose registers, and I can find the values on the assembly side of my code (the callee). The problem arises when I need to retrieve the results from the callee to the caller.
As far as I can understand, RAX gets the 1st integer returned value, and I can easily find and use that one. The 2nd integer returned value is passed via RDX. And this is the point that baffles me.
I can also see, from the C ABI, that RDX is the register used to pass the third integer function argument from the caller to the callee, but my function doesn't use a third argument.
How do I get the RDX out from my function? Do I have to fake an argument in the function just to be able to refer to it on the caller side?
fixed point multiplication 16.16:
called from C looks like:
typedef long int Fixedpoint;
Fixedpoint _FixedMul(Fixedpoint v1, Fixedpoint v2);
and this is the function itself:
_FixedMul:
push bp
mov bp, sp
; entering the function EDI contains v1, ESI contains v2. So:
mov eax, edi ; eax = v1
imul dword esi ; eax = v1 * v2
; at this point EDX contains the higher part of the
; imul moltiplication, EAX the lower one.
add eax, 8000h ; round by adding 2^(-17)
adc edx, 0 ; whole part of result is in DX
shr eax, 16 ; put the fractional part in AX
pop bp
ret
from System V Application Binary Interface
AMD64 Architecture Processor Supplement
Returning of Values algorithm:
The returning of values is done according to the following
Classify the return type with the classification algorithm.
If the type has class MEMORY, then the caller provides space for the return
value and passes the address of this storage in %rdi as if it were the first
argument to the function. In effect, this address becomes a “hidden” first ar-
gument. This storage must not overlap any data visible to the callee through
other names than this argument.
On return %rax will contain the address that has been passed in by the
caller in %rdi.
If the class is INTEGER, the next available register of the sequence %rax,
%rdx is used.
I hope is more clear what i mean.
PS: sorry for confusion I've made in comments. Thanks for the tip.
The AMD64 calling conventions (System V ABI) designate a dual-use role for register RDX: it may be used to pass the third argument (if any) and it may be used as second return register. (cf. Figure 3.4, page 21)
Depending on the function signature it's used for either or none of these roles.
So why are there 2 return registers (i.e. RAX and RDX)? This allows a function to return values of up to 128 bits in general purpose registers - such as __int128 or a struct with two 8 byte fields.
Thus, to access the RDX register from your C call site you just have to adjust your function's signature. That means instead of
typedef long int Fixedpoint;
Fixedpoint _FixedMul(Fixedpoint v1, Fixedpoint v2);
declare it as:
typedef long int Fixedpoint;
struct Fixedpoint_Pair { // when used as return value:
Fixedpoint fst; // - passed in RAX
Fixedpoint snd; // - passed in RDX
};
typedef struct Fixedpoint_Pair Fixedpoint_Pair;
Fixedpoint_Pair _FixedMul(Fixedpoint v1, Fixedpoint v2);
Then you can access RDX like this in your C code:
Fixedpoint_Pair r = _FixedMul(a, b);
printf("RDX: %ld\n", r.snd);
References:
page 18,19: especially 'The classification of aggregate (structure and arrays)', point 3:
If the size of the aggregate exceeds a single eightbyte, each is classified
separately. Each eightbyte gets initialized to class NO_CLASS.
page 22: 'Return of Values', point 3:
If the class is INTEGER, the next available register of the sequence %rax,
%rdx is used.
Your English is absolutely fine.
How arguments are passed depends on the calling convention that has been adopted - the two most common ones, __stdcall and __cdecl, use the stack to pass all arguments, but for example the __fastcall convention will use registers for the first two args, and in x64, it's still different. See here for a comprehensive list.
Actual return values are returned in (E/R)AX; however, if the function receives some pointers to "return" more values, those pointers will be passed as normal arguments - as previously stated, usually on the stack.