Calling a C function from Assembly -- switching calling convention

I have an assembly application for Linux x64 where I pass arguments to the functions via registers, thus I'm using a certain calling convention, in this case fastcall. Now I want to call a C function from the assembly application which, say, expects 10 arguments. Do I have to switch to cdecl for that and pass the arguments via the stack, regardless of the fact that everywhere else in my application I'm passing them via registers? Is it allowed to mix calling conventions in one application?

I assume that by fastcall, you mean the amd64 calling convention used by the SysV ABI (i.e. what Linux uses) where the first few arguments are passed in rdi, rsi, and rdx.
The ABI is slightly complicated, the following is a simplification. You might want to read the specification for details.
Generally speaking, the first few (leftmost) integer or pointer arguments are placed into the registers rdi, rsi, rdx, rcx, r8, and r9. Floating point arguments are passed in xmm0 to xmm7. If the register space is exhausted, additional arguments are passed through the stack from right to left. For example, to call a function with 10 integer arguments:
foo(a, b, c, d, e, f, g, h, i, k);
you would need code like this:
mov $a,%edi
mov $b,%esi
mov $c,%edx
mov $d,%ecx
mov $e,%r8d
mov $f,%r9d
push $k
push $i
push $h
push $g
call foo
add $32,%rsp
For your concrete example of getnameinfo:
int getnameinfo(
const struct sockaddr *sa,
socklen_t salen,
char *host,
size_t hostlen,
char *serv,
size_t servlen,
int flags);
You would pass sa in rdi, salen in rsi, host in rdx, hostlen in rcx, serv in r8, servlen in r9 and flags on the stack.
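The register-then-stack rule described above can be sketched in C. This is a simplified model for integer and pointer arguments only; the helper names (sysv_int_arg_location, sysv_stack_arg_bytes) are made up for illustration and are not part of any ABI header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified x86-64 System V model: integer/pointer arguments only.
   Floating-point and struct arguments follow other rules and are ignored. */
static const char *const INT_ARG_REGS[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};
#define NUM_INT_ARG_REGS (sizeof INT_ARG_REGS / sizeof INT_ARG_REGS[0])

/* Hypothetical helper: where does the integer argument at this
   zero-based position go? */
const char *sysv_int_arg_location(size_t index)
{
    return index < NUM_INT_ARG_REGS ? INT_ARG_REGS[index] : "stack";
}

/* Hypothetical helper: bytes of stack used by the arguments that do not
   fit in registers (each integer argument takes one 8-byte slot). */
size_t sysv_stack_arg_bytes(size_t nargs)
{
    return nargs > NUM_INT_ARG_REGS ? (nargs - NUM_INT_ARG_REGS) * 8 : 0;
}

/* The two examples from the text: foo() with 10 arguments needs the
   `add $32,%rsp` cleanup, and getnameinfo's 7th argument (flags) is
   the only one on the stack. */
void check_examples_from_text(void)
{
    assert(sysv_stack_arg_bytes(10) == 32);
    assert(strcmp(sysv_int_arg_location(6), "stack") == 0);
}
```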

Yes of course. Calling convention is applied on per-function basis. This is a perfectly valid application:
int __stdcall func1()
{
return(1);
}
int __fastcall func2()
{
return(2);
}
int __cdecl main(void)
{
func1();
func2();
return(0);
}
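The example above uses Windows-style keywords. As a rough equivalent for GCC or Clang on x86-64, the ms_abi and sysv_abi function attributes select the convention per function. The sketch below assumes one of those compilers; on other targets the macros expand to nothing so the code still builds. The function names are made up for illustration.

```c
#include <assert.h>

/* ms_abi / sysv_abi are GCC and Clang attributes for x86 targets.
   Elsewhere the macros are empty and both functions use the default. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define ABI_MS   __attribute__((ms_abi))
#define ABI_SYSV __attribute__((sysv_abi))
#else
#define ABI_MS
#define ABI_SYSV
#endif

/* With the attribute active, a and b arrive in ecx and edx. */
ABI_MS int add_ms(int a, int b) { return a + b; }

/* With the attribute active, a and b arrive in edi and esi. */
ABI_SYSV int add_sysv(int a, int b) { return a + b; }

/* The caller sees each callee's convention in its declaration and
   emits the matching argument setup for each call. */
int call_both(int a, int b)
{
    int total = add_ms(a, b) + add_sysv(a, b);
    assert(total == 2 * (a + b));
    return total;
}
```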

You can, but you don't need to.
__attribute__((fastcall)) only asks for the first two parameters to be passed in registers - everything else will automatically be passed on the stack anyhow, just like with cdecl. This is done so that choosing a certain calling convention does not limit the number of parameters that can be given to a function.
In your example with 10 parameters for a function that is called with the fastcall calling convention, the first two parameters will be passed in registers, the remaining 8 automatically on the stack, just like with standard calling convention.
As you have chosen to use fastcall for all your other functions, I do not see a reason why you'd want to change this for one specific function.

Related

Try to pass argument to C function in Nasm elf64 but it return SIGFPE error

I'm trying to implement the C sqrt function in Nasm elf64. It works correctly without an argument (with the value of "a" defined in the function), but when I try to pass an argument to this function, the program fails with "Stopped reason: SIGFPE".
Here's my code
The c function
int le_sqrt(int a) {
int n=1;
int number = a;
for (int i=0; i<10; i++) {
n=(n+number/n)/2;
}
return n;
}
The nasm program
bits 64
global _start
extern le_sqrt
_start:
mov rbp,9 ;argument
push rbp ;same argument push
call le_sqrt ; my c function
mov rax,60 ;exit program
mov rsi,0
syscall
If you want to call le_sqrt(9) with System V AMD64 ABI calling convention, do this:
_start:
mov rdi,9
call le_sqrt
SIGFPE usually means an integer division by zero. In your assembly program you put the argument in rbp and push it, but under the System V AMD64 convention le_sqrt reads its first integer argument from edi, which you never set. If edi happens to hold 0, n becomes 0 after the first loop iteration and the next number/n divides by zero. See Microsoft calling conventions and System V ABI calling conventions (for 64-bit). For 32-bit, follow these calling conventions.
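To see why a wrong argument value leads to SIGFPE, here is a small sketch that runs the same loop as le_sqrt but checks the divisor before dividing. The function name first_zero_divisor is made up for illustration. It assumes the garbage value read as the argument is 0, which is not guaranteed.

```c
#include <assert.h>

/* Same algorithm as le_sqrt, but it checks the divisor first and returns
   the loop iteration at which a / n would divide by zero, or -1 if that
   never happens within the 10 iterations. */
int first_zero_divisor(int a)
{
    int n = 1;
    for (int i = 0; i < 10; i++) {
        if (n == 0)
            return i;            /* le_sqrt itself would raise SIGFPE here */
        n = (n + a / n) / 2;
    }
    return -1;
}

/* a == 9 converges to 3 and never divides by zero; a == 0 makes n zero
   in the first iteration, so the second one would fault. */
void check_sqrt_examples(void)
{
    assert(first_zero_divisor(9) == -1);
    assert(first_zero_divisor(0) == 1);
}
```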

How does this C program without libc work?

I came across a minimal HTTP server that is written without libc: https://github.com/Francesco149/nolibc-httpd
I can see that basic string handling functions are defined, leading to the write syscall:
#define fprint(fd, s) write(fd, s, strlen(s))
#define fprintn(fd, s, n) write(fd, s, n)
#define fprintl(fd, s) fprintn(fd, s, sizeof(s) - 1)
#define fprintln(fd, s) fprintl(fd, s "\n")
#define print(s) fprint(1, s)
#define printn(s, n) fprintn(1, s, n)
#define printl(s) fprintl(1, s)
#define println(s) fprintln(1, s)
And the basic syscalls are declared in the C file:
size_t read(int fd, void *buf, size_t nbyte);
ssize_t write(int fd, const void *buf, size_t nbyte);
int open(const char *path, int flags);
int close(int fd);
int socket(int domain, int type, int protocol);
int accept(int socket, sockaddr_in_t *restrict address,
socklen_t *restrict address_len);
int shutdown(int socket, int how);
int bind(int socket, const sockaddr_in_t *address, socklen_t address_len);
int listen(int socket, int backlog);
int setsockopt(int socket, int level, int option_name, const void *option_value,
socklen_t option_len);
int fork();
void exit(int status);
So I guess the magic happens in start.S, which contains _start and a special way of encoding syscalls by creating global labels which fall through and accumulating values in r9 to save bytes:
.intel_syntax noprefix
/* functions: rdi, rsi, rdx, rcx, r8, r9 */
/* syscalls: rdi, rsi, rdx, r10, r8, r9 */
/* ^^^ */
/* stack grows from a high address to a low address */
#define c(x, n) \
.global x; \
x:; \
add r9,n
c(exit, 3) /* 60 */
c(fork, 3) /* 57 */
c(setsockopt, 4) /* 54 */
c(listen, 1) /* 50 */
c(bind, 1) /* 49 */
c(shutdown, 5) /* 48 */
c(accept, 2) /* 43 */
c(socket, 38) /* 41 */
c(close, 1) /* 03 */
c(open, 1) /* 02 */
c(write, 1) /* 01 */
.global read /* 00 */
read:
mov r10,rcx
mov rax,r9
xor r9,r9
syscall
ret
.global _start
_start:
xor rbp,rbp
xor r9,r9
pop rdi /* argc */
mov rsi,rsp /* argv */
call main
call exit
Is this understanding correct? GCC uses the symbols defined in start.S for the syscalls, then the program starts in _start and calls main from the C file?
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)
It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.
Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.
special way of encoding syscalls
I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.
(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8/pop/syscall is the same size as call rel32.)
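For comparison, an inline-asm syscall wrapper of the kind mentioned above might look like the following. This is only a sketch, not code from the repository: raw_write is a made-up name, the asm path assumes x86-64 Linux with GCC or Clang extended asm, and other platforms fall back to the libc write() so the example still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch: issue the write system call directly with the syscall
   instruction on x86-64 Linux; elsewhere use the libc wrapper. */
static inline long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                 /* result comes back in rax */
                      : "a"(1L),                  /* write is syscall 1 on x86-64 Linux */
                        "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");  /* syscall overwrites rcx and r11 */
    return ret;
#else
    return (long)write(fd, buf, len);
#endif
}
```

Because the function is static inline, the compiler can place the syscall instruction at the call site and only has to treat rax, rcx and r11 as clobbered, rather than every call-clobbered register.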
Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.
So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).
static long callnum = 0; // r9 = 0 before a call to any of these
...
socket:
callnum += 38;
close:
callnum++; // can use inc instead of add 1
open: // missed optimization in their asm
callnum++;
write:
callnum++;
read:
tmp=callnum;
callnum=0;
retval = syscall(tmp, args);
Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type
register long callnum asm("r9"); // GCC extension
long open(args...) {
callnum++;
return write(args...);
}
long write(args...) {
callnum++;
return read(args...); // tailcall
}
long read(args...){
tmp=callnum;
callnum=0; // reset callnum for next call
return syscall(tmp, args...);
}
args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.
That's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (Does GCC optimize assembly source file?)
They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (REX + opcode + modrm). (Code golf.SE tips for x86 machine code)
I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)
mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.
But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
No, that's fully stand-alone incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)
(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, so another sign of it being compiler output, not hand-written.)
I don't know how they arranged to get the compiler not to touch r9. It seems just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.
As far as the ELF headers, see the comment at the top of the file, which links A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux - you'd assemble this with nasm -fbin and the output is a complete ELF binary, ready to run. Not a .o that you need to link + strip, so you get to account for every single byte in the file.
You're pretty much correct about what's going on. Very interesting, I've never seen something like this before. As you said, each label adds its increment to r9 and falls through to the next one, so r9 keeps accumulating until execution reaches read, whose syscall number is 0. This is why the ordering is pretty clever. If r9 is 0 when read is called, nothing needs to be added because r9 already holds the correct syscall number (the read code also zeroes r9 again before executing syscall, ready for the next call). write's syscall number is 1, so its label adds 1 to 0, as the macro call shows. open's syscall number is 2, so 1 is added at the open label, 1 more at the write label, and then the resulting number is moved into rax at the read label. And so on. Argument registers like rdi, rsi, rdx, etc. are not touched, so each label basically acts like a normal function call.
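The accumulation can be checked with a small C model of the fall-through chain. The increments are copied from the c(x, n) lines in start.S; the struct and function names are made up for illustration, and the model assumes r9 is 0 on entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entry points in the order they appear in start.S, with the amount each
   one adds to r9 before falling through to the next label. */
struct entry {
    const char *name;
    long add;
};

static const struct entry ENTRIES[] = {
    {"exit", 3},  {"fork", 3},     {"setsockopt", 4}, {"listen", 1},
    {"bind", 1},  {"shutdown", 5}, {"accept", 2},     {"socket", 38},
    {"close", 1}, {"open", 1},     {"write", 1},      {"read", 0},
};

/* Model of calling `name` with r9 == 0: start at that label and keep
   adding until the shared tail at read. Returns -1 for unknown names. */
long syscall_number(const char *name)
{
    size_t count = sizeof ENTRIES / sizeof ENTRIES[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(ENTRIES[i].name, name) == 0) {
            long r9 = 0;
            for (size_t j = i; j < count; j++)
                r9 += ENTRIES[j].add;
            return r9;
        }
    }
    return -1;
}

/* The totals match the numbers in the comments of start.S. */
void check_against_comments(void)
{
    assert(syscall_number("read") == 0);
    assert(syscall_number("socket") == 41);
    assert(syscall_number("exit") == 60);
}
```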
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
I'm assuming you're talking about this file. Not sure exactly what's going on here, but it looks like an ELF file is manually being created, probably to reduce size further.

what's the purpose of pushing address of local variables on the stack(assembly)

Let's say there is a function:
int caller()
{
int arg1 = 1;
int arg2 = 2;
int a = test(&arg1, &arg2);
}
int test(int *a, int *b)
{
...
}
so I don't understand why &arg1 and &arg2 have to be pushed on the stack too like this
I can understand that we can get address of arg1 and arg2 in the callee by using
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
but if we don't push these two on the stack,
we can also get their addresses by using:
leal 8(%ebp), %edx
leal 12(%ebp), %ecx
so why bother pushing &arg1 and &arg2 on the stack?
In the general case, test has to work when you pass it arbitrary pointers, including to extern int global_var or whatever. Then the caller has to call it according to the ABI / calling convention.
So the asm definition of test can't assume anything about where int *a points, e.g. that it points into its caller's stack frame.
(Or you could look at that as optimizing away the addresses in a call-by-reference on locals, so the caller must place the pointed-to objects in the arg-passing slots, and on return those 2 dwords of stack memory hold the potentially-updated values of *a and *b.)
You compiled with optimization disabled. Especially for the special case where the caller is passing pointers to locals, the solution to this problem is to inline the whole function, which compilers will do when optimization is enabled.
Compilers are allowed to make a private clone of test that takes its args by value, or in registers, or with whatever custom calling convention the compiler wants to use. Most compilers don't actually do this, though, and rely on inlining instead of custom calling conventions for private functions to get rid of arg-passing overhead.
Or if it had been declared static test, then the compiler would already know it was private and could in theory use whatever custom calling convention it wanted without making a clone with a name like test.clone1234. gcc does sometimes actually do that for constant-propagation, e.g. if the caller passes a compile-time constant but gcc chooses not to inline. (Or can't because you used __attribute__((noinline)) static test() {})
And BTW, with a good register-args calling convention like x86-64 System V, the caller would do lea 12(%rsp), %rdi / lea 8(%rsp), %rsi / call test or something. The i386 System V calling convention is old and inefficient, passing everything on the stack forcing a store/reload.
You have basically identified one of the reasons that stack-args calling conventions have higher overhead and generally suck.
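A short C sketch of the point about arbitrary pointers: the same function is called once with the addresses of two locals and once with the address of a global, so its code cannot assume the pointed-to objects sit at a fixed offset in the caller's frame. test_args stands in for test from the question, and its body is invented for illustration.

```c
#include <assert.h>

int global_var = 10;

/* The callee only receives two addresses. It cannot know whether they
   point into the caller's stack frame, static storage, or the heap. */
int test_args(int *a, int *b)
{
    *a += 1;
    return *a + *b;
}

int caller(void)
{
    int arg1 = 1;
    int arg2 = 2;
    int from_locals = test_args(&arg1, &arg2);       /* arg1 becomes 2: 2 + 2 */
    int from_mixed  = test_args(&global_var, &arg1); /* (global_var + 1) + 2  */
    assert(from_locals == 4);
    return from_locals + from_mixed;
}
```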
If you access arg1 and arg2 directly, you are accessing a portion of the stack that does not belong to this function. This is similar to what happens when someone uses a buffer overflow attack to read additional data from the caller's stack frame.
When a call has arguments, they are pushed onto the stack (in your case &arg1 and &arg2) and the function can use them as its valid list of arguments.

C compiler ignore second argument

I am trying to write a wrapper for a library and I'm attempting to get the compiler to ignore the second argument (for lack of a better phrase) when compiling.
I would like to get this to happen in Rust, but C would be fine as well.
Just as an example, suppose I have this C code
void func(int a, int b, int c) {
// ...
}
The calling convention for this in assembly would be arg 0 to rdi, arg 1 to rsi, arg 2 to rdx, arg 3 to rcx, arg 4 to r8, arg 5 to r9, and the rest to the stack.
I want to tell the compiler to put arg 0 in rdi and arg 1 in rdx, essentially skipping rsi. This is just for this particular calling convention.
The library I am calling against uses the second argument as an offset to a jump table essentially, but my wrapper will handle this internally.
My code would essentially look like
wrapper(arg0, arg1, arg2, ...) -> api(arg0, <static offset>, arg1, arg2, ...)
I want to try to do this without having to manually shift over each argument, since the API call is of indefinite arity and shifting stuff onto the stack could prove to be an issue. Also, I am trying to make this cross-compilable, so this particular calling convention may not always be used.
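One possible direction in C, offered only as a sketch and not taken from the thread, is to do the splice at source level with a variadic macro and let the compiler handle the register or stack placement for whatever calling convention the target uses. The api function, its behavior, and STATIC_OFFSET below are all hypothetical stand-ins, and the macro requires at least one argument after the first.

```c
#include <assert.h>

/* Hypothetical library entry point: the second parameter selects a
   jump-table slot, the rest are ordinary arguments. */
int api(int handle, int table_offset, int x, int y)
{
    return handle * 1000 + table_offset * 100 + x * 10 + y;
}

#define STATIC_OFFSET 7  /* made-up slot number for this wrapper */

/* Source-level wrapper: insert the fixed offset as argument 2 and
   forward the remaining arguments unchanged. No register is skipped by
   hand; the compiler emits the normal call sequence for api. */
#define wrapper(handle, ...) api((handle), STATIC_OFFSET, __VA_ARGS__)

int call_through_wrapper(void)
{
    int result = wrapper(1, 2, 3);   /* expands to api((1), 7, 2, 3) */
    assert(result == api(1, STATIC_OFFSET, 2, 3));
    return result;
}
```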

Get a probed function's arguments in the entry_handler of a kretprobe

I'm trying to intercept the kmalloc using a kretprobe void *__kmalloc(size_t size, gfp_t flags);
I can find out the return value of kmalloc using the handler member of the kretprobe structure.
static struct kretprobe kmalloc_probe = {
.handler = kmalloc_ret_handler,
.entry_handler = kmalloc_entry_handler,
.data_size = sizeof(struct kmalloc_read_args),
.maxactive = 20,
};
But I need a way to find the arguments the function was called with in the entry_handler.
This is my entry_handler function:
static int kmalloc_entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
I tried searching in all the registers of the regs struct argument, but no luck. The architecture I'm using is i686.
I know jprobes would be a better match for solving this type of problem but I need to solve it using only kretprobes.
Can you please give me a hint of how I could use the registers, or the stack to find the function call arguments?
A link to the pt_regs structure: http://lxr.free-electrons.com/source/arch/x86/include/asm/ptrace.h#L11
The conventions for argument passing in the kernel on x86 are described in the comments in asm/calling.h.
On 32-bit x86 systems, the first parameters of the functions in the Linux kernel (except system calls and some other stuff) are usually passed in %eax, %edx, %ecx, in order. This is because the sources are compiled with the '-mregparm=3' GCC option, set by default. This has been the case since at least kernel 2.6.32, and possibly earlier.
The remaining parameters are passed on stack.
If the function had a variable argument list (like sprintf()), all parameters were passed on stack, as far as I had seen.
So, in your case, size should be in %eax and flags - in %edx on entry to the function. If these registers are not clobbered somehow by the kretprobe, you should be able to find them in pt_regs.
On 64-bit x86 systems, the convention is simpler and more in line with x86-64 ABI. The first arguments of the kernel functions (again, except system calls and some special functions) are passed in %rdi, %rsi, %rdx, %rcx, %r8, %r9, in order, the remaining ones are on stack.
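The i386 mapping for __kmalloc(size, flags) can be sketched in plain userspace C. The struct mock_regs below is a stand-in that models only the three regparm registers; it is NOT the kernel's struct pt_regs, and struct kmalloc_args and read_kmalloc_args are made-up names. In a real module, the same two reads would presumably happen inside the entry_handler on the regs argument (assuming the kernel's 32-bit pt_regs names the fields ax and dx), with the values copied into ri->data for the return handler.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the i386 register snapshot. Only the registers that
   -mregparm=3 uses for arguments are modelled. */
struct mock_regs {
    unsigned long ax, dx, cx;
};

/* What an entry handler would want to remember for the return handler. */
struct kmalloc_args {
    size_t size;
    unsigned int flags;
};

/* With regparm(3): argument 0 is in eax, argument 1 in edx,
   argument 2 (unused by __kmalloc) in ecx. */
struct kmalloc_args read_kmalloc_args(const struct mock_regs *regs)
{
    struct kmalloc_args args;
    assert(regs != NULL);
    args.size  = (size_t)regs->ax;        /* size_t size */
    args.flags = (unsigned int)regs->dx;  /* gfp_t flags */
    return args;
}
```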
