I am currently following along with this tutorial,
but I'm not a student of that school.
GDB gives me a segmentation fault in thread_start on the line:
movq %rsp, (%rdi) # save sp in old thread's tcb
Here's additional info when I backtrace:
#0 thread_start () at thread_start.s:16
#1 0x0000000180219e83 in _cygtls::remove(unsigned int)::__PRETTY_FUNCTION__
() from /usr/bin/cygwin1.dll
#2 0x00000000ffffcc6b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Being a newbie, I can't for the life of me figure out why. Here is my main file:
#include <stdio.h>
#include <stdlib.h>
#define STACK_SIZE (1024*1024)
//Thread TCB
struct thread {
unsigned char * stack_pointer;
void(*initial_function)(void *);
void * initial_argument;
};
struct thread * current_thread;
struct thread * inactive_thread;
void thread_switch(struct thread * old_t, struct thread * new_t);
void thread_start(struct thread * old_t, struct thread * new_t);
void yield() {
//swap threads
struct thread * temp = current_thread;
current_thread = inactive_thread;
inactive_thread = temp;
thread_switch(inactive_thread, current_thread);
}
void thread_wrap() {
// call the thread's function
current_thread->initial_function(current_thread->initial_argument);
yield();
}
int factorial(int n) {
return n == 0 ? 1 : n * factorial(n - 1);
}
// computes and prints the factorial
void fun_with_threads(void * arg) {
int n = *(int*)arg;
printf("%d! = %d\n", n, factorial(n));
}
int main() {
//allocate memory for threads
inactive_thread = (struct thread*) malloc(sizeof(struct thread));
current_thread = (struct thread*) malloc(sizeof(struct thread));
// argument for factorial
int *p= (int *) malloc(sizeof(int));
*p = 5;
// initialise thread
current_thread->initial_argument = p;
current_thread->initial_function = fun_with_threads;
current_thread->stack_pointer = ((unsigned char*) malloc(STACK_SIZE)) + STACK_SIZE;
thread_start(inactive_thread, current_thread);
return 0;
}
Here's my asm code for thread_start
# Inline comment
/* Block comment */
# void thread_start(struct thread * old_t, struct thread * new_t);
.globl thread_start
thread_start:
pushq %rbx # callee-save
pushq %rbp # callee-save
pushq %r12 # callee-save
pushq %r13 # callee-save
pushq %r14 # callee-save
pushq %r15 # callee-save
movq %rsp, (%rdi) # save sp in old thread's tcb
movq (%rsi), %rsp # load sp from new thread
jmp thread_wrap
and thread_switch:
# Inline comment
/* Block comment */
# void thread_switch(struct thread * old_t, struct thread * new_t);
.globl thread_switch
thread_switch:
pushq %rbx # callee-save
pushq %rbp # callee-save
pushq %r12 # callee-save
pushq %r13 # callee-save
pushq %r14 # callee-save
pushq %r15 # callee-save
movq %rsp, (%rdi) # save sp in old thread's tcb
movq (%rsi), %rsp # load sp from new thread
popq %r15 # callee-restore
popq %r14 # callee-restore
popq %r13 # callee-restore
popq %r12 # callee-restore
popq %rbp # callee-restore
popq %rbx # callee-restore
ret # return
You're on cygwin, right? It uses the Windows x64 calling convention by default, not the System V x86-64 psABI. So your args aren't in %rdi and %rsi.
The calling convention is Windows x64, but the ABI is slightly different: long is 64 bit, so it's LP64 not LLP64. See the cygwin docs.
You could override the default with __attribute__((sysv_abi)) on the prototype, but that only works for compilers that understand GNU C.
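As a minimal sketch of the attribute approach, assuming GCC or Clang targeting x86-64: the SYSV_ABI macro and the sum_sysv function are hypothetical names introduced here for illustration, while the two prototypes are the ones from the question.

```c
/* Hypothetical helper macro: request the System V convention where GNU C
   on x86-64 understands the attribute, and expand to nothing elsewhere. */
#if defined(__GNUC__) && defined(__x86_64__)
#define SYSV_ABI __attribute__((sysv_abi))
#else
#define SYSV_ABI
#endif

struct thread;  /* the TCB from the question, left opaque here */

/* The hand-written assembly reads old_t from %rdi and new_t from %rsi,
   so the callers must be told to pass them that way. */
SYSV_ABI void thread_switch(struct thread *old_t, struct thread *new_t);
SYSV_ABI void thread_start(struct thread *old_t, struct thread *new_t);

/* An ordinary C function carrying the same attribute, only to show that
   callers and callee agree on the convention without further changes. */
SYSV_ABI long sum_sysv(long a, long b) {
    return a + b;
}
```

With the attribute on the prototypes, the compiler places the arguments where the assembly expects them at every call site; the assembly itself stays untouched.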
Agner Fog's calling convention guide has some suggestions on how to write source code that assembles to working functions on Windows vs. non-Windows. The most straightforward thing is to use an #ifdef to choose different function prologues.
This Intel intro to x64 assembly is somewhat Windows-centric, and details the Windows x64 __fastcall calling convention.
(It's followed by examples and stuff. It's a pretty big and good tutorial that starts from very basic stuff, including how to use tools like an assembler. I'd recommend it for learning x86-64 asm in a Windows dev environment, and maybe in general.)
Windows x64 __fastcall (like x64 __vectorcall but doesn't pass vectors in vector regs)
RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right
XMM0, 1, 2, and 3 are used for floating point arguments.
Additional arguments are pushed on the stack right to left.
Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
It is the caller's responsibility to allocate 32 bytes of "shadow space" (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
It is the caller's responsibility to clean the stack after the call.
Integer return values (similar to x86) are returned in RAX if 64 bits or less.
Floating point return values are returned in XMM0.
Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
The stack is 16-byte aligned. The "call" instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls. RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
Note there is no calling convention for the floating point (and thus MMX) registers.
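To make the difference between the two conventions concrete, here is a small lookup sketch in C. The enum and function names are made up for this example; the register orders are the ones listed above for Windows x64 and the usual System V order (%rdi, %rsi, %rdx, %rcx, %r8, %r9).

```c
#include <stddef.h>

enum x64_abi { ABI_SYSV, ABI_WIN64 };

/* Name of the register that carries the index-th (0-based) integer or
   pointer argument, or NULL when that argument goes on the stack. */
const char *int_arg_reg(enum x64_abi abi, size_t index) {
    static const char *const sysv[]  = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
    static const char *const win64[] = { "rcx", "rdx", "r8", "r9" };

    if (abi == ABI_SYSV)
        return index < sizeof sysv / sizeof sysv[0] ? sysv[index] : NULL;
    return index < sizeof win64 / sizeof win64[0] ? win64[index] : NULL;
}
```

For thread_start(old_t, new_t) from the question, int_arg_reg(ABI_WIN64, 0) and int_arg_reg(ABI_WIN64, 1) give rcx and rdx, which is where Cygwin's default convention puts the two pointers, not rdi and rsi.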
Further details (varargs, exception handling, stack unwinding) are at Microsoft's site.
Links to MS's calling-convention docs in the x86 tag wiki (along with System V ABI docs, and tons of other good stuff).
See also Why does Windows64 use a different calling convention from all other OSes on x86-64?
Related
I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have a function to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Linux, x86_64 SysV ABI.
extern void restore_context(struct fiber_context*);
extern void create_context(struct fiber_context*);
void foo_fiber()
{
printf("Called as a fiber");
exit(0);
}
int main()
{
const uint32_t stack_size = 4096 * 16;
const uint32_t red_zone_abi = 128;
char* stack = aligned_alloc(16, stack_size);
char* sp = stack + stack_size - red_zone_abi;
struct fiber_context c = {0};
c.rip = (void*)foo_fiber;
c.rsp = (void*)sp;
restore_context(&c);
}
where restore_context code is as follows:
.type restore_context, @function
.global restore_context
restore_context:
movq 8*0(%rdi), %r8
# Load new stack pointer.
movq 8*1(%rdi), %rsp
# Load preserved registers.
movq 8*2(%rdi), %rbx
movq 8*3(%rdi), %rbp
movq 8*4(%rdi), %r12
movq 8*5(%rdi), %r13
movq 8*6(%rdi), %r14
movq 8*7(%rdi), %r15
# Push RIP to stack for RET.
pushq %r8
xorl %eax, %eax
ret
So basically I am creating a new stack on the heap, and since the stack grows downwards, I take the end address minus 128 bytes of red zone (which is necessary in the ABI). What restore_context does is simply switch %rsp to my new stack, push the address of foo_fiber onto it, and then ret to jump into foo_fiber. (It also loads some registers from the fiber_context structure, but that should not matter now.)
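For reference, the offsets used by restore_context imply a structure layout along the following lines. This is a sketch under assumptions: only rip and rsp are named in the code above, so the remaining field names are guesses that follow the order of the loads, and pointers are assumed to be 8 bytes wide.

```c
#include <stddef.h>

/* Assumed layout: rip and rsp appear in the question; the other field
   names are hypothetical and mirror the order of the movq loads. */
struct fiber_context {
    void *rip, *rsp;
    void *rbx, *rbp, *r12, *r13, *r14, *r15;
};

/* Compile-time checks that the C layout matches the 8*N(%rdi) offsets
   the assembly relies on (C11 _Static_assert). */
_Static_assert(offsetof(struct fiber_context, rip) == 8 * 0, "rip at 8*0");
_Static_assert(offsetof(struct fiber_context, rsp) == 8 * 1, "rsp at 8*1");
_Static_assert(offsetof(struct fiber_context, rbx) == 8 * 2, "rbx at 8*2");
_Static_assert(offsetof(struct fiber_context, r15) == 8 * 7, "r15 at 8*7");
```

Keeping such checks next to the struct means a reordered field fails the build instead of silently loading the wrong register.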
From what I'm seeing in GDB, the program manages to properly jump to foo_fiber and into printf, and then it crashes in __vfprintf_internal on movaps %xmm1, 0x10(%rsp).
   0x7ffff7e2f389 <__vfprintf_internal+153> movdqu (%rax),%xmm1
   0x7ffff7e2f38d <__vfprintf_internal+157> movups %xmm1,0x128(%rsp)
   0x7ffff7e2f395 <__vfprintf_internal+165> mov 0x10(%rax),%rax
  >0x7ffff7e2f399 <__vfprintf_internal+169> movaps %xmm1,0x10(%rsp)
I find that extremely odd since it managed movups %xmm1, 0x128(%rsp) so a much higher offset from stack pointer. What is going on there?
If I change the code of foo_fiber to do something else, for example allocate and randomly fill char[100], it works.
I am kind of at a loss about what is going on. At first I thought I might have alignment issues, since the vector xmm functions are crashing, so I changed malloc to aligned_alloc. The crash I am getting is a SIGSEGV, but 0x10
Agree with comments: your stack alignment is incorrect.
It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.
Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.
You could subtract 8 from the sp computed in your C code.
Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just
callq *(8*0)(%rdi)
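If you go the "subtract 8 in C" route instead, the computation could look like the sketch below. The function name fiber_entry_sp is hypothetical; it rounds the top of the buffer down to a 16-byte boundary and then backs off 8 bytes, so the fiber's entry function sees the same alignment it would see right after a call.

```c
#include <stddef.h>
#include <stdint.h>

/* Initial stack pointer for a function that is entered with jmp or ret
   instead of call. stack_base and stack_size describe a caller-owned
   buffer; the result lies inside that buffer. */
char *fiber_entry_sp(char *stack_base, size_t stack_size) {
    uintptr_t top = (uintptr_t)stack_base + stack_size;
    top &= ~(uintptr_t)15;     /* round down to a 16-byte boundary */
    return (char *)(top - 8);  /* as if call had pushed a return address */
}
```

Note that with the push/ret sequence in restore_context, the value stored in c.rsp has to be 16-byte aligned itself: the pushq moves it down by 8 and the ret moves it back up, so the adjustment above applies to the stack pointer at the moment foo_fiber starts running.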
I'm messing around with nasm, and after doing a hello world with no problem, I thought I'd try to do some C integration.
I'm using C to open a file, and I then want to use the pointer returned for the open file to process the text. However, when I call fgetc with the pointer in rdi, I get a "no such file or directory", followed by a segfault.
What am I doing wrong?
int64_t asmFunc(FILE* a, char* b);
int main()
{
int num;
FILE *fptr;
size_t line_buf_size = 0;
char *ret = malloc(100);
fptr = fopen("./test.txt","r");
if(fptr == NULL)
{
printf("Error!");
exit(1);
}
printf("%ld", asmFunc(fptr, ret));
return 0;
}
global asmFunc
section .text
extern fgetc
asmFunc:
call fgetc ; segfault occurs here.
(...)
ret
The first instruction of asmFunc isn't a call either, but I removed some setup stuff for later operations to make it easier to read.
Well that just defeats the entire purpose of an MCVE. You need to re-run your test after simplifying to make sure it still shows the same problem as your full version. But for this answer, I'll assume your setup didn't clobber the fptr arg in RDI or modify RSP.
asmFunc:
call fgetc ; segfault occurs here.
fptr will still be in RDI, where your caller passed it, so that's correct for int fgetc(FILE *fp).
So presumably fgetc is segfaulting because you called it with a misaligned stack. (It was 16-byte aligned before the call that jumped to asmFunc, but you don't have an odd number of pushes or any sub rsp, 8*n). Modern builds of glibc actually do depend on 16-byte alignment for scanf (glibc scanf Segmentation faults when called from a function that doesn't align RSP) so it's easy to imagine that fgetc includes code that also compiles to include a movaps of something on the stack.
Once you fix this bug, you will have the problem that call fgetc destroys your char *ret arg, because your caller passes it in RSI. Arg-passing registers are call-clobbered; see "What registers are preserved through a linux x86-64 function call".
asmFunc: ; (FILE *fptr, char *ret)
push rsi ; save ret
call fgetc
pop rsi
mov [rsi], al
ret
A C compiler would normally save/restore RBX and use mov to save ret there.
asmFunc: ; (FILE *fptr, char *ret)
push rbx
mov rbx, rsi ; save ret
call fgetc
mov [rbx], al
pop rbx ; restore rbx
ret
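For comparison, here is roughly what those two assembly versions compute, written in C. The name asmFunc_ref is made up, and returning the fgetc result is an assumption, since the question elides what asmFunc actually returns.

```c
#include <stdint.h>
#include <stdio.h>

/* C reference for the assembly above: read one character from fp and
   store its low byte through out. The return value is an assumption. */
int64_t asmFunc_ref(FILE *fp, char *out) {
    int c = fgetc(fp);  /* the call may clobber every arg-passing register */
    *out = (char)c;     /* mov [rsi], al  or  mov [rbx], al */
    return c;
}
```

Compiling this with gcc -O2 -S is an easy way to see how the compiler keeps out alive across the call and keeps the stack aligned.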
However, when I call fgetc with the pointer in rdi, I get a "no such file or directory", followed by a segfault.
No idea how you're getting "no such file or directory". Is that from your debugger looking for source for glibc functions? If it's part of what your program itself prints, that makes near zero sense, because you do exit(1) correctly when fptr == NULL. And you don't use perror() or anything else that looks up errno codes to generate standard error strings.
You need to learn and follow the calling conventions documented in the Linux x86-64 ABI specification, in particular its §3.2.3 Parameter passing section. So the pointer value fptr is in %rdi, and the pointer value ret is in %rsi, and you probably should push a call frame for your asmFunc.
Read also the x86 calling conventions wikipage.
If you are able to code the equivalent (even a simplified one) of asmFunc in C in some example.c file, I recommend compiling it with gcc -O -fverbose-asm -Wall -S example.c and looking into the emitted example.s assembler file for inspiration. Most of the time, the first machine instruction of such a function is not a call (but something, called the function prologue, changing the stack pointer %rsp and allocating some call frame on the call stack).
For example, on my Linux/Debian/x86-64 with gcc-8
void asmfunc(FILE* fil, char*s) {
fputc ('\t', fil);
fputs (s, fil);
fputc ('\n', fil);
fflush (fil);
}
is compiled into:
.text
.globl asmfunc
.type asmfunc, @function
asmfunc:
.LFB11:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx #
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
subq $8, %rsp #,
.cfi_def_cfa_offset 32
movq %rdi, %rbx # fil, fil
movq %rsi, %rbp # s, s
# /tmp/example.c:4: fputc ('\t', fil);
movq %rdi, %rsi # fil,
movl $9, %edi #,
call fputc@PLT #
# /tmp/example.c:5: fputs (s, fil);
movq %rbx, %rsi # fil,
movq %rbp, %rdi # s,
call fputs@PLT #
# /tmp/example.c:6: fputc ('\n', fil);
movq %rbx, %rsi # fil,
movl $10, %edi #,
call fputc@PLT #
# /tmp/example.c:7: fflush (fil);
movq %rbx, %rdi # fil,
call fflush@PLT #
# /tmp/example.c:8: }
addq $8, %rsp #,
.cfi_def_cfa_offset 24
popq %rbx #
.cfi_def_cfa_offset 16
popq %rbp #
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE11:
.size asmfunc, .-asmfunc
.ident "GCC: (Debian 8.3.0-6) 8.3.0"
Notice however that in some cases, GCC is capable (e.g. with -O2) of tail-call optimizations and might call some leaf-functions specially.
I'm trying to write a scheduler to run what we call "fibers".
Unfortunately, I'm not really used to writing inline assembly.
typedef struct fiber {
//fiber's stack
long rsp;
long rbp;
//next fiber in ready list
struct fiber *next;
} fiber;
//currently executing fiber
fiber *fib;
So the very first task is - obviously - creating a fiber for the main function so it can be suspended.
int main(int argc, char* argv[]){
//create fiber for main function
fib = malloc(sizeof(*fib));
__asm__(
"movq %%rsp, %0;"
"movq %%rbp, %1;"
: "=r"(fib->rsp),"=r"(fib->rbp)
);
//jump to actual main and execute
__asm__(...);
}
This gets compiled to
movl $24, %edi #,
call malloc #
#APP
# 27 "scheduler.c" 1
movq %rsp, %rcx;movq %rbp, %rdx; # tmp92, tmp93
# 0 "" 2
#NO_APP
movq %rax, fib(%rip) # tmp91, fib
movq %rcx, (%rax) # tmp92, MEM[(struct fiber *)_3].rsp
movq %rdx, 8(%rax) # tmp93, MEM[(struct fiber *)_3].rbp
Why does this compile movs into temporary registers? Can I somehow get rid of them?
The first version of this question had asm output from gcc -O0, with even more instructions and temporaries.
Turning on optimisations does not get rid of them.
turning them on does not get rid of the temporaries
It did get rid of some extra loads and stores. The fib is of course still there in memory since you declared that as a global variable. The rax is the return value from the malloc that must be assigned to the fib in memory. The other two lines write into your fib members which are also required.
Since you specified register outputs the asm block can't write directly into memory. That's easy to fix with a memory constraint though:
__asm__(
"movq %%rsp, %0;"
"movq %%rbp, %1;"
: "=m"(fib->rsp),"=m"(fib->rbp)
);
This will generate:
call malloc
movq %rax, fib(%rip)
movq %rsp, (%rax)
movq %rbp, 8(%rax)
By chance, when reading the assembler listing of a sample C program, I noted that the stack pointer is not 16-byte aligned before calling function foo:
void foo() { }
int func(int p) { foo(); return p; }
int main() { return func(1); }
func:
pushq %rbp
movq %rsp, %rbp
subq $8, %rsp ; See here
movl %edi, -4(%rbp)
movl $0, %eax
call foo
movl -4(%rbp), %eax
leave
ret
The subq $8, %rsp instruction makes RSP not aligned before calling foo (it should be "subq $16, %rsp").
In System V ABI, par. 3.2.2, I read: "the value (%rsp − 8) is always a multiple of 16 when control is transferred to the function entry point".
Can someone help me understand why gcc doesn't emit subq $16, %rsp?
Thank you in advance.
Edit:
I forgot to mention my OS and compiler version:
Debian wheezy, gcc 4.7.2
Assuming that the stack pointer is 16-byte aligned when func is entered, then the combination of
pushq %rbp ; <- 8 bytes
movq %rsp, %rbp
subq $8, %rsp ; <- 8 bytes
will keep it 16-byte aligned for the subsequent call to foo().
It seems that since the compiler knows about the implementation of foo() and that it's a noop, it's not bothering with the stack alignment. If foo() is seen as only a declaration or prototype in the translation unit where func() is compiled you'll see your expected stack alignment.
I compiled the following C code:
typedef struct {
long x, y, z;
} Foo;
long Bar(Foo *f, long i)
{
return f[i].x + f[i].y + f[i].z;
}
with the command gcc -S -O3 test.c. Here is the Bar function in the output:
.section __TEXT,__text,regular,pure_instructions
.globl _Bar
.align 4, 0x90
_Bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
leaq (%rsi,%rsi,2), %rcx
movq 8(%rdi,%rcx,8), %rax
addq (%rdi,%rcx,8), %rax
addq 16(%rdi,%rcx,8), %rax
popq %rbp
ret
Leh_func_end1:
I have a few questions about this assembly code:
1. What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
2. Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
imulq $88, %rsi, %rcx
The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what @ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, but rather to the absolute address that the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array has to be accessed, then the byte offset to each structure is i*24. Multiplying by 24 in this case is achieved by a combination of lea and SIB addressing. The lea instruction calculates i*3, then every subsequent instruction takes that i*3 and scales it by 8, thereby reaching the needed byte offset, and uses immediate displacements to access the individual structure members ((%rdi,%rcx,8), 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the size of the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any kind of addressing. The compiler simply assumes that a single imulq will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.
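The same address arithmetic can be spelled out in C. This is only an illustration of what the generated code does, assuming long is 8 bytes (so Foo is 24 bytes); foo_offset and bar_explicit are names invented for the sketch.

```c
#include <stddef.h>

typedef struct {
    long x, y, z;
} Foo;

/* Byte offset of f[i], computed the way the compiled code does it. */
size_t foo_offset(size_t i) {
    size_t i3 = i + i * 2;  /* leaq (%rsi,%rsi,2), %rcx  gives i*3 */
    return i3 * 8;          /* scale factor 8 in (%rdi,%rcx,8) gives i*24 */
}

/* Same result as Bar, with the displacements 0, 8 and 16 made explicit. */
long bar_explicit(const Foo *f, long i) {
    const char *base = (const char *)f + foo_offset((size_t)i);
    return *(const long *)(base + 0)
         + *(const long *)(base + 8)
         + *(const long *)(base + 16);
}
```

With an 88-byte Foo there is no pair of lea scale and addressing-mode scale whose product is 88, which is why the multiply shows up.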
What is the purpose of pushq %rbp, movq %rsp, %rbp, and popq %rbp, if neither rbp nor rsp is used in the body of the function?
To keep track of the frames when you use a debugger. Add -fomit-frame-pointer to optimize (note that it should be enabled at -O3 but in a lot of gcc versions I used it is not).
3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)?
The leaq instruction is (essentially, and in this case) calculating k*a+b where "k" is 1, 2, 4, or 8 and "a" and "b" are registers. With "b" either omitted or the same register as "a", it can be used for structures of 1, 2, 3, 4, 5, 8, and 9 longs.
Larger structures like 16 longs may be optimizable by calculating the offset with 8 for "k" and doubling, but I do not know if that is what the compiler will actually do; you would have to test.
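That list of sizes can be checked with a few lines of C. The function below is a hypothetical helper, not anything the compiler exposes: it reports whether a multiplier m can be produced by one lea with scale k in {1, 2, 4, 8}, with the base register either absent (m = k) or equal to the index register (m = k + 1).

```c
#include <stdbool.h>

/* True when index*m can be formed by a single lea, i.e. when m is k or
   k + 1 for some scale factor k in {1, 2, 4, 8}. */
bool single_lea_multiplier(unsigned m) {
    for (unsigned k = 1; k <= 8; k *= 2) {
        if (m == k || m == k + 1)
            return true;
    }
    return false;
}
```

It returns true for 3 (the 24-byte Foo) and false for 11 (the 88-byte variant), matching the leaq versus imulq difference seen in the question. Whether a particular compiler chains several leas for other sizes is something you would still have to check in its output.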