C fibers crashing on printf

I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have functions to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Platform: Linux, x86_64 SysV ABI.
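#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Layout must match the offsets used by restore_context (8 bytes per slot). */
struct fiber_context {
    void *rip, *rsp;
    void *rbx, *rbp, *r12, *r13, *r14, *r15;
};

extern void restore_context(struct fiber_context*);
extern void create_context(struct fiber_context*);

void foo_fiber()
{
    printf("Called as a fiber");
    exit(0);
}

int main()
{
    const uint32_t stack_size = 4096 * 16;
    const uint32_t red_zone_abi = 128;
    char* stack = aligned_alloc(16, stack_size);
    char* sp = stack + stack_size - red_zone_abi;
    struct fiber_context c = {0};
    c.rip = (void*)foo_fiber;
    c.rsp = (void*)sp;
    restore_context(&c);
}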
The restore_context code is as follows:
.type restore_context, @function
.global restore_context
restore_context:
    # Load the entry point (instruction pointer).
    movq 8*0(%rdi), %r8
    # Load new stack pointer.
    movq 8*1(%rdi), %rsp
    # Load preserved registers.
    movq 8*2(%rdi), %rbx
    movq 8*3(%rdi), %rbp
    movq 8*4(%rdi), %r12
    movq 8*5(%rdi), %r13
    movq 8*6(%rdi), %r14
    movq 8*7(%rdi), %r15
    # Push RIP to stack for RET.
    pushq %r8
    xorl %eax, %eax
    ret
So basically I am creating a new stack on the heap, and since the stack grows downwards, I take the end address minus 128 bytes of red zone (which is necessary in the ABI). All restore_context does is switch %rsp to my new stack, push the address of foo_fiber onto it, and then execute ret to jump into foo_fiber. (It also loads some registers from the fiber_context structure, but that should not matter here.)
From what I'm seeing in GDB, the program manages to jump to foo_fiber and into printf, and then it crashes in __vfprintf_internal on movaps %xmm1,0x10(%rsp).
 0x7ffff7e2f389 <__vfprintf_internal+153>  movdqu (%rax),%xmm1
 0x7ffff7e2f38d <__vfprintf_internal+157>  movups %xmm1,0x128(%rsp)
 0x7ffff7e2f395 <__vfprintf_internal+165>  mov    0x10(%rax),%rax
>0x7ffff7e2f399 <__vfprintf_internal+169>  movaps %xmm1,0x10(%rsp)
I find that extremely odd, since the preceding movups %xmm1,0x128(%rsp), at a much higher offset from the stack pointer, succeeded. What is going on there?
If I change the code of foo_fiber to do something else, for example allocate a char[100] and fill it with random values, it works.
I am at a loss about what is going on. At first I thought I might have alignment issues, since the crashing instructions use the xmm vector registers, so I changed malloc to aligned_alloc. The crash I am getting is a SIGSEGV, but 0x10

I agree with the comments: your stack alignment is incorrect.
It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.
Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.
You could subtract 8 from the sp computed in your C code.
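A minimal sketch of that first option, assuming the restore_context shown in the question (it pushes the entry address and immediately executes ret, so the fiber starts with exactly the rsp stored in the context). The helper name fiber_initial_sp is hypothetical. It rounds the usable top of the stack down to 16 bytes and then subtracts 8, so foo_fiber is entered with rsp % 16 == 8, as if it had been reached by a call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute the value to store in fiber_context.rsp.
 * restore_context pushes the entry address and rets straight away, so the
 * fiber's first instruction runs with exactly this rsp. The SysV ABI expects
 * rsp % 16 == 8 at function entry (a call would have pushed 8 bytes). */
static void *fiber_initial_sp(char *stack, size_t stack_size, size_t reserve)
{
    assert(stack_size > reserve + 16);
    uintptr_t top = (uintptr_t)(stack + stack_size - reserve);
    top &= ~(uintptr_t)15;      /* round down to a 16-byte boundary */
    return (void *)(top - 8);   /* odd multiple of 8, like right after a call */
}
```

With the question's variables this would be c.rsp = fiber_initial_sp(stack, stack_size, red_zone_abi);. Rounding down first, rather than only subtracting 8, keeps the result correct even if the stack size or the reserved amount is later changed to something that is not a multiple of 16.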
Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just
callq *(8*0)(%rdi)

Related

Pointer to function points to a jmp instruction in C

I viewed the disassembly of my C code and found out that a pointer to a function actually points to a jmp instruction, not to the real start of the function in memory (the push ebp instruction that sets up the function's frame).
I have the following function (it does basically nothing; it's just an example):
int func2(int a, int b)
{
    return 1;
}
I tried to print the address of the function with printf("%p", &func2);
I looked at the disassembly of my code and found that the address that is printed is the address of the jmp instruction. I would like to get the address that represents the start of the function's frame. Is there any way to calculate it from the given address of the jmp instruction?
Moreover, I have the bytes that represent the jmp instruction.
011A11EF E9 CC 08 00 00 jmp func2 (011A1AC0h)
How can I get the address that represents the start of the function's frame in memory (011A1AC0h in this case) using only the address of the jmp instruction and the bytes of the jmp instruction itself? I read some information about it and found out that it is a relative jmp, which means that I need to add the value that the jmp holds to the address of the jmp instruction itself. I'm not sure whether that's a good direction for the solution, and if it is, how can I get the value that the jmp holds?
E9 (hex) is the Intel 64 and IA-32 opcode for a jmp instruction with a rel32 offset. The next four bytes contain the offset. Your disassembler shows them as “CC 08 00 00”, but they are stored in little-endian order, so read them reversed: the offset is 0x000008CC, which is 2252 in decimal. The offset is a signed 32-bit value that is added to the EIP register to obtain the address of the jump target. EIP contains the address of the next instruction to be executed.
So, in this specific case, take the address of the byte just beyond the jump instruction and add the 32-bit offset.
However:
I count 11 forms of the jmp instruction in the Intel 64 and IA-32 manual. Who knows what the compiler may use when you make a slight change to source or compiler switches and recompile? You would need to be prepared to decode any form of the jmp instruction, or perhaps other instructions the compiler might use.
Intel has some legacy segment features in its architecture. The code segment on your system might be one big thing so you do not have to worry about that, but I cannot provide assurance.
Your compiler might have used this jmp instruction as a convenient way to create a value for the pointer rather than using the routine’s entry point (the proper term for the instruction where function execution normally begins, not frame) because it makes the linker do the relocation work instead of requiring the compiler to insert instructions to do that work at run-time (specifically, at the time the function address must be evaluated so it can be assigned to the pointer). This is somewhat of a guess, but the compiler might do something else next time. You are treading significantly outside normal computing.
I'm not sure I understand your question, but take this sample:
#include <stdio.h>

int foo(int x)
{
    return x+1;
}

int main(int argc, char** argv)
{
    printf("foo = %p\n", foo);
    return 0;
}
Which produces the following disassembly:
foo(int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    -4(%rbp), %eax
    addl    $1, %eax
    popq    %rbp
    ret
.LC0:
    .string "foo = %p\n"
main:
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp
    movl    %edi, -4(%rbp)
    movq    %rsi, -16(%rbp)
    movl    $foo(int), %esi   # pass the label argument (2) to printf
    movl    $.LC0, %edi       # pass the format argument (1) to printf
    movl    $0, %eax
    call    printf
    movl    $0, %eax
    leave
    ret
As you can see, only the label is passed to printf. The label is resolved to an address by the toolchain when the code is assembled and linked.
Also note that it will be hard for you to get an absolute address in a running binary: ASLR (Address Space Layout Randomization) will choose a random base address for the binary. The offsets inside the binary still hold, hence relative calls.
On x86 machines, E9 is the opcode for JMP rel16/32, so the CPU will use the value 0x000008CC as the jump offset. The base address is the address of the instruction following the JMP instruction.

How does the stack work?

From what I understand, the stack is used by a function to store all the local variables that are declared.
I also understood that the bottom of the stack corresponds to the largest addresses, and the top to the smallest ones.
So, let's say I have this C program:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]){
    FILE *file1 = fopen("~/file.txt", "rt");
    char buffer[10];
    printf(argv[1]);
    fclose(file1);
    return 0;
}
Where would the pointer named "file1" be on the stack compared to the pointer named "buffer"? Would it be higher up in the stack (smaller address) or further down (larger address)?
Also, I know that printf(), when given format specifiers (like %d or %s), will read from the stack, but in this example where will it start reading?
Wiki article:
http://en.wikipedia.org/wiki/Stack_(abstract_data_type)
The wiki article makes an analogy to a stack of objects, where the top of the stack is the only object you can see (peek) or remove (pop), and where you would add (push) another object onto.
For a typical implementation of a stack, the stack starts at some address and the address decreases as elements are pushed onto the stack. A push typically decrements the stack pointer before storing an element onto the stack, and a pop typically loads an element from the stack and increments the stack pointer after.
However, a stack could also grow upwards, where a push stores an element then increments the stack pointer after, and a pop would decrement the stack pointer before, then load an element from the stack. This is a common way to implement a software stack using an array, where the stack pointer could be a pointer or an index.
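To make the two conventions concrete, here is a small array-based sketch in C (the type and function names are illustrative, not from any library). The descending version decrements before storing and loads before incrementing, like the x86 hardware stack; the ascending version stores and then increments, which is the usual way to build a software stack on an array with an index as the stack pointer.

```c
#include <assert.h>
#include <stddef.h>

#define CAP 8

/* Descending stack: sp starts one past the end and moves toward index 0. */
struct down_stack { int slots[CAP]; size_t sp; };

static void down_init(struct down_stack *s) { s->sp = CAP; }

static void down_push(struct down_stack *s, int v)
{
    assert(s->sp > 0);              /* full when sp reaches 0 */
    s->slots[--s->sp] = v;          /* decrement first, then store */
}

static int down_pop(struct down_stack *s)
{
    assert(s->sp < CAP);            /* empty when sp is back at CAP */
    return s->slots[s->sp++];       /* load first, then increment */
}

/* Ascending stack: sp is the index of the next free slot. */
struct up_stack { int slots[CAP]; size_t sp; };

static void up_init(struct up_stack *s) { s->sp = 0; }

static void up_push(struct up_stack *s, int v)
{
    assert(s->sp < CAP);
    s->slots[s->sp++] = v;          /* store first, then increment */
}

static int up_pop(struct up_stack *s)
{
    assert(s->sp > 0);
    return s->slots[--s->sp];       /* decrement first, then load */
}
```

Pushing 1, 2, 3 onto either stack and popping returns 3 first. The only visible difference is that in the descending version, more recently pushed elements live at lower addresses.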
Back to the original question, there's no rule on the ordering of local variables on a stack. Typically the total size of all local variables is subtracted from the stack pointer, and the local variables are accessed as offsets from the stack pointer (or a register copy of the stack pointer, such as bp, ebp, or rbp in the case of an x86 processor).
The C language definition does not specify how objects are to be laid out in memory, nor does it specify how arguments are to be passed to functions (the words "stack" and "heap" don't appear anywhere in the language definition itself). That is entirely a function of the compiler and the underlying platform. The answer for x86 may be different from the answer for M68K which may be different from the answer for MIPS which may be different from the answer for SPARC which may be different from the answer for an embedded controller, etc.
All the language definition specifies is lifetime of objects (when storage for an object is allocated and how long it lasts) and the linkage and visibility of identifiers (linkage controls whether multiple instances of the same identifier refer to the same object, visibility controls whether that identifier is usable at a given point).
Having said all that, almost any desktop or server system you're likely to use will have a runtime stack. Also, C was initially developed on a system with a runtime stack, and much of its behavior certainly implies a stack model. A C compiler would be a bugger to implement on a system that didn't use a runtime stack.
I also understood that the bottom of the stack corresponds to the largest addresses, and the top to the smallest ones.
That doesn't have to be true at all. The top of the stack is simply the place something was most recently pushed. Stack elements don't even have to be consecutive in memory (such as when using a linked-list implementation of a stack). On x86, the runtime stack grows "downwards" (towards decreasing addresses), but don't assume that's universal.
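For instance, a linked-list stack in C has a well-defined top but no meaningful "upward" or "downward" direction, because each element lives in its own malloc'd node. A minimal sketch (the names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Each node is a separate heap allocation, so consecutive stack elements
 * need not be adjacent (or in any particular order) in memory. */
struct node { int value; struct node *next; };

/* Push: the new node becomes the top. Returns 0 on allocation failure. */
static int list_push(struct node **top, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->value = value;
    n->next = *top;
    *top = n;
    return 1;
}

/* Pop: remove the top node and return its value (stack must be non-empty). */
static int list_pop(struct node **top)
{
    struct node *n = *top;
    assert(n != NULL);
    int value = n->value;
    *top = n->next;
    free(n);
    return value;
}
```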
Where would the pointer named "file1" be on the stack compared to the pointer named "buffer"? Would it be higher up in the stack (smaller address) or further down (larger address)?
First, the compiler is not required to lay out distinct objects in memory in the same order that they were declared; it may re-order those objects to minimize padding and alignment issues (struct members must be laid out in the order declared, but there may be unused "padding" bytes between members).
Secondly, only file1 is a pointer. buffer is an array, so space will only be allocated for the array elements themselves - no space is set aside for any pointer.
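Both guarantees can be checked from standard C: offsetof reports struct member offsets, which always increase in declaration order (possibly with padding in between), and sizeof applied to an array gives the size of its elements rather than the size of a pointer. A minimal sketch with illustrative names; note that neither check says anything about where the compiler places separate local variables, which is exactly what the language leaves unspecified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Struct members keep their declaration order, though padding may appear. */
struct example {
    char c;     /* always at offset 0 */
    double d;   /* after c, typically after some padding bytes */
    short s;    /* after d */
};

/* Returns 1 if the member offsets increase in declaration order. */
static int members_in_declared_order(void)
{
    return offsetof(struct example, c) == 0 &&
           offsetof(struct example, c) < offsetof(struct example, d) &&
           offsetof(struct example, d) < offsetof(struct example, s);
}

/* buffer is an array: sizeof gives the bytes of its 10 elements, and no
 * separate pointer object is stored. file1 is a pointer object. */
static void array_vs_pointer(size_t *array_bytes, size_t *pointer_bytes)
{
    FILE *file1;
    char buffer[10];
    *array_bytes = sizeof buffer;    /* sizeof does not evaluate its operand */
    *pointer_bytes = sizeof file1;
}
```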
Also, I know that printf(), when given format specifiers (like %d or %s), will read from the stack, but in this example where will it start reading?
It may not read arguments from the stack at all. For example, Linux on x86-64 uses the System V AMD64 ABI calling convention, which passes the first six arguments via registers.
If you're really curious how things look on a particular platform, you need to a) read up on that platform's calling conventions, and b) look at the generated machine code. Most compilers have an option to output a machine code listing. For example, we can take your program and compile it as
gcc -S file.c
which creates a file named file.s containing the following (lightly edited) output:
    .file "file.c"
    .section .rodata
.LC0:
    .string "rt"
.LC1:
    .string "~/file.txt"
    .text
    .globl main
    .type main, @function
main:
.LFB2:
    pushq %rbp              ;; save the current base (frame) pointer
.LCFI0:
    movq %rsp, %rbp         ;; make the stack pointer the new base pointer
.LCFI1:
    subq $48, %rsp          ;; allocate an additional 48 bytes on the stack
.LCFI2:
    movl %edi, -36(%rbp)    ;; since we use the contents of the %rdi (%edi) and %rsi (%esi) registers
    movq %rsi, -48(%rbp)    ;; below, we need to preserve their contents on the stack frame before overwriting them
    movl $.LC0, %esi        ;; Write the *second* argument of fopen to esi
    movl $.LC1, %edi        ;; Write the *first* argument of fopen to edi
    call fopen              ;; arguments to fopen are passed via register, not the stack
    movq %rax, -8(%rbp)     ;; save the result of fopen to file1
    movq $0, -32(%rbp)      ;; zero out the elements of buffer (I added
    movw $0, -24(%rbp)      ;; an explicit initializer to your code)
    movq -48(%rbp), %rax    ;; copy the pointer value stored in argv to rax
    addq $8, %rax           ;; offset 8 bytes (giving us the address of argv[1])
    movq (%rax), %rdi       ;; copy the value rax points to to rdi
    movl $0, %eax
    call printf             ;; like with fopen, arguments to printf are passed via register, not the stack
    movq -8(%rbp), %rdi     ;; copy file1 to rdi
    call fclose             ;; again, arguments are passed via register
    movl $0, %eax
    leave
    ret
Now, this is for my specific platform, which is Linux (SLES-10) on x86-64. This does not apply to different hardware/OS combinations.
EDIT
Just realized that I left out some important stuff.
The notation N(reg) means offset N bytes from the address stored in register reg (basically, reg acts as a pointer). %rbp is the base (frame) pointer - it basically acts as the "handle" for the current stack frame. Local variables and function arguments (assuming they are present on the stack) are accessed by offsetting from the address stored in %rbp. On x86, local variables typically have a negative offset from %rbp, while function arguments have a positive offset.
The memory for file1 starts at -8(%rbp) (pointers on x86-64 are 64 bits wide, so we need 8 bytes to store it). That's fairly easy to determine based on the lines
call fopen
movq %rax, -8(%rbp)
On x86, function return values are written to %rax or %eax (%eax is the lower 32 bits of %rax). So the result of fopen is written to %rax, and we copy the contents of %rax to -8(%rbp).
The location for buffer is a little trickier to determine, since you don't do anything with it. I added an explicit initializer (char buffer[10] = {0};) just to generate some instructions that access it, and those are
movq $0, -32(%rbp)
movw $0, -24(%rbp)
From this, we can determine that buffer starts at -32(%rbp) and occupies -32 through -23. Since file1 occupies -8 through -1, there are 14 bytes of unused "padding" space (-22 through -9) between the end of buffer and the beginning of file1.
Again, this is how things play out on my specific system; you may see something different.
Very implementation-dependent, but the variables still end up near each other. In fact, this is crucial to setting up buffer-overflow-based attacks.

x86-64 segmentation fault saving stack pointer

I am currently following along with this tutorial, but I'm not a student of that school.
GDB gives me a segmentation fault in thread_start on the line:
movq %rsp, (%rdi) # save sp in old thread's tcb
Here's additional info when I backtrace:
#0 thread_start () at thread_start.s:16
#1 0x0000000180219e83 in _cygtls::remove(unsigned int)::__PRETTY_FUNCTION__
() from /usr/bin/cygwin1.dll
#2 0x00000000ffffcc6b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Being a newbie, I can't for the life of me figure out why. Here is my main file:
#include <stdio.h>
#include <stdlib.h>

#define STACK_SIZE (1024*1024)

//Thread TCB
struct thread {
    unsigned char * stack_pointer;
    void(*initial_function)(void *);
    void * initial_argument;
};

struct thread * current_thread;
struct thread * inactive_thread;

void thread_switch(struct thread * old_t, struct thread * new_t);
void thread_start(struct thread * old_t, struct thread * new_t);

void yield() {
    //swap threads
    struct thread * temp = current_thread;
    current_thread = inactive_thread;
    inactive_thread = temp;
    thread_switch(inactive_thread, current_thread);
}

void thread_wrap() {
    // call the thread's function
    current_thread->initial_function(current_thread->initial_argument);
    yield();
}

int factorial(int n) {
    return n == 0 ? 1 : n * factorial(n - 1);
}

// calls and print the factorial
void fun_with_threads(void * arg) {
    int n = *(int*)arg;
    printf("%d! = %d\n", n, factorial(n));
}

int main() {
    //allocate memory for threads
    inactive_thread = (struct thread*) malloc(sizeof(struct thread));
    current_thread = (struct thread*) malloc(sizeof(struct thread));
    // argument for factorial
    int *p= (int *) malloc(sizeof(int));
    *p = 5;
    // intialise thread
    current_thread->initial_argument = p;
    current_thread->initial_function = fun_with_threads;
    current_thread->stack_pointer = ((unsigned char*) malloc(STACK_SIZE)) + STACK_SIZE;
    thread_start(inactive_thread, current_thread);
    return 0;
}
Here's my asm code for thread_start
# Inline comment
/* Block comment */
# void thread_start(struct thread * old_t, struct thread * new_t);
.globl thread_start
thread_start:
    pushq %rbx          # callee-save
    pushq %rbp          # callee-save
    pushq %r12          # callee-save
    pushq %r13          # callee-save
    pushq %r14          # callee-save
    pushq %r15          # callee-save
    movq %rsp, (%rdi)   # save sp in old thread's tcb
    movq (%rsi), %rsp   # load sp from new thread
    jmp thread_wrap
and thread_switch:
# Inline comment
/* Block comment */
# void thread_switch(struct thread * old_t, struct thread * new_t);
.globl thread_switch
thread_switch:
    pushq %rbx          # callee-save
    pushq %rbp          # callee-save
    pushq %r12          # callee-save
    pushq %r13          # callee-save
    pushq %r14          # callee-save
    pushq %r15          # callee-save
    movq %rsp, (%rdi)   # save sp in old thread's tcb
    movq (%rsi), %rsp   # load sp from new thread
    popq %r15           # callee-restore
    popq %r14           # callee-restore
    popq %r13           # callee-restore
    popq %r12           # callee-restore
    popq %rbp           # callee-restore
    popq %rbx           # callee-restore
    ret                 # return
You're on Cygwin, right? It uses the Windows x64 calling convention by default, not the System V x86-64 psABI. So your args aren't in %rdi and %rsi; the first two are in %rcx and %rdx.
The calling convention is Windows x64, but the ABI is slightly different: long is 64 bits, so it's LP64, not LLP64. See the Cygwin docs.
You could override the default with __attribute__((sysv_abi)) on the prototype, but that only works for compilers that understand GNU C.
Agner Fog's calling convention guide has some suggestions on how to write source code that assembles to working functions on Windows vs. non-Windows. The most straightforward thing is to use an #ifdef to choose different function prologues.
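A minimal sketch of the attribute approach, selected with #if, assuming GCC or Clang on x86-64 (the macro name SYSV_ABI is hypothetical). Marking the prototypes this way leaves the asm untouched: the compiler then passes old_t in %rdi and new_t in %rsi at every call site, even on Cygwin, and on other compilers or architectures the macro expands to nothing.

```c
#include <assert.h>

struct thread;   /* as defined in the question */

/* Hypothetical macro: request the System V calling convention for functions
 * written in SysV-style asm, even where the platform default is Windows x64. */
#if defined(__x86_64__) && defined(__GNUC__)
#define SYSV_ABI __attribute__((sysv_abi))
#else
#define SYSV_ABI
#endif

/* Prototypes for the question's asm routines. */
void SYSV_ABI thread_switch(struct thread *old_t, struct thread *new_t);
void SYSV_ABI thread_start(struct thread *old_t, struct thread *new_t);

/* A C function with the same attribute, to show that calls to it from
 * ordinary C code need no special handling: the compiler adapts the call. */
static long SYSV_ABI add_sysv(long a, long b)
{
    return a + b;
}
```

This only changes how C code calls the asm routines; nothing inside thread_start itself changes. The #ifdef-selected asm prologues suggested above avoid depending on a GNU C extension.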
This Intel intro to x64 assembly is somewhat Windows-centric, and details the Windows x64 __fastcall calling convention.
(It's followed by examples and stuff. It's a pretty big and good tutorial that starts from very basic stuff, including how to use tools like an assembler. I'd recommend it for learning x86-64 asm in a Windows dev environment, and maybe in general.)
Windows x64 __fastcall (like x64 __vectorcall but doesn't pass vectors in vector regs)
RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
XMM0, 1, 2, and 3 are used for floating point arguments.
Additional arguments are pushed on the stack left to right.
Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
It is the caller's responsibility to allocate 32 bytes of "shadow space" (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
It is the caller's responsibility to clean the stack after the call.
Integer return values (similar to x86) are returned in RAX if 64 bits or less.
Floating point return values are returned in XMM0.
Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
The stack is 16-byte aligned. The "call" instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls. RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
Note there is no calling convention for the floating point (and thus MMX) registers.
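The 16n+8 rule in the list above can be turned into a small calculation. A sketch, assuming a non-leaf function that pushes no callee-saved registers (each extra push shifts the requirement by 8) and needs locals bytes of local storage plus the 32-byte shadow space for its own callees; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes to subtract from rsp in the prologue of a
 * Windows x64 non-leaf function that pushes no registers. On entry rsp is
 * 16n+8 (the call pushed the return address), so the adjustment must also be
 * of the form 16n+8 to make rsp 16-byte aligned at the function's own calls. */
static size_t win64_frame_adjust(size_t locals)
{
    size_t needed = locals + 32;                          /* locals + shadow space */
    size_t adjust = ((needed + 8 + 15) & ~(size_t)15) - 8; /* smallest 16n+8 >= needed */
    assert(adjust >= needed && adjust % 16 == 8);
    return adjust;
}
```

For example, a function with no locals at all still subtracts 40 bytes: 32 for the shadow space plus 8 bytes of padding to realign the stack.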
Further details (varargs, exception handling, stack unwinding) are at Microsoft's site.
Links to MS's calling-convention docs in the x86 tag wiki (along with System V ABI docs, and tons of other good stuff).
See also Why does Windows64 use a different calling convention from all other OSes on x86-64?

assembly - mov unitialized variable?

I have a hard time understanding a piece of code.
I was reading the xv6 lecture code at line 1054.
Here is the code :
.globl entry
entry:
    # Turn on page size extension for 4Mbyte pages
    movl %cr4, %eax
    orl $(CR4_PSE), %eax
    movl %eax, %cr4
    # Set page directory
    movl $(V2P_WO(entrypgdir)), %eax
    movl %eax, %cr3
    # Turn on paging.
    movl %cr0, %eax
    orl $(CR0_PG|CR0_WP), %eax
    movl %eax, %cr0
    # Set up the stack pointer.
    movl $(stack + KSTACKSIZE), %esp
    # Jump to main(), and switch to executing at
    # high addresses. The indirect call is needed because
    # the assembler produces a PC-relative instruction
    # for a direct jump.
    mov $main, %eax
    jmp *%eax

.comm stack, KSTACKSIZE
My question is:
How is it possible to do movl $(stack + KSTACKSIZE), %esp when stack is defined nowhere in the project except at line 1063, as a .comm symbol, and in a function that is called later and declares its own local variable named stack?
static void
startothers(void)
{
    char *stack; // THIS ONE IS A DIFFERENT BEAST, right ?
    ...
    // Tell entryother.S what stack to use, where to enter, and what
    // pgdir to use. We cannot use kpgdir yet, because the AP processor
    // is running in low memory, so we use entrypgdir for the APs too.
    stack = kalloc();
    *(void**)(code-4) = stack + KSTACKSIZE;
    *(void**)(code-8) = mpenter;
    *(int**)(code-12) = (void *) v2p(entrypgdir);
I may be missing a trick, but I don't understand when its address is set.
Is it at the linking stage that stack actually gets defined?
Thanks
Yes, .comm defines and allocates stack with the given KSTACKSIZE in the .bss section. Upon first execution, the code runs as-is and uses that stack. Judging from the function name startothers, I assume this is a multiprocessor bootup. Once the initial CPU has been brought up, it allocates a new stack for each other processor and modifies the code itself so that it uses the newly allocated one.
In my opinion it would be a lot less confusing if the entry used variables for these things.
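A rough C analogy for the boot CPU's part, as a sketch (the KSTACKSIZE value below is an assumption for illustration, and boot_stack_top is a hypothetical name). An uninitialized file-scope array plays the role of .comm: the linker assigns its address, it lives in zero-filled .bss, and no code runs to create it. With GCC's -fcommon (the default before GCC 10) such a definition is even emitted as a common symbol, like .comm; with -fno-common (the default since GCC 10) it is an ordinary .bss definition. Either way the address is fixed at link time.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; use the project's own definition. */
#define KSTACKSIZE 4096

/* Rough counterpart of `.comm stack, KSTACKSIZE`. */
char stack[KSTACKSIZE];

/* Counterpart of the operand in `movl $(stack + KSTACKSIZE), %esp`:
 * the address one past the end, because x86 stacks grow downward. */
static char *boot_stack_top(void)
{
    return stack + KSTACKSIZE;
}
```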

Calling main from assembly

I'm writing a small library intended to be used in place of libc in a small application. I've read the source of the major libc alternatives, but I am unable to get the parameter passing to work for the x86_64 architecture on Linux.
The library does not require any initialization step between _start and main. Since libc and its alternatives do use an initialization step, and my assembly knowledge is limited, I suspect the parameter reordering is causing me trouble.
This is what I've got, which contains assembly inspired from various implementations:
.text
.global _start
_start:
    /* Mark the outermost frame by clearing the frame pointer. */
    xorl %ebp, %ebp
    /* Pop the argument count of the stack and place it
     * in the first parameter-passing register. */
    popq %rdi
    /* Place the argument array in the second parameter-passing register. */
    movq %rsi, %rsp
    /* Align the stack at a 16-byte boundary. */
    andq $~15, %rsp
    /* Invoke main (defined by the host program). */
    call main
    /* Request process termination by the kernel. This
     * is x86 assembly but it works for now. */
    mov %ebx, %eax
    mov %eax, 1
    int $80
And the entry point is the ordinary main signature: int main(int argc, char* argv[]). Environment variables etc. are not required for this particular project.
The AMD64 ABI says rdi should be used for the first parameter, and rsi for the second.
How do I correctly setup the stack and pass the parameters to main on Linux x86_64? Thanks!
References:
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/sysdeps/x86_64/elf/start.S?view=markup
http://git.uclibc.org/uClibc/tree/libc/sysdeps/linux/x86_64/crt1.S
I think you got
/* Place the argument array in the second parameter-passing register. */
movq %rsi, %rsp
wrong. It should be
movq %rsp, %rsi # move argv to rsi, the second parameter in x86_64 abi
main is called by crt0.o; see also this question
The kernel is setting up the initial stack and process environment after execve as specified in the ABI document (architecture specific); the crt0 (and related) code is in charge of calling main.
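To illustrate the layout the kernel hands to _start on Linux x86-64, here is a C sketch that takes the initial stack pointer as an array of 8-byte words and recovers argc, argv and envp. The struct and function names are hypothetical, and the auxiliary vector that follows the environment is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* At the initial %rsp the kernel places argc, then argv[0..argc-1], a NULL,
 * the environment pointers, another NULL, and then the auxiliary vector
 * (not handled here). Each word is modelled as a char *. */
struct start_args {
    int argc;
    char **argv;
    char **envp;
};

static struct start_args split_initial_stack(char **sp)
{
    struct start_args a;
    a.argc = (int)(uintptr_t)sp[0];   /* first word: argument count (popped into %rdi) */
    a.argv = sp + 1;                  /* what %rsp points at after the pop: goes in %rsi */
    assert(a.argv[a.argc] == NULL);   /* argv is NULL-terminated */
    a.envp = a.argv + a.argc + 1;     /* environment starts after that NULL */
    return a;
}
```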
