using printf before and inside a loop x86-64 assembly

I'm having trouble figuring out how to use printf correctly in this function. So the function is called multInts and is supposed to multiply the first element of the first array with the first element of the second array and continue through the whole array. However, the lab instructions specify that I can't call printf in the main function. So, I need to print out the word "Products:\n" and then in each new line after that, print out the product. I don't know how to use printf within the loop, however. The instructor said that we should "call printf in the loop after calculating product" and also to "save and restore caller-save registers before calling printf," but I'm not sure what that means.
Here's what my code looks like so far:
.file "lab4.s"
.section .rodata
.LC0:
.string "Products: \n"
.LC1:
.string "%i \n"
.data
sizeIntArrays:
.long 5
sizeShortArrays:
.word 4
intArray1:
.long 10
.long 25
.long 33
.long 48
.long 52
intArray2:
.long 20
.long -37
.long 42
.long -61
.long -10
##### MAIN FUNCTION
.text
.globl main
.type main,@function
main:
pushq %rbp
movq %rsp, %rbp
#pass parameters and call other functions
movl sizeIntArrays, %edi #move size to registers for 1st parameter
leaq intArray1, %rsi #load effective address of intArray1 to register rsi
leaq intArray2, %rdx #load effective address of intArray2 to register rdx
call multInts #call multInts function
movq $0, %rax #return 0 to caller
movq %rbp, %rsp
popq %rbp
ret
.size main,.-main
##### MULTINTS
.globl multInts
.type multInts,@function
multInts:
pushq %rbp
movq %rsp, %rbp
#add code here for what the functions should do
movq $0, %r8 #initialize index for array access in caller save reg
movq $0, %rcx #initialize 8 byte caller save result reg
loop0:
cmpl %r8d, %edi #compare index to size
je exit0 #exit if equal
movslq (%rsi,%r8,4),%rax # Load a long into RAX
movslq (%rdx,%r8,4),%r11 # Load a long into R11
imulq %r11, %rax # RAX *= R11
addq %rax, %rcx # RCX += RAX
incq %r8 #increment index
jmp loop0
exit0:
movq $.LC0, %rdi
movq %rcx, %rsi
movq $0, %rax
call printf
movq %rbp, %rsp
popq %rbp
ret
.size multInts,.-multInts
What I've tried to do is just move the printf instruction to before the loop, but it gives me a segmentation fault when trying to run the loop because %rdi and %rsi contain the addresses of the arrays that need to be used in the loop. How do I get around that and which registers should I use? Also, how do I call printf within the loop?
The output should look something like this:
Products:
200
-925
1386
-2928
-520

Assume that printf clobbers all the call-clobbered registers (see "What registers are preserved through a linux x86-64 function call"), and keep anything that needs to survive from one iteration of the loop to the next in call-preserved registers (RBX, RBP, R12–R15), saving and restoring those in your own function.
Look at compiler output for an example: write a version of your loop in C and compile it with -Og.
Obviously you need to move the instructions that set up the args in registers (like the format string) along with the call printf.
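Following that suggestion, a minimal C sketch of the loop might look like the one below. The helper name product_at is hypothetical, and the signature of multInts is assumed from the three arguments that main passes in the question.

```c
#include <stdio.h>

/* Product of the i-th elements, widened to 64 bits before multiplying
   (this mirrors the movslq + imulq pair in the question's loop). */
long long product_at(const int *a, const int *b, int i)
{
    return (long long)a[i] * b[i];
}

/* Assumed C equivalent of the lab's multInts: print the header once,
   then one product per line. */
void multInts(int size, const int *a, const int *b)
{
    printf("Products:\n");
    for (int i = 0; i < size; i++)
        printf("%lld\n", product_at(a, b, i));
}
```

Compiling this with gcc -Og -S and reading the loop body shows which registers the compiler chose to keep live across the call to printf.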

The easiest way to protect a register from being overwritten by a subroutine is to push it. According to the System V AMD64 calling convention, printf may change any register except RBX, RBP, RSP and R12–R15. The registers you need to preserve are RAX, RDX, RSI, RDI, R8 and R11 (RCX is no longer needed), so push them before the call to printf and pop them afterwards:
pushq %rax
pushq %rdx
pushq %rsi
pushq %rdi
pushq %r8
pushq %r11
movq $.LC1, %rdi
movq %rax, %rsi
movq $0, %rax
call printf
popq %r11
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rax
Now, you can copy the block into the loop. For each printf, you have to think about what needs to be secured:
...
multInts:
pushq %rbp
movq %rsp, %rbp
#add code here for what the functions should do
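pushq %rdx # Preserve registers
pushq %rdi
pushq %rsi
subq $8, %rsp # Pad: three pushes alone leave RSP off 16-byte alignment at the call
movq $.LC0, %rdi # Format string (no further values)
movq $0, %rax # No vector registers used
call printf # Call C function
addq $8, %rsp # Drop the padding
popq %rsi # Restore registers
popq %rdi
popq %rdx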
movq $0, %r8 #initialize index for array access in caller save reg
loop0:
cmpl %r8d, %edi #compare index to size
je exit0 #exit if equal
movslq (%rsi,%r8,4),%rax # Load a long into RAX
movslq (%rdx,%r8,4),%r11 # Load a long into R11
imulq %r11, %rax # RAX *= R11
pushq %rax # Preserve registers
pushq %rdx
pushq %rsi
pushq %rdi
pushq %r8
pushq %r11
movq $.LC1, %rdi # Format string
movq %rax, %rsi # Value
movq $0, %rax # No vector registers used
call printf # Call C function
popq %r11 # Restore registers
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rax
incq %r8 #increment index
jmp loop0
exit0:
movq %rbp, %rsp
popq %rbp
ret
...
BTW: .string "%i \n" makes printf process only the lower 32 bits of RSI (RDI holds the format string). Use .string "%lli \n" instead.
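To see what the conversion specifier changes, here is a small C sketch. The helper seen_by_percent_i is hypothetical: it models "%i" reading only the low 32 bits of a 64-bit argument register, and its final cast assumes the wrap-around conversion that GCC and Clang implement.

```c
#include <stdint.h>
#include <stdio.h>

/* The value "%i" would print if the argument register held v:
   only the low 32 bits are looked at. */
int seen_by_percent_i(long long v)
{
    return (int)(int32_t)(uint32_t)v;
}

/* Print the same 64-bit value both ways for comparison. */
void show_both(long long v)
{
    printf("%%lli sees %lli\n", v);
    printf("%%i   sees %i\n", seen_by_percent_i(v));
}
```

The products in the question all fit in 32 bits, so either specifier prints the same digits there; the difference only appears once a product exceeds the int range.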

Related

How do registers work as arguments in assembly?

I am trying to understand how assembly works with arguments and return values.
So far, I have learnt that %eax holds the return value and that, to load a single argument, I need to load the effective address of %rip + offset into %rdi by using leaq var(%rip), %rdi .
To learn more about arguments, I created a c program that takes in 10 (11 arguments including the formatting string) to try and find out the order of registers. I then converted the C code into assembly using gcc on my Mac.
Here is the C code I used:
#include <stdio.h>
int main(){
printf("%s %s %s %s %s %s %s %s %s %s", "1 ", "2", "3", "4", "5", "6", "7", "8", "9", "10");
return 0;
}
And here is the assembly output:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 13
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
.cfi_offset %rbx, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
subq $8, %rsp
leaq L_.str.10(%rip), %r10
leaq L_.str.9(%rip), %r11
leaq L_.str.8(%rip), %r14
leaq L_.str.7(%rip), %r15
leaq L_.str.6(%rip), %rbx
leaq L_.str(%rip), %rdi
leaq L_.str.1(%rip), %rsi
leaq L_.str.2(%rip), %rdx
leaq L_.str.3(%rip), %rcx
leaq L_.str.4(%rip), %r8
leaq L_.str.5(%rip), %r9
movl $0, %eax
pushq %r10
pushq %r11
pushq %r14
pushq %r15
pushq %rbx
callq _printf
addq $48, %rsp
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
## -- End function
.section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
.asciz "%s %s %s %s %s %s %s %s %s %s"
L_.str.1: ## @.str.1
.asciz "1 "
L_.str.2: ## @.str.2
.asciz "2"
L_.str.3: ## @.str.3
.asciz "3"
L_.str.4: ## @.str.4
.asciz "4"
L_.str.5: ## @.str.5
.asciz "5"
L_.str.6: ## @.str.6
.asciz "6"
L_.str.7: ## @.str.7
.asciz "7"
L_.str.8: ## @.str.8
.asciz "8"
L_.str.9: ## @.str.9
.asciz "9"
L_.str.10: ## @.str.10
.asciz "10"
.subsections_via_symbols
After that, I cleaned the code up, removing what I think are macOS-only settings. The code still works.
.text
.globl _main ## -- Begin function main
_main: ## @main
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
subq $8, %rsp
leaq L_.str.10(%rip), %r10
leaq L_.str.9(%rip), %r11
leaq L_.str.8(%rip), %r14
leaq L_.str.7(%rip), %r15
leaq L_.str.6(%rip), %rbx
leaq L_.str(%rip), %rdi
leaq L_.str.1(%rip), %rsi
leaq L_.str.2(%rip), %rdx
leaq L_.str.3(%rip), %rcx
leaq L_.str.4(%rip), %r8
leaq L_.str.5(%rip), %r9
movl $0, %eax
pushq %r10
pushq %r11
pushq %r14
pushq %r15
pushq %rbx
callq _printf
addq $48, %rsp
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.data
L_.str: ## @.str
.asciz "%s %s %s %s %s %s %s %s %s %s"
L_.str.1: ## @.str.1
.asciz "1 "
L_.str.2: ## @.str.2
.asciz "2"
L_.str.3: ## @.str.3
.asciz "3"
L_.str.4: ## @.str.4
.asciz "4"
L_.str.5: ## @.str.5
.asciz "5"
L_.str.6: ## @.str.6
.asciz "6"
L_.str.7: ## @.str.7
.asciz "7"
L_.str.8: ## @.str.8
.asciz "8"
L_.str.9: ## @.str.9
.asciz "9"
L_.str.10: ## @.str.10
.asciz "10"
I understand that at the beginning of the code, that the base pointer is pushed onto the stack which is then copied into the stack pointer for later use.
The leaq is then loading each string into each register that will be used as an argument to printf.
What I want to know is why are registers r10 r11 r14 and r15 before the first argument is loaded into memory and that registers rsi rdx rcx r8 and 'r9' loaded into memory after the first argument? Also why are r14 and r15 used instead of r12 and r13?
Also why is 8 added and subtracted from the stack pointer in this case and does it matter which order the registers are pushed and popped?
I hope all the subquestions are related to this question; if not, let me know. Also call me out on any knowledge I may be getting wrong. This is what I have learnt by converting C to assembly.

First, it looks like you are using unoptimized code so things are taking place that do not need to.
Look at the register state right before the call to printf that are not pushed on the stack:
rdi = format string
rsi = 1
rdx = 2
rcx = 3
r8 = 4
r9 = 5
Then 6 .. 10 are pushed on the stack in reverse order.
That should give you an idea of the calling convention. The first six parameters go through registers. The remaining parameters get passed on the stack.
What I want to know is why are registers r10 r11 r14 and r15 before the first argument is loaded into memory and that registers rsi rdx rcx r8 and 'r9' loaded into memory after the first argument?
That's just the order the compiler chose.
Also why are r14 and r15 used instead of r12 and r13?
Again, that's what the compiler chose. Note these are just being used as scratch locations. If the code were optimized, it is likely fewer registers would be used.
Also why is 8 added and subtracted from the stack pointer in this case and does it matter which order the registers are pushed and popped?
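It is alignment bookkeeping the compiler generates. The stack pointer must be a multiple of 16 at the call, and the five pushed arguments take only 40 bytes, so subq $8 adds 8 bytes of padding; addq $48 then removes the arguments and the padding together. The pushq %rax and addq $8 pair pads the saved-register area in the same way. The order does matter: saved registers must be popped in the reverse of the order they were pushed, and stack arguments are pushed last-to-first so that the seventh argument ends up at the lowest address.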
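The register order described above can be summarised in a short C sketch. The function names are hypothetical and the table covers only integer and pointer arguments under the System V AMD64 convention; floating-point arguments use a separate set of registers that is not modelled here.

```c
#include <stddef.h>
#include <stdio.h>

/* Where the n-th integer or pointer argument (0-based) is passed:
   a register name for the first six, "stack" from the seventh on. */
const char *int_arg_location(size_t n)
{
    static const char *const regs[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    return n < sizeof regs / sizeof regs[0] ? regs[n] : "stack";
}

/* Print the location of each of `count` arguments, one per line. */
void print_arg_locations(size_t count)
{
    for (size_t i = 0; i < count; i++)
        printf("argument %zu -> %s\n", i + 1, int_arg_location(i));
}
```

Calling print_arg_locations(11) lists the format string and the first five strings in registers and the remaining five on the stack, which matches the five pushq instructions in the compiler output.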

Does _printf require pre-additional space on the stack for it to work? [duplicate]

I know that OS X is 16 byte stack align, but I don't really understand why it is causing an error here.
All I am doing here is to pass an object size (which is 24) to %rdi, and call malloc. Does this error mean I have to ask for 32 bytes ?
And the error message is:
libdyld.dylib`stack_not_16_byte_aligned_error:
-> 0x7fffc12da2fa <+0>: movdqa %xmm0, (%rsp)
0x7fffc12da2ff <+5>: int3
libdyld.dylib`_dyld_func_lookup:
0x7fffc12da300 <+0>: pushq %rbp
0x7fffc12da301 <+1>: movq %rsp, %rbp
Here is the code:
Object_copy:
pushq %rbp
movq %rbp, %rsp
subq $8, %rsp
movq %rdi, 8(%rsp) # save self address
movq obj_size(%rdi), %rax # get object size
imul $8, %rax
movq %rax, %rdi
callq _malloc <------------------- error in this call
# rsi old object address
# rax new object address
# rdi object size, mutiple of 8
# rcx temp reg
# copy object tag
movq 0(%rsi), %rcx
movq %rcx, 0(%rax)
# set rdx to counter, starting from 8
movq $8, %rdx
# add 8 to object size, since we are starting from 8
addq $8, %rdi
start_loop:
cmpq %rdx, %rdi
jle end_loop
movq (%rdx, %rsi, 1), %rcx
movq %rcx, (%rdx, %rax, 1)
addq $8, %rdx
jmp start_loop
end_loop:
leave
ret
Main_protoObj:
.quad 5 ; object tag
.quad 3 ; object size
.quad Main_dispatch_table ; dispatch table
_main:
leaq Main_protoObj(%rip), %rdi
callq Object_copy # copy main proto object
subq $8, %rsp # save the main object on the stack
movq %rax, 8(%rsp)
movq %rax, %rdi # set rdi point to SELF
callq Main_init
callq Main_main
addq $8, %rsp # restore stack
leaq _term_msg(%rip), %rax
callq _print_string

Like you said, macOS requires 16-byte stack alignment: the stack pointer (%rsp) must be a multiple of 16 at the moment a call instruction executes.
When the stack is misaligned, library code that uses aligned SSE instructions on the stack (such as the movdqa %xmm0, (%rsp) in your error message) faults, which usually shows up as a segmentation fault.
Before you call a routine in your code, you need to make sure that your stack is aligned correctly; in this case, meaning that the stack pointer register is divisible by 16.
subq $8, %rsp # stack is misaligned by 8 bytes
movq %rdi, 8(%rsp) #
movq obj_size(%rdi), %rax #
imul $8, %rax #
movq %rax, %rdi #
callq _malloc # stack is still misaligned when this is called
To fix this, subtract 16 from %rsp instead of 8, and keep the store inside the area you just reserved:
subq $16, %rsp # stack is still aligned
movq %rdi, 8(%rsp) # 16(%rsp) would overwrite the saved %rbp
... #
callq _malloc # stack is still aligned when this is called, good
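The arithmetic behind that fix can be written down as a small C sketch. Both function names are hypothetical; they model the rule only and do not touch the real stack. Note that the rule concerns the stack pointer, not the size passed to malloc, so the 24-byte request itself is fine.

```c
#include <stdint.h>

/* Round a local-variable area up to the next multiple of 16 bytes,
   so that subtracting it from an aligned %rsp keeps %rsp aligned. */
uint64_t round_up_16(uint64_t bytes)
{
    return (bytes + 15) & ~(uint64_t)15;
}

/* Nonzero when a stack pointer value is acceptable at a call instruction. */
int is_call_aligned(uint64_t rsp)
{
    return (rsp & 15) == 0;
}
```

With 8 bytes of locals, round_up_16 gives 16, which is the subq $16 used above.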

Reverse Array X86 AT&T Syntax

I'm writing a program in Assembly that has 2 arrays declared at the beginning and 3 functions, which are:
printQArray(int size, long *array1)
invertArray(int size, long *array1)
multQuad(int size, long *array1, long *array2)
Now the program takes these arrays and prints the products of the 2 arrays at each corresponding position.
Then it prints Array1.
Then it prints Array1 Reversed.
Then it should take the reversed array and call the multiplication function again and print the product of the positions of 1st array reversed and the 2nd array which never changes.(Array values in source code)
I'm having problems after I reverse the array and attempt to multiply the reversed 1st array and 2nd array.
The following is the output of my program
Products
200
-925
1386
-2928
9375
64350
Elements in QArray1
10
25
33
48
125
550
Elements in QArray1
550
125
48
33
25
10
Products
0
-1036
-31584
44896
0
0
So this last output is clearly not the products of array1 reversed and array2
As you can see in my code below(PS I have already tried movq in place of leaq) my reversed array is being returned in %rax and I put it into %rcx
This is all fine and dandy because I successfully print out a reversed array below
#PRINT Inverted ARRAY1 void printArray(int size, long *array1);
movq $sizeQArrays, %rax
movq (%rax), %rdi #sizeQArrays to %rdi (parameter 1)
leaq (%rcx), %rsi #put reversed array into rsi
call printQArray
movq $0, %rax
However, once I call multQuads again I get weird results. I'm confident my reversed array isn't getting moved into the register properly. The original array was a constant and thus simple, but I think pushing all the values onto the stack and popping them back off in reverse order has changed the structure somehow. Or maybe I have a typo. Source code below:
.section .rodata
.LC1: .string "Products\n"
.LC3: .string "Elements in QArray1\n"
.LC4: .string "%i\n"
.LC5: .string "\n"
.data
sizeQArrays:
.quad 6
QArray1:
.quad 10
.quad 25
.quad 33
.quad 48
.quad 125
.quad 550
QArray2:
.quad 20
.quad -37
.quad 42
.quad -61
.quad 75
.quad 117
.globl main
.type main, @function
.globl printQArray
.type printQArray, @function
.globl multQuads
.type multQuads, @function
.globl invertArray
.type invertArray, @function
.text
main:
pushq %rbp #stack housekeeping
movq %rsp, %rbp
#order of calls: quad print invert print quad
#MULTQUADS void multQuads(int size, long *array1, long *array2)
movq $sizeQArrays, %rax
movq (%rax), %rdi #1st param
movq $QArray1, %rsi #2nd Param
movq $QArray2, %rdx #3rd Param
call multQuads
movq $0, %rax
#PRINT ARRAY1 void printArray(int size, long *array1);
movq $sizeQArrays, %rax
movq (%rax), %rdi #sizeQArrays to %rdi (parameter 1)
movq $QArray1, %rsi #address of QArray1 to %rsi (parameter 2)
#purposely not pushing anything because I have not put anything in registers
#except parameters and I will be putting new values there after return
call printQArray
movq $0, %rax
#InvertArray void invertArray(long size, long *array1)
movq $sizeQArrays, %rax
movq (%rax), %rdi #1st param
movq $QArray1, %rsi #2nd Param
call invertArray
leaq (%rax), %rcx #put inverted array into %rcx
movq $0, %rax #set %rax back to 0
#PRINT Inverted ARRAY1 void printArray(int size, long *array1);
movq $sizeQArrays, %rax
movq (%rax), %rdi #sizeQArrays to %rdi (parameter 1)
movq %rcx, %rsi #put reversed array into rsi
call printQArray
movq $0, %rax
#MULTQUADS W/ REVERSED ARRAY void multQuads(int size, long *array1, long *array2);
movq $sizeQArrays, %rax
movq (%rax), %rdi #1st param
movq %rcx, %rsi #inversed array as 2nd param
movq $QArray2, %rdx #3rd Param
call multQuads
movq $0, %rax
#END of main
leave
ret
.size main, .-main
#printQArray prints an array of 8 byte values
# the size of the array is passed in %rdi,
# a pointer to the beginning of the array is passed in %rsi
printQArray:
pushq %rbp
movq %rsp, %rbp
pushq %r12
pushq %r13
pushq %rbx
movq %rdi, %r12 #copy size to %r12
movq %rsi, %r13 #copy array pointer to %r13
# print array title
movq $.LC3, %rdi
movq $0, %rax
# purposely not pushing any caller save registers.
callq printf
movq $0, %rbx #array index
printQArrayLoop:
movq (%r13, %rbx, 8), %rsi #element of array in 2nd parameter register
movq $.LC4, %rdi #format literal in 1st parameter register
movq $0, %rax
#purposely not pushing any caller save registers
callq printf
incq %rbx #increment index
decq %r12 #decrement count
jle printQArrayExit
jmp printQArrayLoop
printQArrayExit:
# print final \n
movq $.LC5, %rdi #parameter 1
movq $0, %rax
call printf
popq %rbx
popq %r13
popq %r12
leave
ret
.size printQArray, .-printQArray
multQuads:
pushq %rbp
movq %rsp, %rbp
pushq %r12
pushq %r13
pushq %r14
pushq %rbx
movq %rdi, %r12 #copy size to %r12
movq %rsi, %r13 #copy array1 pointer to %r13
movq %rdx, %r14 #copy array2 pointer to %r14
# print "Products"
movq $.LC1, %rdi
movq $0, %rax
call printf
movq $0, %rbx #array index
multQuadLoop:
movq (%r13, %rbx, 8), %rsi #element of array in 2nd parameter register
movq (%r14, %rbx, 8), %rdx #element of array in 3rd parameter register
movq $.LC4, %rdi #format literal in 1st parameter register
imulq %rdx, %rsi #insert product into second parameter
movq $0, %rax
callq printf
incq %rbx #increment index
decq %r12 #decrement count
jle multQuadExit
jmp multQuadLoop
multQuadExit:
# print final \n
movq $.LC5, %rdi #parameter 1
movq $0, %rax
call printf
popq %rbx
popq %r13
popq %r12
popq %r14
leave
ret
.size multQuad, .-multQuad
invertArray:
pushq %rbp
movq %rsp, %rbp
pushq %r12 #size
pushq %r13 #array pointer
pushq %rbx #array index
pushq %r9 #holder
pushq %r10 #holder
push %r14
movq %rdi, %r12 #copy size to %r12
movq %rdi, %r9
movq %rsi, %r13 #copy array pointer to %r13
movq $0, %rbx #array index
movq $0, %r10
invertArrayLoop:
pushq (%r13, %rbx, 8) #push elements of array onto stack
incq %rbx #increment index
decq %r12 #decrement count
jle reverseArray
jmp invertArrayLoop
reverseArray:
popq %r14
movq %r14, (%r13, %r10, 8)
incq %r10
decq %r9
subq %r12, %r9
jle invertArrayExit
jmp reverseArray
invertArrayExit:
movq %r13, %rax
popq %r14
popq %r10
popq %r9
popq %rbx
popq %r13
popq %r12
leave
ret
.size invertArray, .-invertArray
If the multQuads function works the 1st time and I can print out the reversed array properly, then I imagine the problem must be right before I'm calling multQuads and setting the registers.

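I was losing the array in printQArray: %rcx is a caller-saved register, so the printf calls inside printQArray overwrote the pointer I had stored there before I passed it to multQuads.
It was just one line!!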
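For checking the assembly against known-good numbers, here is a C sketch of what the last two steps are meant to compute. The names invert_array and mult_quads are hypothetical stand-ins for the assembly routines, the reversal swaps from both ends rather than using the push/pop approach from the question, and the products are written to an output array instead of being printed.

```c
#include <stddef.h>

/* Reverse n elements in place by swapping from both ends. */
void invert_array(long *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t lo = 0, hi = n - 1; lo < hi; lo++, hi--) {
        long tmp = a[lo];
        a[lo] = a[hi];
        a[hi] = tmp;
    }
}

/* Element-wise products of a and b, stored in out. */
void mult_quads(size_t n, const long *a, const long *b, long *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```

With QArray1 reversed to 550, 125, 48, 33, 25, 10 and QArray2 unchanged, the second "Products" block should read 11000, -4625, 2016, -2013, 1875, 1170.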

Manual Assembly vs GCC

Disclaimer: I'm just starting out with x86 assembly. I did learn a bit of SPIM at university, but that's hardly worth mentioning.
I thought I'd start with what's probably the simplest function in libc, abs(). Pretty straightforward in C:
long myAbs(long j) {
return j < 0 ? -j : j;
}
My version in assembly:
.global myAbs
.type myAbs, @function
.text
myAbs:
test %rdi, %rdi
jns end
negq %rdi
end:
movq %rdi, %rax
ret
(This doesn't work for 32bit integers, probably because RAX is a 64bit register and the sign is probably at the wrong position - I have to investigate that).
Now here's what gcc does (gcc -O2 -S myAbs.c):
.file "myAbs.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl myAbs
.type myAbs, @function
myAbs:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $4144, %rsp
orq $0, (%rsp)
addq $4128, %rsp
movq %rdi, %rdx
sarq $63, %rdx
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movq %rdi, %rax
xorq %rdx, %rax
subq %rdx, %rax
movq -8(%rbp), %rcx
xorq %fs:40, %rcx
jne .L5
leave
.cfi_remember_state
.cfi_def_cfa 7, 8
ret
.L5:
.cfi_restore_state
call __stack_chk_fail@PLT
.cfi_endproc
.LFE0:
.size myAbs, .-myAbs
.section .text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident "GCC: (Gentoo Hardened 5.1.0 p1.2, pie-0.6.3) 5.1.0"
.section .note.GNU-stack,"",@progbits
Why this big difference? GCC produces substantially more instructions. I can't imagine that this won't be slower than my code.
Am I missing something? Or am I doing something seriously wrong here?

For those who wonder where the generated code comes from, first note that when GCC compiles myAbs with stack protection it transforms it into this form
long myAbs(long j) {
uintptr_t canary = __stack_chk_guard;
register long result = j < 0 ? -j : j;
if ( (canary = canary ^ __stack_chk_guard) != 0 )
__stack_chk_fail();
return result;
}
The code to simply perform j < 0 ? -j : j; is
movq %rdi, %rdx ;RDX = j
movq %rdi, %rax ;RAX = j
sarq $63, %rdx ;RDX = 0 if j >=0, 0fff...ffh if j < 0
xorq %rdx, %rax ;Note: x xor 0ff...ffh = Not X, x xor 0 = x
;RAX = j if j >=0, ~j if j < 0
subq %rdx, %rax ;Note: 0fff...ffh = -1
;RAX = j+0 = j if j >= 0, ~j+1 = -j if j < 0
;~j+1 = -j in two's complement
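The same shift, xor and subtract sequence can be written in C as a sketch. The name branchless_abs is hypothetical, and the code assumes that right-shifting a negative long is an arithmetic shift, which is what GCC and Clang do but is implementation-defined in the C standard.

```c
#include <limits.h>

/* Branch-free absolute value, mirroring sarq / xorq / subq.
   The result is undefined for LONG_MIN, as with abs(). */
long branchless_abs(long j)
{
    /* mask is 0 when j >= 0 and -1 (all bits set) when j < 0 */
    long mask = j >> (sizeof(long) * CHAR_BIT - 1);
    /* j ^ 0 = j; j ^ -1 = ~j, and ~j - (-1) = ~j + 1 = -j */
    return (j ^ mask) - mask;
}
```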
Analyzing the generated code we get
pushq %rbp
movq %rsp, %rbp ;Standard prologue
subq $4144, %rsp ;Allocate slightly more than 4 KiB
orq $0, (%rsp) ;Perform a useless RW operation to test if there is enough stack space for __stack_chk_fail
addq $4128, %rsp ;This leaves 16 bytes allocated for local vars
movq %rdi, %rdx ;See above
sarq $63, %rdx ;See above
movq %fs:40, %rax ;Get the canary
movq %rax, -8(%rbp) ;Save it as a local var
xorl %eax, %eax ;Clear it
movq %rdi, %rax ;See above
xorq %rdx, %rax ;See above
subq %rdx, %rax ;See above
movq -8(%rbp), %rcx ;RCX = Canary
xorq %fs:40, %rcx ;Check if equal to the original value
jne .L5 ;If not fail
leave
ret
.L5:
call __stack_chk_fail@PLT ;__stack_chk_fail is noreturn
So all the extra instructions are for implementing the Stack Smashing Protector.
Thanks to FUZxxl for pointing out the use of the first instructions after the prologue.

Many of the instructions at the beginning set up the stack frame and save the frame pointer (something which you are not doing). It seems like there is some stack protection going on. Perhaps you could tune your compiler settings to get rid of some overhead.
Perhaps adding flags to your compiler such as -fno-stack-protector could minimise this difference.
Yes this probably is slower than your handwritten assembly, but offers much more protection and is probably worth the slight overhead.
As for why the stack protection still exists even though it is a leaf function see here.

Segfault inline assembly

I'm trying to create a green thread implementation based off this tutorial. However, my switch function is giving me a segfault because the code to load the registers is not run at the end of the function. Here is my code:
void ThreadSwitch(Thread in, Thread out) {
if (!out && !in) {
return;
}
if (out) {
// save registers for out
}
if (in) {
SetCurrentThread(in);
mtx_lock(&in->mutex);
uint64_t rsp = in->cpu.rsp;
uint64_t r15 = in->cpu.r15;
uint64_t r14 = in->cpu.r14;
uint64_t r13 = in->cpu.r13;
uint64_t r12 = in->cpu.r12;
uint64_t rbx = in->cpu.rbx;
uint64_t rbp = in->cpu.rbp;
mtx_unlock(&in->mutex);
asm volatile("mov %[rsp], %%rsp\n"
"mov %[r15], %%r15\n"
"mov %[r14], %%r14\n"
"mov %[r13], %%r13\n"
"mov %[r12], %%r12\n"
"mov %[rbx], %%rbx\n"
"mov %[rbp], %%rbp\n" : : [rsp] "r"(rsp), [r15] "r"(r15), [r14] "r"(r14), [r13] "r"(r13), [r12] "r"(r12), [rbx] "r"(rbx), [rbp] "r"(rbp));
}
}
Xcode says that the inline assembly is causing a segfault, but my lldb disassembly looks like this (you can ignore 95% of it, just provided for context):
0x1000f88b4: movq -0x8(%rbp), %rdi
0x1000f88b8: callq 0x1000f83a0 ; SetCurrentThread at thread.cc:21
0x1000f88bd: movq -0x8(%rbp), %rdi
0x1000f88c1: addq $0x50, %rdi
0x1000f88c8: callq 0x1000f7b80 ; mtx_lock at tct.c:106
0x1000f88cd: movq -0x8(%rbp), %rdi
0x1000f88d1: movq (%rdi), %rdi
0x1000f88d4: movq %rdi, -0x18(%rbp)
0x1000f88d8: movq -0x8(%rbp), %rdi
0x1000f88dc: movq 0x8(%rdi), %rdi
0x1000f88e0: movq %rdi, -0x20(%rbp)
0x1000f88e4: movq -0x8(%rbp), %rdi
0x1000f88e8: movq 0x10(%rdi), %rdi
0x1000f88ec: movq %rdi, -0x28(%rbp)
0x1000f88f0: movq -0x8(%rbp), %rdi
0x1000f88f4: movq 0x18(%rdi), %rdi
0x1000f88f8: movq %rdi, -0x30(%rbp)
0x1000f88fc: movq -0x8(%rbp), %rdi
0x1000f8900: movq 0x20(%rdi), %rdi
0x1000f8904: movq %rdi, -0x38(%rbp)
0x1000f8908: movq -0x8(%rbp), %rdi
0x1000f890c: movq 0x28(%rdi), %rdi
0x1000f8910: movq %rdi, -0x40(%rbp)
0x1000f8914: movq -0x8(%rbp), %rdi
0x1000f8918: movq 0x30(%rdi), %rdi
0x1000f891c: movq %rdi, -0x48(%rbp)
0x1000f8920: movq -0x8(%rbp), %rdi
0x1000f8924: addq $0x50, %rdi
0x1000f892b: movl %eax, -0x54(%rbp)
0x1000f892e: callq 0x1000f7de0 ; mtx_unlock at tct.c:264
0x1000f8933: movq -0x18(%rbp), %rdi ; beginning of inline asm
0x1000f8937: movq -0x20(%rbp), %rcx
0x1000f893b: movq -0x28(%rbp), %rdx
0x1000f893f: movq -0x30(%rbp), %rsi
0x1000f8943: movq -0x38(%rbp), %r8
0x1000f8947: movq -0x40(%rbp), %r9
0x1000f894b: movq -0x48(%rbp), %r10
0x1000f894f: movq %rdi, %rsp
0x1000f8952: movq %rcx, %r15
0x1000f8955: movq %rdx, %r14
0x1000f8958: movq %rsi, %r13
0x1000f895b: movq %r8, %r12
0x1000f895e: movq %r9, %rbx
0x1000f8961: movq %r10, %rbp ; end of inline asm
-> 0x1000f8964: movl %eax, -0x58(%rbp)
0x1000f8967: addq $0x60, %rsp
0x1000f896b: popq %rbp
0x1000f896c: retq
The segfault happens when it tries to access stuff back on the stack, which makes sense because it just switched out the stack. But why is the compiler inserting this? The compiler also stores %eax on the stack at 0x1000f892b. Is the compiler opening up a register? Because it doesn't use %rax in the inline asm. Is there a workaround?
This is using Apple LLVM version 6.0 (clang-600.0.57) on OSX 10.10.2, if that's any help.
Thanks in advance.

I strongly advise you not to write programs that depend on undefined behaviour.
Jumps into and out of inline assembly are not permitted, as the compiler can't analyse control flow it doesn't know about. Upon thread creation you jump into the asm statement from nowhere and then leave it. To avoid these implicit jumps you need to save and restore the registers, including %rip, in the same asm statement.
All registers that an asm statement alters must be listed as outputs or clobbers. For a thread switch routine that is all the registers whose values are not saved, as they are altered by the other threads. If you do not do so, the compiler will incorrectly assume that they are not altered.
An asm statement must avoid overwriting its inputs before they are used. In your code there is nothing prohibiting the compiler from storing the variable r12 in the register %r14.
Your lock is either pointless or inadequate.
It is much simpler to write your function entirely in assembly, like in the tutorial you cite.
