My x86 assembly code loop is throwing a seg fault

I'm doing an x86 assembly project for class and we're supposed to implement a heap of personnel records. The call heap_swap line is giving me trouble. If I uncomment it, it throws a seg fault. However, the heap_swap function works fine no matter how I test it. I've really racked my brain and would appreciate any help anyone can give!
sift_up1:
# ecx = i
# rdx = address to heap
# r9 = address to heap[i]
# rax = offset of id
# r8 = address for heap[i].id_number
# r10d = heap[i].id_number
# r11d = index of parent
# rdi = address of heap[parent].id_number
# ebx = heap[parent].id_number
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
pushq %rbx #a section to keep track of all the callee saved registers
pushq %rdi #that need to be restored
leaq offset_of_id(%rip), %rax #put the id offset into a register
leaq heap(%rip), %rdx
jmp LOOP_TOP
LOOP_TOP:
cmpl $0, %ecx #Check if i=0, if so jump to exit loop
je EXIT_LOOP
movl $8, %r9d
imull %ecx, %r9d #finding heap[i]
addq (%rdx), %r9
movq %r9, %r8 #r8 contains heap[i]
addq (%rax), %r8 #add id offset, it becomes heap[i].id_number
movl (%r8), %r10d #dereference id_number and place it into r10d
movl %ecx, %r11d #find the index of the parent of i
subl $1, %r11d
shrl $1, %r11d
movl $8, %edi
imull %r11d, %edi
addq (%rdx), %rdi #rdi holds the address of heap[parent]
addq (%rax), %rdi #rdi holds the address of heap[parent].id_number
movl (%rdi), %ebx #ebx holds the heap[parent].id_number
cmpl %ebx, %r10d
jle EXIT_LOOP
pushq %rdx
movq %r11, %rdx #put the index into the correct parameter register
# call heap_swap #call heap_swap
popq %rdx
movl %r11d, %ecx #modify i
jmp LOOP_TOP #jump to loop top
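A general point worth keeping in mind around the call (a sketch under assumed conventions, not a diagnosis of the crash): heap_swap, like any System V function, is free to clobber %rax, %rcx, %rdx, %rsi, %rdi and %r8-%r11, so every value this loop still needs after the call has to be saved around it, and %rsp should be a multiple of 16 at the call itself. Assuming, purely for illustration, that heap_swap takes its two indexes in %edi and %esi, the call site could look like:
pushq %rax # the pointers loaded before the loop live in call-clobbered registers
pushq %rdx # so they must be saved explicitly across the call
pushq %r11 # the parent index is still needed afterwards (it becomes the new i)
subq $8, %rsp # three pushes = 24 bytes; pad so the total stays a multiple of 16
movl %ecx, %edi # 1st argument: i (heap_swap's signature is assumed here)
movl %r11d, %esi # 2nd argument: parent index (assumed)
call heap_swap
addq $8, %rsp # drop the alignment padding
popq %r11
popq %rdx
popq %rax
movl %r11d, %ecx # i = parent, as in the original loop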

Related

How to declare a local array on the stack in x64 assembly?

I'm trying to declare an array of "quadwords" on the stack in my x64 assembly program. I know how to do this by declaring the array in the .data segment, but I'd like to make the array local to main if possible. I know that I could probably do this with malloc and dynamic memory allocation, as well, but I'd first like to see if it's even possible to do this on the stack. My issue is that I've declared enough memory on the stack to store the array (along with some extra space just for good measure). I store initial values into the array elements, but I don't know how to 'iterate' through the indices. I'd like to sum all the values in this array, just for practice. I tried using movq to retrieve the element, offset from the array's starting index, but I can't use negative indices in 'scaled index' mode.
...
subq $128, %rsp
movq $100, -8(%rbp) # arr[0] = 100
movq $79, -16(%rbp) # arr[1] = 79
movq $85, -24(%rbp) # arr[2] = 85
movq $62, -32(%rbp) # arr[3] = 62
movq $91, -40(%rbp) # arr[4] = 91
movq $0, -48(%rbp) # sum = 0
movq $5, %rcx # i = 5
loop:
cmp $1, %rcx
jz done
movq (%rbp, %rcx, 8), %rax # I believe this line may be wrong because the array starts at index -8(%rbp), right?
addq %rax, -48(%rbp)
subq $1, %rcx
jmp loop
...
In this answer, I show several ways to change your code to do what you want. The first one is a minimal change to your original code to get it to work; the final one is the way I would write it.
The first example only changes the starting and ending value of rcx. It leaves the array on the stack in the unusual top-down order, and iterates over the array from the end to the beginning.
...
subq $128, %rsp
movq $100, -8(%rbp) # arr[0] = 100
movq $79, -16(%rbp) # arr[1] = 79
movq $85, -24(%rbp) # arr[2] = 85
movq $62, -32(%rbp) # arr[3] = 62
movq $91, -40(%rbp) # arr[4] = 91
movq $0, -48(%rbp) # sum = 0
movq $-5, %rcx
loop:
cmp $0, %rcx
jz done
movq (%rbp, %rcx, 8), %rax
addq %rax, -48(%rbp)
addq $1, %rcx
jmp loop
The next example places the array in memory in the usual way, with index 0 at the lowest address, and iterates from index 0 to 4. Note the offset on the load instruction to cause index 0 to access rbp-40.
...
subq $128, %rsp
movq $100, -40(%rbp) # arr[0] = 100
movq $79, -32(%rbp) # arr[1] = 79
movq $85, -24(%rbp) # arr[2] = 85
movq $62, -16(%rbp) # arr[3] = 62
movq $91, -8(%rbp) # arr[4] = 91
movq $0, -48(%rbp) # sum = 0
movq $0, %rcx # i = 0
loop:
cmp $5, %rcx
jz done
movq -40(%rbp, %rcx, 8), %rax
addq %rax, -48(%rbp)
addq $1, %rcx
jmp loop
The final example changes a few other things to match the way I would write it:
...
subq $128, %rsp
movq $100, -40(%rbp) # arr[0] = 100
movq $79, -32(%rbp) # arr[1] = 79
movq $85, -24(%rbp) # arr[2] = 85
movq $62, -16(%rbp) # arr[3] = 62
movq $91, -8(%rbp) # arr[4] = 91
xor %eax, %eax # sum = 0
xor %ecx, %ecx # i = 0
loop:
addq -40(%rbp, %rcx, 8), %rax
add $1, %ecx
cmp $5, %ecx
jb loop
This version keeps the sum in a register instead of in memory. It zeroes the registers with the usual xor idiom, relying on the fact that writing the 32-bit half of a 64-bit register clears the upper half.
And it puts the loop condition at the bottom of the loop instead of the top.
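For reference, here is a minimal self-contained version of that final loop which assembles and runs as-is; the done label, the prologue/epilogue, and returning the sum as the process exit status are additions made only so the fragment is complete:
.text
.globl main
main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp # 40 bytes for the array, rounded up so %rsp stays 16-byte aligned
movq $100, -40(%rbp) # arr[0] = 100
movq $79, -32(%rbp) # arr[1] = 79
movq $85, -24(%rbp) # arr[2] = 85
movq $62, -16(%rbp) # arr[3] = 62
movq $91, -8(%rbp) # arr[4] = 91
xor %eax, %eax # sum = 0
xor %ecx, %ecx # i = 0
loop:
addq -40(%rbp, %rcx, 8), %rax # sum += arr[i]
add $1, %ecx
cmp $5, %ecx
jb loop
done:
movq %rbp, %rsp # sum (417) is in %rax and becomes main's return value
popq %rbp
ret
Built with, for example, gcc sum.s -o sum, running ./sum; echo $? should print 161, i.e. 417 mod 256, since only the low byte of the exit status survives.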

using printf before and inside a loop x86-64 assembly

I'm having trouble figuring out how to use printf correctly in this function. So the function is called multInts and is supposed to multiply the first element of the first array with the first element of the second array and continue through the whole array. However, the lab instructions specify that I can't call printf in the main function. So, I need to print out the word "Products:\n" and then in each new line after that, print out the product. I don't know how to use printf within the loop, however. The instructor said that we should "call printf in the loop after calculating product" and also to "save and restore caller-save registers before calling printf," but I'm not sure what that means.
Here's what my code looks like so far:
.file "lab4.s"
.section .rodata
.LC0:
.string "Products: \n"
.LC1:
.string "%i \n"
.data
sizeIntArrays:
.long 5
sizeShortArrays:
.word 4
intArray1:
.long 10
.long 25
.long 33
.long 48
.long 52
intArray2:
.long 20
.long -37
.long 42
.long -61
.long -10
##### MAIN FUNCTION
.text
.globl main
.type main,#function
main:
pushq %rbp
movq %rsp, %rbp
#pass parameters and call other functions
movl sizeIntArrays, %edi #move size to registers for 1st parameter
leaq intArray1, %rsi #load effective address of intArray1 to register rsi
leaq intArray2, %rdx #load effective address of intArray2 to register rdx
call multInts #call multInts function
movq $0, %rax #return 0 to caller
movq %rbp, %rsp
popq %rbp
ret
.size main,.-main
##### MULTINTS
.globl multInts
.type multInts,#function
multInts:
pushq %rbp
movq %rsp, %rbp
#add code here for what the functions should do
movq $0, %r8 #initialize index for array access in caller save reg
movq $0, %rcx #initialize 8 byte caller save result reg
loop0:
cmpl %r8d, %edi #compare index to size
je exit0 #exit if equal
movslq (%rsi,%r8,4),%rax # Load a long into RAX
movslq (%rdx,%r8,4),%r11 # Load a long into R11
imulq %r11, %rax # RAX *= R11
addq %rax, %rcx # RCX += RAX
incq %r8 #increment index
jmp loop0
exit0:
movq $.LC0, %rdi
movq %rcx, %rsi
movq $0, %rax
call printf
movq %rbp, %rsp
popq %rbp
ret
.size multInts,.-multInts
What I've tried to do is just move the printf instruction to before the loop, but it gives me a segmentation fault when trying to run the loop because %rdi and %rsi contain the addresses of the arrays that need to be used in the loop. How do I get around that and which registers should I use? Also, how do I call printf within the loop?
The output should look something like this:
Products:
200
-925
1386
-2928
-520
Assume that printf clobbers all the call-clobbered registers (What registers are preserved through a linux x86-64 function call), and use different ones for anything that needs to survive from one iteration of the loop to the next.
Look at compiler output for an example: write a version of your loop in C and compile it with -Og.
Obviously you need to move the instructions that set up the args in registers (like the format string) along with the call printf.
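One way to apply that advice (a sketch only: the use of %rbx/%r12-%r14, the label names, and the omission of the "Products:" header call are my choices, not part of the assignment): keep everything the loop needs in call-preserved registers, so nothing has to be pushed or popped around printf inside the loop:
multInts:
pushq %rbp
movq %rsp, %rbp
pushq %rbx # call-preserved: loop index
pushq %r12 # call-preserved: pointer to intArray1
pushq %r13 # call-preserved: pointer to intArray2
pushq %r14 # call-preserved: element count
movq %rsi, %r12
movq %rdx, %r13
movl %edi, %r14d
xorl %ebx, %ebx
mloop:
cmpl %r14d, %ebx # compare index to size
je mdone
movslq (%r12,%rbx,4), %rax
movslq (%r13,%rbx,4), %rsi
imulq %rax, %rsi # product goes straight into the 2nd printf argument
movq $.LC1, %rdi # format string
movq $0, %rax # no vector registers used
call printf # clobbers only call-clobbered registers, which hold nothing we need
incl %ebx
jmp mloop
mdone:
popq %r14
popq %r13
popq %r12
popq %rbx
popq %rbp
ret
The four pushes after the frame setup also keep %rsp 16-byte aligned at each call printf.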
The easiest way to protect a register from being clobbered by a subroutine is to push it. According to the System V ABI calling convention, printf may change any register except RBX, RBP, RSP, and R12–R15. The registers you need to preserve are RAX, RDX, RSI, RDI, R8 and R11 (RCX is no longer needed), so push them before the call to printf and pop them afterwards:
pushq %rax
pushq %rdx
pushq %rsi
pushq %rdi
pushq %r8
pushq %r11
movq $.LC1, %rdi
movq %rax, %rsi
movq $0, %rax
call printf
popq %r11
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rax
Now you can copy the block into the loop. For each printf call, think about which registers actually need to be preserved at that point:
...
multInts:
pushq %rbp
movq %rsp, %rbp
#add code here for what the functions should do
pushq %rdx # Preserve registers
pushq %rdi
pushq %rsi
subq $8, %rsp # Three pushes above (24 bytes): pad so %rsp is a multiple of 16 at the call
movq $.LC0, %rdi # Format string (no further values)
movq $0, %rax # No vector registers used
call printf # Call C function
addq $8, %rsp # Drop the alignment padding
popq %rsi # Restore registers
popq %rdi
popq %rdx
movq $0, %r8 #initialize index for array access in caller save reg
loop0:
cmpl %r8d, %edi #compare index to size
je exit0 #exit if equal
movslq (%rsi,%r8,4),%rax # Load a long into RAX
movslq (%rdx,%r8,4),%r11 # Load a long into R11
imulq %r11, %rax # RAX *= R11
pushq %rax # Preserve registers
pushq %rdx
pushq %rsi
pushq %rdi
pushq %r8
pushq %r11
movq $.LC1, %rdi # Format string
movq %rax, %rsi # Value
movq $0, %rax # No vector registers used
call printf # Call C function
popq %r11 # Restore registers
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rax
incq %r8 #increment index
jmp loop0
exit0:
movq %rbp, %rsp
popq %rbp
ret
...
BTW: .string "%i \n" makes printf read only the lower 32 bits of the value passed in RSI. Use .string "%lli \n" instead.
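That is, the read-only-data entry becomes (a sketch; the label matches the code above):
.LC1:
.string "%lli \n" # long-long conversion: printf now reads all 64 bits of %rsi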

Does _printf require pre-additional space on the stack for it to work? [duplicate]

I know that OS X requires 16-byte stack alignment, but I don't really understand why it is causing an error here.
All I am doing here is passing an object size (which is 24) in %rdi and calling malloc. Does this error mean I have to ask for 32 bytes?
And the error message is:
libdyld.dylib`stack_not_16_byte_aligned_error:
-> 0x7fffc12da2fa <+0>: movdqa %xmm0, (%rsp)
0x7fffc12da2ff <+5>: int3
libdyld.dylib`_dyld_func_lookup:
0x7fffc12da300 <+0>: pushq %rbp
0x7fffc12da301 <+1>: movq %rsp, %rbp
Here is the code:
Object_copy:
pushq %rbp
movq %rbp, %rsp
subq $8, %rsp
movq %rdi, 8(%rsp) # save self address
movq obj_size(%rdi), %rax # get object size
imul $8, %rax
movq %rax, %rdi
callq _malloc <------------------- error in this call
# rsi old object address
# rax new object address
# rdi object size, multiple of 8
# rcx temp reg
# copy object tag
movq 0(%rsi), %rcx
movq %rcx, 0(%rax)
# set rdx to counter, starting from 8
movq $8, %rdx
# add 8 to object size, since we are starting from 8
addq $8, %rdi
start_loop:
cmpq %rdx, %rdi
jle end_loop
movq (%rdx, %rsi, 1), %rcx
movq %rcx, (%rdx, %rax, 1)
addq $8, %rdx
jmp start_loop
end_loop:
leave
ret
Main_protoObj:
.quad 5 ; object tag
.quad 3 ; object size
.quad Main_dispatch_table ; dispatch table
_main:
leaq Main_protoObj(%rip), %rdi
callq Object_copy # copy main proto object
subq $8, %rsp # save the main object on the stack
movq %rax, 8(%rsp)
movq %rax, %rdi # set rdi point to SELF
callq Main_init
callq Main_main
addq $8, %rsp # restore stack
leaq _term_msg(%rip), %rax
callq _print_string
Like you said, macOS has a 16-byte stack-alignment requirement: %rsp must be a multiple of 16 at the point of every call instruction (equivalently, %rsp % 16 == 8 on entry to the callee, right after the return address has been pushed).
When the stack is misaligned, code that relies on that guarantee faults: the movdqa in the error message is an aligned 16-byte store, and dyld runs it against the misaligned %rsp precisely so you stop at stack_not_16_byte_aligned_error instead of crashing somewhere deep inside malloc.
Before you call a routine in your code, you need to make sure that your stack is aligned correctly; in this case, that means %rsp (not %rbp) must be divisible by 16 at the callq _malloc.
subq $8, %rsp # stack is misaligned by 8 bytes
movq %rdi, 8(%rsp) #
movq obj_size(%rdi), %rax #
imul $8, %rax #
movq %rax, %rdi #
callq _malloc # stack is still misaligned when this is called
To fix this, you can subtract 16 from %rsp instead of 8 (any multiple of 16 preserves the alignment).
subq $16, %rsp # stack is still aligned
movq %rdi, 8(%rsp) # keep the store inside the 16 bytes just reserved
... #
callq _malloc # stack is still aligned when this is called, good
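Putting that together, here is a minimal sketch of the start of Object_copy with the stack kept aligned (note that a standard prologue copies %rsp into %rbp, the opposite operand order from the code in the question; the rest of the routine would continue as before):
Object_copy:
pushq %rbp # %rsp was 16n+8 at entry; now a multiple of 16
movq %rsp, %rbp # standard prologue
subq $16, %rsp # reserve 16 bytes; %rsp stays a multiple of 16
movq %rdi, 8(%rsp) # save self address inside the reserved area
movq obj_size(%rdi), %rax # get object size
imul $8, %rax
movq %rax, %rdi
callq _malloc # %rsp is 16-byte aligned here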

Why is GCC exchanging rax and xmm0 registers? [closed]

Closed 7 years ago: this question was caused by a typo or a problem that can no longer be reproduced, and it is not accepting answers.
I was verifying some assembly generated by gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) and realized that the following instructions were being generated:
movq %xmm0, %rax
movq %rax, %xmm0
I'd like to know what the purpose of these instructions is, since they appear to cancel each other out. Is it some kind of optimization, like when we do:
xor ax, ax
To be clear, this code appears only when I use the option -mtune=native; my CPU is an Intel Core i5-4200U.
Following is my source code:
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include "print.h"
void multiply(const unsigned int* array1, const unsigned int* array2, unsigned int* array3, const unsigned int array_size)
{
unsigned int i = 0;
for (i = 0; i < array_size; i++)
{
array3[i] = array1[i] * array2[i];
}
}
int main()
{
const unsigned int array_size = 1024*1024;
unsigned int* array1 = (unsigned int*)malloc(sizeof(unsigned int) * array_size);
unsigned int* array2 = (unsigned int*)malloc(sizeof(unsigned int) * array_size);
unsigned int* array3 = (unsigned int*)malloc(sizeof(unsigned int) * array_size);
int i = 0;
srand(time(NULL));
for (i = 0; i < array_size; i++)
{
array1[i] = rand();
array2[i] = rand();
}
clock_t t0 = clock();
multiply(array1,array2,array3, array_size);
multiply(array1,array2,array3, array_size);
clock_t t1 = clock();
printf("\nTempo: %f\n", ((double)(t1 - t0)) / CLOCKS_PER_SEC);
}
This is the assembly generated by GCC using gcc -S -mtune=native Main.c:
.file "Main.c"
.text
.globl multiply
.type multiply, #function
multiply:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movq %rdx, -40(%rbp)
movl %ecx, -44(%rbp)
movl $0, -4(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rdx
movq -40(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rcx
movq -24(%rbp), %rax
addq %rcx, %rax
movl (%rax), %ecx
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rsi
movq -32(%rbp), %rax
addq %rsi, %rax
movl (%rax), %eax
imull %ecx, %eax
movl %eax, (%rdx)
addl $1, -4(%rbp)
.L2:
movl -4(%rbp), %eax
cmpl -44(%rbp), %eax
jb .L3
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size multiply, .-multiply
.section .rodata
.LC1:
.string "\nTempo: %f\n"
.text
.globl main
.type main, #function
main:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $56, %rsp
.cfi_offset 3, -24
movl $1048576, -60(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -56(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -48(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -40(%rbp)
movl $0, -64(%rbp)
movl $0, %edi
call time
movl %eax, %edi
call srand
movl $0, -64(%rbp)
jmp .L5
.L6:
movl -64(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
leaq (%rdx,%rax), %rbx
call rand
movl %eax, (%rbx)
movl -64(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -48(%rbp), %rax
leaq (%rdx,%rax), %rbx
call rand
movl %eax, (%rbx)
addl $1, -64(%rbp)
.L5:
movl -64(%rbp), %eax
cmpl -60(%rbp), %eax
jb .L6
call clock
movq %rax, -32(%rbp)
movl -60(%rbp), %ecx
movq -40(%rbp), %rdx
movq -48(%rbp), %rsi
movq -56(%rbp), %rax
movq %rax, %rdi
call multiply
movl -60(%rbp), %ecx
movq -40(%rbp), %rdx
movq -48(%rbp), %rsi
movq -56(%rbp), %rax
movq %rax, %rdi
call multiply
call clock
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
subq -32(%rbp), %rax
pxor %xmm0, %xmm0
cvtsi2sdq %rax, %xmm0
movsd .LC0(%rip), %xmm1
divsd %xmm1, %xmm0
movq %xmm0, %rax
movq %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
addq $56, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE3:
.size main, .-main
.section .rodata
.align 8
.LC0:
.long 0
.long 1093567616
.ident "GCC: (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010"
.section .note.GNU-stack,"",#progbits
And this with gcc -S Main.c:
.file "Main.c"
.text
.globl multiply
.type multiply, #function
multiply:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movq %rdx, -40(%rbp)
movl %ecx, -44(%rbp)
movl $0, -4(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rdx
movq -40(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rcx
movq -24(%rbp), %rax
addq %rcx, %rax
movl (%rax), %ecx
movl -4(%rbp), %eax
leaq 0(,%rax,4), %rsi
movq -32(%rbp), %rax
addq %rsi, %rax
movl (%rax), %eax
imull %ecx, %eax
movl %eax, (%rdx)
addl $1, -4(%rbp)
.L2:
movl -4(%rbp), %eax
cmpl -44(%rbp), %eax
jb .L3
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size multiply, .-multiply
.section .rodata
.LC1:
.string "\nTempo: %f\n"
.text
.globl main
.type main, #function
main:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $56, %rsp
.cfi_offset 3, -24
movl $1048576, -60(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -56(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -48(%rbp)
movl -60(%rbp), %eax
salq $2, %rax
movq %rax, %rdi
call malloc
movq %rax, -40(%rbp)
movl $0, -64(%rbp)
movl $0, %edi
call time
movl %eax, %edi
call srand
movl $0, -64(%rbp)
jmp .L5
.L6:
movl -64(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
leaq (%rdx,%rax), %rbx
call rand
movl %eax, (%rbx)
movl -64(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -48(%rbp), %rax
leaq (%rdx,%rax), %rbx
call rand
movl %eax, (%rbx)
addl $1, -64(%rbp)
.L5:
movl -64(%rbp), %eax
cmpl -60(%rbp), %eax
jb .L6
call clock
movq %rax, -32(%rbp)
movl -60(%rbp), %ecx
movq -40(%rbp), %rdx
movq -48(%rbp), %rsi
movq -56(%rbp), %rax
movq %rax, %rdi
call multiply
movl -60(%rbp), %ecx
movq -40(%rbp), %rdx
movq -48(%rbp), %rsi
movq -56(%rbp), %rax
movq %rax, %rdi
call multiply
call clock
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
subq -32(%rbp), %rax
pxor %xmm0, %xmm0
cvtsi2sdq %rax, %xmm0
movsd .LC0(%rip), %xmm1
divsd %xmm1, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
addq $56, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE3:
.size main, .-main
.section .rodata
.align 8
.LC0:
.long 0
.long 1093567616
.ident "GCC: (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010"
.section .note.GNU-stack,"",#progbits
The difference is right after the .L5 loop: the -mtune=native version has the extra movq %xmm0, %rax / movq %rax, %xmm0 pair between the divsd and the call to printf.

Understanding exactly how the increased efficiency is achieved in Assembly language

I have generated two assembly files - one that is optimized, and one that is not. The assembly-language code generated with optimization on should be more efficient than the other assembly-language code. I am more interested in how the efficiency is achieved. To my understanding, in the non-optimized version there will always have to be an offset call to the register %rbp to find the address. In the optimized version, the addresses are being stored in the registers, so you don't have to rely and call on %rbp to find them.
Am I correct? And if so, would there ever be a time when the optimized version will not be advantageous? Thank you for your time.
Here is a function that converts from RGB to CMYK.
void rgb2cmyk(int r, int g, int b, int ret[]) {
int c = 255 - r;
int m = 255 - g;
int y = 255 - b;
int k = (c < m) ? (c < y ? c : y) : (m < y ? m : y);
c -= k; m -= k; y -= k;
ret[0] = c; ret[1] = m; ret[2] = y; ret[3] = k;
}
Here is the assembly-language code that has not been optimized. Note that I have marked my own comments with ;; in the code.
No Opt:
.section __TEXT,__text,regular,pure_instructions
.globl _rgb2cmyk
.align 4, 0x90
_rgb2cmyk: ## #rgb2cmyk
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
;;initializing variable c, m, y
movl $255, %eax
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movq %rcx, -24(%rbp)
movl %eax, %edx
subl -4(%rbp), %edx
movl %edx, -28(%rbp)
movl %eax, %edx
subl -8(%rbp), %edx
movl %edx, -32(%rbp)
subl -12(%rbp), %eax
movl %eax, -36(%rbp)
movl -28(%rbp), %eax
;;compare
cmpl -32(%rbp), %eax
jge LBB0_5
## BB#1:
movl -28(%rbp), %eax
cmpl -36(%rbp), %eax
jge LBB0_3
## BB#2:
movl -28(%rbp), %eax
movl %eax, -44(%rbp) ## 4-byte Spill
jmp LBB0_4
LBB0_3:
movl -36(%rbp), %eax
movl %eax, -44(%rbp) ## 4-byte Spill
LBB0_4:
movl -44(%rbp), %eax ## 4-byte Reload
movl %eax, -48(%rbp) ## 4-byte Spill
jmp LBB0_9
LBB0_5:
movl -32(%rbp), %eax
cmpl -36(%rbp), %eax
jge LBB0_7
## BB#6:
movl -32(%rbp), %eax
movl %eax, -52(%rbp) ## 4-byte Spill
jmp LBB0_8
LBB0_7:
movl -36(%rbp), %eax
movl %eax, -52(%rbp) ## 4-byte Spill
LBB0_8:
movl -52(%rbp), %eax ## 4-byte Reload
movl %eax, -48(%rbp) ## 4-byte Spill
LBB0_9:
movl -48(%rbp), %eax ## 4-byte Reload
movl %eax, -40(%rbp)
movl -40(%rbp), %eax
movl -28(%rbp), %ecx
subl %eax, %ecx
movl %ecx, -28(%rbp)
movl -40(%rbp), %eax
movl -32(%rbp), %ecx
subl %eax, %ecx
movl %ecx, -32(%rbp)
movl -40(%rbp), %eax
movl -36(%rbp), %ecx
subl %eax, %ecx
movl %ecx, -36(%rbp)
movl -28(%rbp), %eax
movq -24(%rbp), %rdx
movl %eax, (%rdx)
movl -32(%rbp), %eax
movq -24(%rbp), %rdx
movl %eax, 4(%rdx)
movl -36(%rbp), %eax
movq -24(%rbp), %rdx
movl %eax, 8(%rdx)
movl -40(%rbp), %eax
movq -24(%rbp), %rdx
movl %eax, 12(%rdx)
popq %rbp
retq
.cfi_endproc
.subsections_via_symbols
Optimization:
.section __TEXT,__text,regular,pure_instructions
.globl _rgb2cmyk
.align 4, 0x90
_rgb2cmyk: ## #rgb2cmyk
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
movl $255, %r8d
movl $255, %eax
subl %edi, %eax
movl $255, %edi
subl %esi, %edi
subl %edx, %r8d
cmpl %edi, %eax ##;; compare m and c
jge LBB0_2
## BB#1: ;; c < m
cmpl %r8d, %eax ## compare y and c
movl %r8d, %edx
cmovlel %eax, %edx
jmp LBB0_3
LBB0_2: ##;; c >= m
cmpl %r8d, %edi ## compare y and m
movl %r8d, %edx
cmovlel %edi, %edx
LBB0_3:
subl %edx, %eax
subl %edx, %edi
subl %edx, %r8d
movl %eax, (%rcx)
movl %edi, 4(%rcx)
movl %r8d, 8(%rcx)
movl %edx, 12(%rcx)
popq %rbp
retq
.cfi_endproc
.subsections_via_symbols
Yes. The optimized version performs many fewer memory read operations by storing intermediate values in registers and not reloading them over and over.
You are using call wrong. It is a technical term that means to push a return address on the stack and branch to a new location for instructions. The term you mean is simply to use the register.
Can you think of a reason that longer, slower code is "better"?
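To make the first point concrete, here is how the two listings above compute c = 255 - r, slightly condensed:
# without optimization: r round-trips through the stack frame
movl %edi, -4(%rbp) # spill r to memory
movl $255, %edx
subl -4(%rbp), %edx # c = 255 - r, reloading r from memory
movl %edx, -28(%rbp) # spill c to memory
# with optimization: the values never leave registers
movl $255, %eax
subl %edi, %eax # c = 255 - r, r is still in %edi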
