Variable swap with and without auxiliary variable - which is faster?

Variable swap with and without auxiliary variable - which is faster? - c

I guess you all heard of the 'swap problem'; SO is full of questions about it.
The version of the swap without use of a third variable is often considered to be faster since, well, you have one variable less. I wanted to know what was going on behind the curtains and wrote the following two programs:
int main () {
int a = 9;
int b = 5;
int swap;
swap = a;
a = b;
b = swap;
return 0;
}
and the version without third variable:
int main () {
int a = 9;
int b = 5;
a ^= b;
b ^= a;
a ^= b;
return 0;
}
I generated the assembly code using clang and got this for the first version (that uses a third variable):
...
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl $0, %eax
movl $0, -4(%rbp)
movl $9, -8(%rbp)
movl $5, -12(%rbp)
movl -8(%rbp), %ecx
movl %ecx, -16(%rbp)
movl -12(%rbp), %ecx
movl %ecx, -8(%rbp)
movl -16(%rbp), %ecx
movl %ecx, -12(%rbp)
popq %rbp
ret
Leh_func_end0:
...
and this for the second version (that does not use a third variable):
...
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl $0, %eax
movl $0, -4(%rbp)
movl $9, -8(%rbp)
movl $5, -12(%rbp)
movl -12(%rbp), %ecx
movl -8(%rbp), %edx
xorl %ecx, %edx
movl %edx, -8(%rbp)
movl -8(%rbp), %ecx
movl -12(%rbp), %edx
xorl %ecx, %edx
movl %edx, -12(%rbp)
movl -12(%rbp), %ecx
movl -8(%rbp), %edx
xorl %ecx, %edx
movl %edx, -8(%rbp)
popq %rbp
ret
Leh_func_end0:
...
The second one is longer but I don't know much about assembly code so I have no idea if that means that it is slower so I'd like to hear the opinion of someone more knowledgable about it.
Which of the above versions of a variable swap is faster and takes less memory?

Look at some optimised assembly. From
void swap_temp(int *restrict a, int *restrict b){
int temp = *a;
*a = *b;
*b = temp;
}
void swap_xor(int *restrict a, int *restrict b){
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
gcc -O3 -std=c99 -S -o swapping.s swapping.c produced
.file "swapping.c"
.text
.p2align 4,,15
.globl swap_temp
.type swap_temp, #function
swap_temp:
.LFB0:
.cfi_startproc
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
ret
.cfi_endproc
.LFE0:
.size swap_temp, .-swap_temp
.p2align 4,,15
.globl swap_xor
.type swap_xor, #function
swap_xor:
.LFB1:
.cfi_startproc
movl (%rsi), %edx
movl (%rdi), %eax
xorl %edx, %eax
xorl %eax, %edx
xorl %edx, %eax
movl %edx, (%rsi)
movl %eax, (%rdi)
ret
.cfi_endproc
.LFE1:
.size swap_xor, .-swap_xor
.ident "GCC: (SUSE Linux) 4.5.1 20101208 [gcc-4_5-branch revision 167585]"
.section .comment.SUSE.OPTs,"MS",#progbits,1
.string "Ospwg"
.section .note.GNU-stack,"",#progbits
To me, swap_temp looks as efficient as can be.

The problem with XOR swap trick is that it's strictly sequential. It may seem deceptively fast, but in reality, it is not. There's an instruction called XCHG that swaps two registers, but this can also be slower than simply using 3 MOVs, due to its atomic nature. The common technique with temp is an excellent choice ;)

To get an idea of the cost imagine that every command has a cost to be performed and also the indirect addressing has its own cost.
movl -12(%rbp), %ecx
This line will need something like a time unit for accessing the value in ecx register,
one time unit for accessing rbp, another one for applying the offset (-12) and more time
units (let's say arbitrarily 3) for moving the value from the address stored in ecx to the
address indicated from -12(%rbp).
If you count all the operations in every line and all line, the second method is for sure costlier than the first one.

Related

Translating assembly

So I'm learning how to convert the assembly into readable C code. The assembly is as follows...
Consider the compiler places the C variables: a at -4(%rbp), b at -8(%rbp), and c at -12(rbp).
file "main.c"
.text
.globl main
.type main, #function
main:
endbr64
pushq %rbp
movq %rsp, %rbp
movl $10, -12(%rbp)
movl $20, -4(%rbp)
movl $1, -8(%rbp)
.L4:
cmpl $1, -12(%rbp)
je .L7
movl -8(%rbp), %eax
imull -12(%rbp), %eax
movl %eax, -8(%rbp)
subl $1, -12(%rbp)
jmp .L4
.L7:
nop
movl -4(%rbp), %eax
imull -8(%rbp), %eax
movl %eax, -4(%rbp)
movl $0, %eax
popq %rbp
ret
This is what I have so far.
int c = 10;
int a = 20;
int b = 1;
for(c = 10; c > 1; c--)
{
int x = b;
x = c * x;
b = x;
}
Not completely sure how correct that is. The part that confuses me the most is the appearance (from what seems like out of nowhere) of eax. When eax appears, should I just assume that it is some other random variable? (hence the integer x I introduced)

Assembly Code from C program

I have a C program which has a function decod and the function has the following statements.
My decode.c script:
int decod(int x, int y, int z) {
int ty = y;
ty = ty - z;
int py = ty;
py = py << 31;
py = py >> 31;
ty = ty * x;
py = py ^ ty;
}
The assembly code of this program (generated by gcc -S decod.c) shows the following code.
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movl %edx, -28(%rbp)
movl -24(%rbp), %eax
movl %eax, -8(%rbp)
movl -28(%rbp), %eax
subl %eax, -8(%rbp)
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
sall $31, -4(%rbp)
sarl $31, -4(%rbp)
movl -8(%rbp), %eax
imull -20(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
xorl %eax, -4(%rbp)
popq %rbp
.cfi_def_cfa 7, 8
ret
But, I want the program generate an assembly file with only the following lines of code.
subl %edx, %esi
movl %esi, %eax
sall $31, %eax
sarl $31, %eax
imull %edi, %esi
xorl %esi, %eax
ret
I know I am pretty close to write a program which will generate the above mentioned code. But, I am clueless why the script generates different assembly code. Any direction will be helpful.

If you compile your function as is, in optimization level3, -O3 the entire function is optimized out. This is because there is no return value and py and ty are anyways discarded after the function.
For reference the code is below
.globl decod
.def decod; .scl 2; .type 32; .endef
.seh_proc decod
decod:
.seh_endprologue
ret
.seh_endproc
If however, you add a return py; at the end the code generated is as follows.
.globl decod
.def decod; .scl 2; .type 32; .endef
.seh_proc decod
decod:
.seh_endprologue
subl %r8d, %edx
movl %edx, %eax
imull %edx, %ecx
sall $31, %eax
sarl $31, %eax
xorl %ecx, %eax
ret
.seh_endproc
This is functionally identical to what you are expecting.

Why does GCC produce the following asm output?

I don't understand why gcc -S -m32 produces these particular lines of code:
movl %eax, 28(%esp)
movl $desc, 4(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call sort_gen_asm
My question is why %eax is pushed and then popped? And why movl used instead of pushl and popl respectively? Is it faster? Is there some coding convention I don't yet know? I've just started looking at asm-output closely, so I don't know much.
The C code:
void print_array(int *data, size_t sz);
void sort_gen_asm(array_t*, comparer_t);
int main(int argc, char *argv[]) {
FILE *file;
array_t *array;
file = fopen("test", "rb");
if (file == NULL) {
err(EXIT_FAILURE, NULL);
}
array = array_get(file);
sort_gen_asm(array, desc);
print_array(array->data, array->sz);
array_destroy(array);
fclose(file);
return 0;
}
It gives this output:
.file "main.c"
.section .rodata
.LC0:
.string "rb"
.LC1:
.string "test"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $.LC0, 4(%esp)
movl $.LC1, (%esp)
call fopen
movl %eax, 24(%esp)
cmpl $0, 24(%esp)
jne .L2
movl $0, 4(%esp)
movl $1, (%esp)
call err
.L2:
movl 24(%esp), %eax
movl %eax, (%esp)
call array_get
movl %eax, 28(%esp)
movl $desc, 4(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call sort_gen_asm
movl 28(%esp), %eax
movl 4(%eax), %edx
movl 28(%esp), %eax
movl (%eax), %eax
movl %edx, 4(%esp)
movl %eax, (%esp)
call print_array
movl 28(%esp), %eax
movl %eax, (%esp)
call array_destroy
movl 24(%esp), %eax
movl %eax, (%esp)
call fclose
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE2:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.8.1-10ubuntu8) 4.8.1"
.section .note.GNU-stack,"",#progbits

The save / load of eax is because you did not compile with optimizations. So any read/write of a variable will emit a read/write of a memory address.
Actually, for (almost) any line of code you will be able to identify the exact piece of assembler code resulting from it (let me advise you to compile with gcc -g -c -O0 and then objdump -S file.o):
#array = array_get(file);
call array_get
movl %eax, 28(%esp) #write array
#sort_gen_asm(array, desc);
movl 28(%esp), %eax #read array
movl %eax, (%esp)
...
About not pushing/poping, it is a standard zero-cost optimization. Instead of push/pop every time you want to call a function you just substract the maximum needed space to esp at the beginning of the function and then save your function arguments at the bottom of the empty space. There are a lot of advantages: faster code (no changing esp), it doesn't need to compute the argument in any particular order, and the esp will need to be substracted anyway for the local variables space.

Some things have to do with calling conventions. Others with optimisations.
sort_gen_asm seems to use cdecl calling convention which requires it's arguments to be pushed onto the stack in reverse order. thus:
movl $desc, 4(%esp)
movl %eax, (%esp)
The other moves are partially unoptimised compiler routines:
movl %eax, 28(%esp) # save contents of %eax on the stack before calling
movl 28(%esp), %eax # retrieve saved 28(%esp) in order to prepare it as an argument
# Unoptimised compiler seems to have forgotten that it's
# still in the register

assembly code of the c function

I'm trying to understand the assembly code of the C function. I could not understand why andl -16 is done at the main. Is it for allocating space for the local variables. If so why subl 32 is done for main.
I could not understand the disassembly of the func1. As read the stack grows from higher order address to low order address for 8086 processors. So here why is the access on positive side of the ebp(for parameters offset) and why not in the negative side of ebp. The local variables inside the func1 is 3 + return address + saved registers - So it has to be 20, but why is it 24? (subl $24,esp)
#include<stdio.h>
int add(int a, int b){
int res = 0;
res = a + b;
return res;
}
int func1(int a){
int s1,s2,s3;
s1 = add(a,a);
s2 = add(s1,a);
s3 = add(s1,s2);
return s3;
}
int main(){
int a,b;
a = 1;b = 2;
b = func1(a);
printf("\n a : %d b : %d \n",a,b);
return 0;
}
assembly code :
.file "sample.c"
.text
.globl add
.type add, #function
add:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $0, -4(%ebp)
movl 12(%ebp), %eax
movl 8(%ebp), %edx
leal (%edx,%eax), %eax
movl %eax, -4(%ebp)
movl -4(%ebp), %eax
leave
ret
.size add, .-add
.globl func1
.type func1, #function
func1:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl 8(%ebp), %eax
movl %eax, 4(%esp)
movl 8(%ebp), %eax
movl %eax, (%esp)
call add
movl %eax, -4(%ebp)
movl 8(%ebp), %eax
movl %eax, 4(%esp)
movl -4(%ebp), %eax
movl %eax, (%esp)
call add
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
movl %eax, 4(%esp)
movl -4(%ebp), %eax
movl %eax, (%esp)
call add
movl %eax, -12(%ebp)
movl -12(%ebp), %eax
leave
ret
.size func1, .-func1
.section .rodata
.LC0:
.string "\n a : %d b : %d \n"
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $32, %esp
movl $1, 28(%esp)
movl $2, 24(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call func1
movl %eax, 24(%esp)
movl $.LC0, %eax
movl 24(%esp), %edx
movl %edx, 8(%esp)
movl 28(%esp), %edx
movl %edx, 4(%esp)
movl %eax, (%esp)
call printf
movl $0, %eax
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5"
.section .note.GNU-stack,"",#progbits

The andl $-16, %esp aligns the stack pointer to a multiple of 16 bytes, by clearing the low four bits.
The only places where positive offsets are used with (%ebp) are parameter accesses.
You did not state what your target platform is or what switches you used to compile with. The assembly code shows some Ubuntu identifier has been inserted, but I am not familiar with the ABI it uses, beyond that it is probably similar to ABIs generally used with the Intel x86 architecture. So I am going to guess that the ABI requires 8-byte alignment at routine calls, and so the compiler makes the stack frame of func1 24 bytes instead of 20 so that 8-byte alignment is maintained.
I will further guess that the compiler aligned the stack to 16 bytes at the start of main as a sort of “preference” in the compiler, in case it uses SSE instructions that prefer 16-byte alignment, or other operations that prefer 16-byte alignment.
So, we have:
In main, the andl $-16, %esp aligns the stack to a multiple of 16 bytes as a compiler preference. Inside main, 28(%esp) and 24(%esp) refer to temporary values the compiler saves on the stack, while 8(%esp), 4(%esp), and (%esp) are used to pass parameters to func1 and printf. We see from the fact that the assembly code calls printf but it is commented out in your code that you have pasted C source code that is different from the C source code used to generate the assembly code: This is not the correct assembly code generated from the C source code.
In func1, 24 bytes are allocated on the stack instead of 20 to maintain 8-byte alignment. Inside func1, parameters are accessed through 8(%ebp) and 4(%ebp). Locations from -12(%ebp) to -4(%ebp) are used to hold values of your variables. 4(%esp) and (%esp) are used to pass parameters to add.
Here is the stack frame of func1:
- 4(%ebp) = 20(%esp): s1.
- 8(%ebp) = 16(%esp): s2.
-12(%ebp) = 12(%esp): s3.
-16(%ebp) = 8(%esp): Unused padding.
-20(%ebp) = 4(%esp): Passes second parameter of add.
-24(%ebp) = 0(%esp): Passes first parameter of add.

I would suggest working through this with the output of objdump -S which will give you interlisting with the C source.

i have a question about stack frame structure?

i'm a new one to learn assembly. i write a c file:
#include <stdlib.h>
int max( int c )
{
int d;
d = c + 1;
return d;
}
int main( void )
{
int a = 0;
int b;
b = max( a );
return 0;
}
and i use gcc -S as01.c and create a assembly file.
.file "as01.c"
.text
.globl max
.type max, #function
max:
pushl %ebp
movl %esp, %ebp
subl $32, %esp
movl $0, -4(%ebp)
movl $1, -24(%ebp)
movl $2, -20(%ebp)
movl $3, -16(%ebp)
movl $4, -12(%ebp)
movl $6, -8(%ebp)
movl 8(%ebp), %eax
addl $1, %eax
movl %eax, -4(%ebp)
movl -4(%ebp), %eax
leave
ret
.size max, .-max
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
subl $20, %esp
movl $0, -4(%ebp)
movl -4(%ebp), %eax
movl %eax, (%esp)
call max
movl %eax, -8(%ebp)
"as01.s" 38L, 638C
i' confused, beacuse movl %eax, -4(%ebp) movl -4(%ebp), %eax in max(),
i know that %eax is used for returning the value of any function.
I think %eax is a temporarily register for store the c + 1.
This is right?
thank you for your answer.

You don't have optimisation turned on, so the compiler is generating really bad code. The primary storage for all your values is in the stack frame, and values are loaded into registers only long enough to do the calculations.
The code actually breaks down into:
pushl %ebp
movl %esp, %ebp
subl $32, %esp
Standard function prologue, setting up a new stack frame, and reserving 50 bytes for the stack frame.
movl $0, -4(%ebp)
movl $1, -24(%ebp)
movl $2, -20(%ebp)
movl $3, -16(%ebp)
movl $4, -12(%ebp)
movl $6, -8(%ebp)
Fill the stack frame with dummy values (presumably as a debugging aid).
movl 8(%ebp), %eax
addl $1, %eax
movl %eax, -4(%ebp)
Read the parameter c out of the stack frame, add one to it, store it into a (different) stack slot.
movl -4(%ebp), %eax
leave
ret
Read the value back out of the stack slot and return it.
If you compile this with optimisation, you'll see most of the code vanish. If you use -fomit-frame-pointer -Os, you should end up with this:
max:
movl 4(%esp), %eax
incl %eax
ret

movl %eax, -4(%ebp)
Here the value computed for d (now stored in eax) is saved in d's memory cell.
movl -4(%ebp), %eax
While here the return value (d's) gets loaded into eax, because, as you know, eax holds functions' return value.
As #David said, you're compiling without optimization, so gcc generates easy-to-debug code, which is quite inefficient and repetitive sometimes.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Variable swap with and without auxiliary variable - which is faster? - c

Related

Translating assembly

Assembly Code from C program

Why does GCC produce the following asm output?

assembly code of the c function

i have a question about stack frame structure?

Categories

Resources