I've been trying to translate this function to assembly:
void foo (int a[], int n) {
int i;
int s = 0;
for (i=0; i<n; i++) {
s += a[i];
if (a[i] == 0) {
a[i] = s;
s = 0;
}
}
}
But something is going wrong.
That's what I've done so far:
.section .text
.globl foo
foo:
.L1:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $0, -16(%rbp) /*s*/
movl $0, -8(%rbp) /*i*/
jmp .L2
.L2:
cmpl -8(%rbp), %esi
jle .L4
leave
ret
.L3:
addl $1, -8(%rbp)
jmp .L2
.L4:
movl -8(%rbp), %eax
imull $4, %eax
movslq %eax, %rax
addq %rdi, %rax
movl (%rax), %eax
addl %eax, -16(%rbp)
cmpl $0, %eax
jne .L3
/* if */
leaq (%rax), %rdx
movl -16(%rbp), %eax
movl %eax, (%rdx)
movl $0, -16(%rbp)
jmp .L3
I am compiling the .s module with a .c module, for example, with an int nums [5] = {65, 23, 11, 0, 34} and I'm getting back the same array instead of {65, 23, 11 , 99, 34}.
Could someone help me?
Presumably you have a compiler that can generate AT&T syntax. It might be more instructive to look at what assembly output the compiler generates. Here's my re-formulation of your demo:
#include <stdio.h>
void foo (int a[], int n)
{
for (int s = 0, i = 0; i < n; i++)
{
if (a[i] != 0)
s += a[i];
else
a[i] = s, s = 0;
}
}
int main (void)
{
int nums[] = {65, 23, 11, 0, 34};
int size = sizeof(nums) / sizeof(int);
foo(nums, size);
for (int i = 0; i < size; i++)
fprintf(stdout, i < (size - 1) ? "%d, " : "%d\n", nums[i]);
return (0);
}
Compiling without optimizations enabled is typically harder to work through than optimized code, since it loads from and spills results to memory. You won't learn much from it if you're investing time in learning how to write efficient assembly.
Compiling with the Godbolt compiler explorer with -O2 optimizations yields much more efficient code; it's also useful for cutting out unnecessary directives, labels, etc., that would be visual noise in this case.
In my experience, using -O2 optimizations are clever enough to make you rethink your use of registers, refactoring, etc. -O3 can sometimes optimize too agressively - unrolling loops, vectorizing, etc., to easily follow.
Finally, for the case you have presented, there's a perfect compromise: -Os, which enables many of the optimizations of -O2, but not at the expense of increased code size. I'll paste the assembly here just for comparative purposes:
foo:
xorl %eax, %eax
xorl %ecx, %ecx
.L2:
cmpl %eax, %esi
jle .L7
movl (%rdi,%rax,4), %edx
testl %edx, %edx
je .L3
addl %ecx, %edx
jmp .L4
.L3:
movl %ecx, (%rdi,%rax,4)
.L4:
incq %rax
movl %edx, %ecx
jmp .L2
.L7:
ret
Remember that the calling convention passes the pointer to (a) in %rdi, and the 'count' (n) in %rsi. These are the calling conventions being used. Notice that your code does not 'dereference' or 'index' any elements through %rdi. It's definitely worth going stepping through the code - even with pen and paper if it helps - to understand the branch conditions and how reading and writing is performed on element a[i].
Curiously, using the inner loop of your code:
s += a[i];
if (a[i] == 0)
a[i] = s, s = 0;
Appears to generate more efficient code with -Os than the inner loop I used:
foo:
xorl %eax, %eax
xorl %edx, %edx
.L2:
cmpl %eax, %esi
jle .L6
movl (%rdi,%rax,4), %ecx
addl %ecx, %edx
testl %ecx, %ecx
jne .L3
movl %edx, (%rdi,%rax,4)
xorl %edx, %edx
.L3:
incq %rax
jmp .L2
.L6:
ret
A reminder for me to keep things simple!
Related
I am looking for a fast modulo 10 algorithm because I need to speed up my program which does many modulo operations in cycles.
I have checked out this page which compares some alternatives.
As far as I understand it correctly, T3 was the fastest of all.
My question is, how would x % y look like using T3 technique?
I copied T3 technique here for simplicity in case the link gets down.
for (int x = 0; x < max; x++)
{
if (y > (threshold - 1))
{
y = 0; //reset
total += x;
}
y += 1;
}
Regarding to comments, if this is not really faster then regular mod, I am looking for at least 2 times faster modulo than using %.
I have seen many examples with use power of two, but since 10 is not, how can I get it to work?
Edit:
For my program, let's say I have 2 for cycles where n=1 000 000 and m=1000.
Looks like this:
for (i = 1; i <= n; i++) {
D[(i%10)*m] = i;
for (j = 1; j <= m; j++) {
...
}
}
Here's the fastest modulo-10 function you can write:
unsigned mod10(unsigned x)
{
return x % 10;
}
And here's what it looks like once compiled:
movsxd rax, edi
imul rcx, rax, 1717986919
mov rdx, rcx
shr rdx, 63
sar rcx, 34
add ecx, edx
add ecx, ecx
lea ecx, [rcx + 4*rcx]
sub eax, ecx
ret
Note the lack of division/modulus instructions, the mysterious constants, the use of an instruction which was originally intended for complex array indexing, etc. Needless to say, the compiler knows a lot of tricks to make your program as fast as possible. You'll rarely beat it on tasks like this.
You likely can't beat the compiler.
Debug build
// int foo = x % 10;
010341C5 mov eax,dword ptr [x]
010341C8 cdq
010341C9 mov ecx,0Ah
010341CE idiv eax,ecx
010341D0 mov dword ptr [foo],edx
Retail build (doing some ninja math there...)
// int foo = x % 10;
00BD100E mov eax,66666667h
00BD1013 imul esi
00BD1015 sar edx,2
00BD1018 mov ecx,edx
00BD101A shr ecx,1Fh
00BD101D add ecx,edx
00BD101F lea eax,[ecx+ecx*4]
00BD1022 add eax,eax
00BD1024 sub esi,eax
The code isn’t a direct substitute for modulo, it substitutes modulo in that situation. You can write your own mod by analogy (for a, b > 0):
int mod(int a, int b) {
while (a >= b) a -= b;
return a;
}
… but whether that’s faster than % is highly questionable.
This will work for (multiword) values larger than the machineword (but assuming a binary computer ...):
#include <stdio.h>
unsigned long mod10(unsigned long val)
{
unsigned res=0;
res =val &0xf;
while (res>=10) { res -= 10; }
for(val >>= 4; val; val >>= 4){
res += 6 * (val&0xf);
while (res >= 10) { res -= 10; }
}
return res;
}
int main (int argc, char **argv)
{
unsigned long val;
unsigned res;
sscanf(argv[1], "%lu", &val);
res = mod10(val);
printf("%lu -->%u\n", val,res);
return 0;
}
UPDATE:
With some extra effort, you could get the algoritm free of multiplications, and with the proper amount of optimisation we can even get the recursive call inlined:
static unsigned long mod10_1(unsigned long val)
{
unsigned char res=0; //just to show that we don't need a big accumulator
res =val &0xf; // res can never be > 15
if (res>=10) { res -= 10; }
for(val >>= 4; val; val >>= 4){
res += (val&0xf)<<2 | (val&0xf) <<1;
res= mod10_1(res); // the recursive call
}
return res;
}
And the result for mod10_1 appears to be mul/div free and almost without branches:
mod10_1:
.LFB25:
.cfi_startproc
movl %edi, %eax
andl $15, %eax
leal -10(%rax), %edx
cmpb $10, %al
cmovnb %edx, %eax
movq %rdi, %rdx
shrq $4, %rdx
testq %rdx, %rdx
je .L12
pushq %r12
.cfi_def_cfa_offset 16
.cfi_offset 12, -16
pushq %rbp
.cfi_def_cfa_offset 24
.cfi_offset 6, -24
pushq %rbx
.cfi_def_cfa_offset 32
.cfi_offset 3, -32
.L4:
movl %edx, %ecx
andl $15, %ecx
leal (%rcx,%rcx,2), %ecx
leal (%rax,%rcx,2), %eax
movl %eax, %ecx
movzbl %al, %esi
andl $15, %ecx
leal -10(%rcx), %r9d
cmpb $9, %cl
cmovbe %ecx, %r9d
shrq $4, %rsi
leal (%rsi,%rsi,2), %ecx
leal (%r9,%rcx,2), %ecx
movl %ecx, %edi
movzbl %cl, %ecx
andl $15, %edi
testq %rsi, %rsi
setne %r10b
cmpb $9, %dil
leal -10(%rdi), %eax
seta %sil
testb %r10b, %sil
cmove %edi, %eax
shrq $4, %rcx
andl $1, %r10d
leal (%rcx,%rcx,2), %r8d
movl %r10d, %r11d
leal (%rax,%r8,2), %r8d
movl %r8d, %edi
andl $15, %edi
testq %rcx, %rcx
setne %sil
leal -10(%rdi), %ecx
andl %esi, %r11d
cmpb $9, %dil
seta %bl
testb %r11b, %bl
cmovne %ecx, %edi
andl $1, %r11d
andl $240, %r8d
leal 6(%rdi), %ebx
setne %cl
movl %r11d, %r8d
andl %ecx, %r8d
leal -4(%rdi), %ebp
cmpb $9, %bl
seta %r12b
testb %r8b, %r12b
cmovne %ebp, %ebx
andl $1, %r8d
cmovne %ebx, %edi
xorl $1, %ecx
andl %r11d, %ecx
orb %r8b, %cl
cmovne %edi, %eax
xorl $1, %esi
andl %r10d, %esi
orb %sil, %cl
cmove %r9d, %eax
shrq $4, %rdx
testq %rdx, %rdx
jne .L4
popq %rbx
.cfi_restore 3
.cfi_def_cfa_offset 24
popq %rbp
.cfi_restore 6
.cfi_def_cfa_offset 16
movzbl %al, %eax
popq %r12
.cfi_restore 12
.cfi_def_cfa_offset 8
ret
.L12:
movzbl %al, %eax
ret
.cfi_endproc
.LFE25:
.size mod10_1, .-mod10_1
.p2align 4,,15
.globl mod10
.type mod10, #function
I'm trying to understand how a compiler will optimize two if statements that return the same value. Consider the following code at the top of a function:
if (some_ptr == NULL) {
return -1;
}
if (some_other_ptr == NULL) {
return -1;
}
Will the two if statements be combined into one check that is equivalent to:
if (some_ptr == NULL || some_other_ptr == NULL) {
return -1;
}
While the comments emphasize that this behavior is compiler implementation dependent, a look at a specific compiler is helpful in understanding this.
Using the test program:
int main(int argc, char *argv[]) {
srand(time(NULL));
char *some_ptr = (char *) rand();
char *some_other_ptr = (char *) rand();
if (some_ptr == NULL) {
return -1;
}
if (some_other_ptr == NULL) {
return -1;
}
return 0;
}
Using clang on my laptop running OS X, with no optimization (-O0 flag) the assembly output follows the input code closely with no shortcuts.
movslq %eax, %rcx
movq %rcx, -32(%rbp)
cmpq $0, -24(%rbp)
jne LBB0_2
## BB#1:
movl $-1, -4(%rbp)
jmp LBB0_5
LBB0_2:
cmpq $0, -32(%rbp)
jne LBB0_4
## BB#3:
movl $-1, -4(%rbp)
jmp LBB0_5
LBB0_4:
movl $0, -4(%rbp)
LBB0_5:
movl -4(%rbp), %eax
addq $32, %rsp
popq %rbp
retq
.cfi_endproc
But compiling with the highest optimization (-O3 flag) results in some different code.
movl %eax, %ecx
movl $-1, %eax
testl %ebx, %ebx
je LBB0_2
## BB#1:
cmpl $1, %ecx
sbbl %eax, %eax
LBB0_2:
addq $8, %rsp
popq %rbx
popq %rbp
retq
.cfi_endprocemphasized text
In either case, in the case of my version of clang, the compiler does not ever or the two boolean results together, and even in the optimized code, two comparisons are being made via the testl and cmpl instructions.
You could write a compiler that has this behavior if you wanted to!
In a bytecode interpreting loop, after several tests, I'm surprised to see that using switch is the worst choice to make. Making calls to a function pointer array, or using gcc's computed goto extension is always 10~20% faster, the computed goto version being the fastest. I've tested with my real toy VM with 97 instructions and with the mini fake VM pasted below.
Why is using switch the slowest?
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <inttypes.h>
#include <time.h>
enum {
ADD1 = 1,
ADD2,
SUB3,
SUB4,
MUL5,
MUL6,
};
static unsigned int number;
static void add1(void) {
number += 1;
}
static void add2(void) {
number += 2;
}
static void sub3(void) {
number -= 3;
}
static void sub4(void) {
number -= 4;
}
static void mul5(void) {
number *= 5;
}
static void mul6(void) {
number *= 6;
}
static void interpret_bytecodes_switch(uint8_t *bcs) {
while (true) {
switch (*bcs++) {
case 0:
return;
case ADD1:
add1();
break;
case ADD2:
add2();
break;
case SUB3:
sub3();
break;
case SUB4:
sub4();
break;
case MUL5:
mul5();
break;
case MUL6:
mul6();
break;
}
}
}
static void interpret_bytecodes_function_pointer(uint8_t *bcs) {
void (*fs[])(void) = {
NULL,
add1,
add2,
sub3,
sub4,
mul5,
mul6,
};
while (*bcs) {
fs[*bcs++]();
}
}
static void interpret_bytecodes_goto(uint8_t *bcs) {
void *labels[] = {
&&l_exit,
&&l_add1,
&&l_add2,
&&l_sub3,
&&l_sub4,
&&l_mul5,
&&l_mul6,
};
#define JUMP goto *labels[*bcs++]
JUMP;
l_exit:
return;
l_add1:
add1();
JUMP;
l_add2:
add2();
JUMP;
l_sub3:
sub3();
JUMP;
l_sub4:
sub4();
JUMP;
l_mul5:
mul5();
JUMP;
l_mul6:
mul6();
JUMP;
#undef JUMP
}
struct timer {
clock_t start, end;
};
static void timer_start(struct timer *self) {
self->start = clock();
}
static void timer_end(struct timer *self) {
self->end = clock();
}
static double timer_measure(struct timer *self) {
return (double)(self->end - self->start) / CLOCKS_PER_SEC;
}
static void test(void (*f)(uint8_t *), uint8_t *bcs) {
number = 0;
struct timer timer;
timer_start(&timer);
f(bcs);
timer_end(&timer);
printf("%u %.3fs\n", number, timer_measure(&timer));
}
int main(void) {
const int N = 300000000;
srand((unsigned)time(NULL));
uint8_t *bcs = malloc(N + 1);
for (int i = 0; i < N; ++i) {
bcs[i] = rand() % 5 + 1;
}
bcs[N] = 0;
for (int i = 0; i < 10; ++i) {
printf("%d ", bcs[i]);
}
printf("\nswitch\n");
test(interpret_bytecodes_switch, bcs);
printf("function pointer\n");
test(interpret_bytecodes_function_pointer, bcs);
printf("goto\n");
test(interpret_bytecodes_goto, bcs);
return 0;
}
result
~$ gcc vm.c -ovm -std=gnu11 -O3
~$ ./vm
3 4 5 3 3 5 3 3 1 2
switch
3050839589 2.866s
function pointer
3050839589 2.573s
goto
3050839589 2.433s
~$ ./vm
3 1 1 3 5 5 2 4 5 1
switch
3898179583 2.871s
function pointer
3898179583 2.573s
goto
3898179583 2.431s
~$ ./vm
5 5 1 2 3 3 1 2 2 4
switch
954521520 2.869s
function pointer
954521520 2.574s
goto
954521520 2.432s
Below is the relevant disassembly of the code posted here after -O3 optimization.
interpret_bytecodes_switch:
.L8:
addq $1, %rdi
cmpb $6, -1(%rdi)
ja .L8
movzbl -1(%rdi), %edx
jmp *.L11(,%rdx,8)
.L11:
.quad .L10
.quad .L12
.quad .L13
.quad .L14
.quad .L15
.quad .L16
.quad .L17
.L16:
leal (%rax,%rax,4), %eax
jmp .L8
.L15:
subl $4, %eax
jmp .L8
.L14:
subl $3, %eax
jmp .L8
.L13:
addl $2, %eax
jmp .L8
.L12:
addl $1, %eax
jmp .L8
.L10:
movl %eax, number(%rip)
ret
.L17:
leal (%rax,%rax,2), %eax
addl %eax, %eax
jmp .L8
interpret_bytecodes_function_pointer:
pushq %rbx
movq %rdi, %rbx
subq $64, %rsp
movzbl (%rdi), %eax
movq $0, (%rsp)
movq $add1, 8(%rsp)
movq $add2, 16(%rsp)
movq $sub3, 24(%rsp)
movq $sub4, 32(%rsp)
movq $mul5, 40(%rsp)
testb %al, %al
movq $mul6, 48(%rsp)
je .L19
.L23:
addq $1, %rbx
call *(%rsp,%rax,8)
movzbl (%rbx), %eax
testb %al, %al
jne .L23
.L19:
addq $64, %rsp
popq %rbx
ret
interpret_bytecodes_goto:
movzbl (%rdi), %eax
movq $.L27, -72(%rsp)
addq $2, %rdi
movq $.L28, -64(%rsp)
movq $.L29, -56(%rsp)
movq $.L30, -48(%rsp)
movq $.L31, -40(%rsp)
movq $.L32, -32(%rsp)
movq $.L33, -24(%rsp)
movq -72(%rsp,%rax,8), %rax
jmp *%rax
.L33:
movl number(%rip), %eax
leal (%rax,%rax,2), %eax
addl %eax, %eax
movl %eax, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
.L35:
addq $1, %rdi
jmp *%rax
.L28:
addl $1, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
jmp .L35
.L30:
subl $3, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
jmp .L35
.L31:
subl $4, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
jmp .L35
.L32:
movl number(%rip), %eax
leal (%rax,%rax,4), %eax
movl %eax, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
jmp .L35
.L29:
addl $2, number(%rip)
movzbl -1(%rdi), %eax
movq -72(%rsp,%rax,8), %rax
jmp .L35
.L27:
rep ret
switch is slowest because it has to manage default cases and this may add an extra bounds test you didn't implemented.
switch also handles a more general case where case values are not arranged in a so simple sequence, extra computation may be needed for that.
I was in the middle of writing a long answer when you posted the assembly code...
Basically, the goto version uses more "code" to prevent a few (or a single) instructions in each iteration. It's similar to a size vs. speed optimization.
Since your "real work" is negligible, it makes enough of a difference in the benchmark, but in a real world scenario that instruction will become negligible.
You are performing a micro benchmark. Micro benchmarks on modern CPUs can be affected by all kinds of random or unpredicatable effects. There is actually very little difference in execution time. However, in order to make the code comparable, you combined switch and function calls, which in real life wouldn't happen for time critical code.
Recent times I am having a look at assembly IA32, I did a simple toy example:
#include <stdio.h>
int array[10];
int i = 0;
int sum = 0;
int main(void)
{
for (i = 0; i < 10; i++)
{
array[i] = i;
sum += array[i];
}
printf("SUM = %d\n",sum);
return 0;
}
Yes, I know it is not very recommended to use global variables. I compiled the above code without optimizations and using the flag -s, I got this assembly :
main:
...
movl $0, %eax
subl %eax, %esp
movl $0, i
.L2:
cmpl $9, i
jle .L5
jmp .L3
.L5:
movl i, %edx
movl i, %eax
movl %eax, array(,%edx,4)
movl i, %eax
movl array(,%eax,4), %eax
addl %eax, sum
incl i
jmp .L2
Nothing too fancy and easy to understand, it is a normal while loop. Then I compiled the same code with -O2 and got the following assembly :
main:
...
xorl %eax, %eax
movl $0, i
movl $1, %edx
.p2align 2,,3
.L6:
movl sum, %ecx
addl %eax, %ecx
movl %eax, array-4(,%edx,4)
movl %edx, %eax
incl %edx
cmpl $9, %eax
movl %ecx, sum
movl %eax, i
jle .L6
subl $8, %esp
pushl %ecx
pushl $.LC0
call printf
xorl %eax, %eax
leave
ret
In this case it transformed in a do while type of loop. From the above assembly what I am not understanding is why "movl $1, %edx" and then "movl %eax, array-4(,%edx,4)".
The %edx starts with 1 instead of 0 and then when accessing the array it does -4 from the initial position (4 bytes = integer) . Why not simply ?
movl $0, %edx
...
array (,%edx,4)
instead of starting with 1 if you need to do -4 all the time.
I am using "GCC: (GNU) 3.2.3 20030502 (Red Hat Linux 3.2.3-24)", for educational reasons to generate easily understandable assembly.
I think I finally get the point, I test with:
...
int main(void)
{
for (i = 0; i < 10; i+=2)
{
...
}
}
and got:
movl $2, %edx
and with for (i = 0; i < 10; i +=3) and got :
movl $3, %edx
and finally with (i = 1; i < 10; i +=3) and got:
movl $4, %edx
Therefore, the compiler is initializing %edx = i (initial value of i) + incrementStep;
I guess you all heard of the 'swap problem'; SO is full of questions about it.
The version of the swap without use of a third variable is often considered to be faster since, well, you have one variable less. I wanted to know what was going on behind the curtains and wrote the following two programs:
int main () {
int a = 9;
int b = 5;
int swap;
swap = a;
a = b;
b = swap;
return 0;
}
and the version without third variable:
int main () {
int a = 9;
int b = 5;
a ^= b;
b ^= a;
a ^= b;
return 0;
}
I generated the assembly code using clang and got this for the first version (that uses a third variable):
...
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl $0, %eax
movl $0, -4(%rbp)
movl $9, -8(%rbp)
movl $5, -12(%rbp)
movl -8(%rbp), %ecx
movl %ecx, -16(%rbp)
movl -12(%rbp), %ecx
movl %ecx, -8(%rbp)
movl -16(%rbp), %ecx
movl %ecx, -12(%rbp)
popq %rbp
ret
Leh_func_end0:
...
and this for the second version (that does not use a third variable):
...
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl $0, %eax
movl $0, -4(%rbp)
movl $9, -8(%rbp)
movl $5, -12(%rbp)
movl -12(%rbp), %ecx
movl -8(%rbp), %edx
xorl %ecx, %edx
movl %edx, -8(%rbp)
movl -8(%rbp), %ecx
movl -12(%rbp), %edx
xorl %ecx, %edx
movl %edx, -12(%rbp)
movl -12(%rbp), %ecx
movl -8(%rbp), %edx
xorl %ecx, %edx
movl %edx, -8(%rbp)
popq %rbp
ret
Leh_func_end0:
...
The second one is longer but I don't know much about assembly code so I have no idea if that means that it is slower so I'd like to hear the opinion of someone more knowledgable about it.
Which of the above versions of a variable swap is faster and takes less memory?
Look at some optimised assembly. From
void swap_temp(int *restrict a, int *restrict b){
int temp = *a;
*a = *b;
*b = temp;
}
void swap_xor(int *restrict a, int *restrict b){
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
gcc -O3 -std=c99 -S -o swapping.s swapping.c produced
.file "swapping.c"
.text
.p2align 4,,15
.globl swap_temp
.type swap_temp, #function
swap_temp:
.LFB0:
.cfi_startproc
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
ret
.cfi_endproc
.LFE0:
.size swap_temp, .-swap_temp
.p2align 4,,15
.globl swap_xor
.type swap_xor, #function
swap_xor:
.LFB1:
.cfi_startproc
movl (%rsi), %edx
movl (%rdi), %eax
xorl %edx, %eax
xorl %eax, %edx
xorl %edx, %eax
movl %edx, (%rsi)
movl %eax, (%rdi)
ret
.cfi_endproc
.LFE1:
.size swap_xor, .-swap_xor
.ident "GCC: (SUSE Linux) 4.5.1 20101208 [gcc-4_5-branch revision 167585]"
.section .comment.SUSE.OPTs,"MS",#progbits,1
.string "Ospwg"
.section .note.GNU-stack,"",#progbits
To me, swap_temp looks as efficient as can be.
The problem with XOR swap trick is that it's strictly sequential. It may seem deceptively fast, but in reality, it is not. There's an instruction called XCHG that swaps two registers, but this can also be slower than simply using 3 MOVs, due to its atomic nature. The common technique with temp is an excellent choice ;)
To get an idea of the cost imagine that every command has a cost to be performed and also the indirect addressing has its own cost.
movl -12(%rbp), %ecx
This line will need something like a time unit for accessing the value in ecx register,
one time unit for accessing rbp, another one for applying the offset (-12) and more time
units (let's say arbitrarily 3) for moving the value from the address stored in ecx to the
address indicated from -12(%rbp).
If you count all the operations in every line and all line, the second method is for sure costlier than the first one.