How 'smart' is GCC's Tail-Call-Optimisation? - c

I just had a discussion in which the following two pieces of C code came up:
For-Loop:
#include <stdio.h>
#define n (196607)
int main() {
long loop;
int count=0;
for (loop=0;loop<n;loop++) {
count++;
}
printf("Result = %d\n",count);
return 0;
}
Recursive:
#include <stdio.h>
#define n (196607)
long recursive(long loop) {
return (loop>0) ? recursive(loop-1)+1: 0;
}
int main() {
long result;
result = recursive(n);
printf("Result = %ld\n",result);
return 0;
}
On seeing this code, I saw recursive(loop-1)+1 and thought "ah, that's not tail call recursive" because it has work to do after the call to recursive is complete; it needs to increment the return value.
Sure enough, with no optimisation, the recursive code triggers a stack overflow, as you would expect.
With the -O2 flag, however, the stack overflow does not occur, which I take to mean that the stack frame is reused rather than a new one being pushed for every call, i.e. tail-call optimisation (TCO).
GCC can obviously detect this simple case (+1 to return value) and optimise it out, but how far does it go?
What are the limits to what gcc can optimise with tco, when the recursive call isn't the last operation to be performed?
Addendum:
I've written a fully tail recursive return function(); version of the code.
Wrapping the above in a loop with 9999999 iterations, I came up with the following timings:
$ for f in *.exe; do time ./$f > results; done
+ for f in '*.exe'
+ ./forLoop.c.exe
real 0m3.650s
user 0m3.588s
sys 0m0.061s
+ for f in '*.exe'
+ ./recursive.c.exe
real 0m3.682s
user 0m3.588s
sys 0m0.093s
+ for f in '*.exe'
+ ./tail_recursive.c.exe
real 0m3.697s
user 0m3.588s
sys 0m0.077s
So an (admittedly simple and not very rigorous) benchmark shows that all three versions take roughly the same time.

Just disassemble the code and see what happened. Without optimizations, I get this:
0x0040150B cmpl $0x0,0x10(%rbp)
0x0040150F jle 0x401523 <recursive+35>
0x00401511 mov 0x10(%rbp),%eax
0x00401514 sub $0x1,%eax
0x00401517 mov %eax,%ecx
0x00401519 callq 0x401500 <recursive>
But with -O1, -O2 or -O3 I get this:
0x00402D09 mov $0x2ffff,%edx
This has nothing to do with tail-call optimization; it is a much more radical optimization. The compiler simply inlined the whole function and pre-calculated the result: 0x2ffff is 196607, the value of n.
This is likely why you end up with the same result in all your different cases of benchmarking.
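To make the question about `recursive(loop-1)+1` concrete, here is a minimal sketch of the accumulator rewrite that turns that call into a true tail call, and of the loop such a tail call can become. The names `count_down_acc` and `count_down_loop` are hypothetical, and whether a particular GCC version performs this rewrite by itself is an assumption to verify in the disassembly, not a guarantee.

```c
#include <assert.h>

/* Original shape: the +1 happens after the call returns, so this is not a tail call. */
long recursive(long loop) {
    return (loop > 0) ? recursive(loop - 1) + 1 : 0;
}

/* Accumulator form: the pending +1 is carried in acc, so the recursive
   call is the last operation performed. Callers start with acc == 0. */
long count_down_acc(long loop, long acc) {
    assert(acc >= 0); /* the accumulator starts at 0 and only grows */
    return (loop > 0) ? count_down_acc(loop - 1, acc + 1) : acc;
}

/* What a tail call like the one above can be turned into:
   a plain loop that uses a constant amount of stack. */
long count_down_loop(long loop) {
    long acc = 0;
    while (loop > 0) {
        loop--;
        acc++;
    }
    return acc;
}
```

All three functions return the same value for the same input; only the stack usage differs when no optimization is applied.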

Related

C - efficiently changing a function pointer based on command line input

I have several similar functions, say A, B and C, and I want to choose one of them with a command-line option. The chosen function is called billions of times, so instead of checking a variable inside the function on every call, I define a function pointer Phi and set it to the desired function just once. But when I hard-code Phi = A (so no user input is considered), my code runs in ~24 s, whereas when I add an if-else and set Phi from the user input, it runs in ~30 s with exactly the same parameters. (The command-line option sets Phi to A, of course.) What is an efficient way to handle this case?
My functions:
double funcA(double r)
{
return 0;
}
double funcB(double r)
{
return 1;
}
double funcC(double r)
{
return r;
}
void computationFunctionFast(Context *userInputs) {
double (*Phi)(double) = funcA;
/* computation codes */
}
void computationFunctionSlow(Context *userInputs) {
double (*Phi)(double);
switch (userInputs->funcEnum) {
case A:
Phi = funcA;
break;
case B:
Phi = funcB;
break;
case C:
Phi = funcC;
}
/* computation codes */
}
I've tried gcc, clang and icx with -O2 and -O3. (gcc shows no performance difference between the two cases, but it is the slowest overall.) Although I'm using C, I've tried std::function too. I've also tried defining Phi in different scopes, etc.
Generally, there are a few things here that are slightly bad for performance:
Branches/comparisons lead to inefficient use of branch prediction/instruction cache and might affect pipelining too.
Function pointers are notoriously inefficient since they generally block inlining and generally the compiler can't do much about them.
Here's an example based on your code:
double computationFunctionSlow (int input, double val) {
double (*Phi)(double);
switch (input) {
case 0: Phi = funcA; break;
case 1: Phi = funcB; break;
case 2: Phi = funcC; break;
}
double res = Phi(val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
cmp edi, 2
ja .LBB3_1
movsxd rax, edi
lea rcx, [rip + .Lswitch.table.computationFunctionSlow]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.LBB3_1:
xorps xmm0, xmm0
ret
.Lswitch.table.computationFunctionSlow:
.quad funcA
.quad funcB
.quad funcC
Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp. Even when I include a default: return 0; it is still there. You can quite easily manually optimize any switch with contiguous indices like this into a function pointer jump table:
double computationFunctionSlow (int input, double val) {
double (*Phi[3])(double) = {funcA, funcB, funcC};
double res = Phi[input](val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
movsxd rax, edi
lea rcx, [rip + .L__const.computationFunctionSlow.Phi]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.L__const.computationFunctionSlow.Phi:
.quad funcA
.quad funcB
.quad funcC
This leads to slightly better code here, as the comparison and branch are now removed. Note that the range check is removed along with them, so input must be known to be 0, 1 or 2. However, this is really a micro-optimization that shouldn't have much impact on performance; you have to benchmark it to see whether there is any improvement.
(Also, gcc 12.2 didn't optimize this code as well, which is why I went with clang for this example.)
Godbolt link: https://godbolt.org/z/ja4zerj7o
There isn't a more "efficient" way to handle this case, you are already doing what you should.
The difference in timing you observe is because:
In the first case (Phi = funcA) the compiler knows the function will always be the same and is therefore able to optimize its calls. Depending on what your "computation code" does, this could mean inlining the function and simplifying a lot of calculations for you.
In the second case (Phi = <choice from user>) the compiler cannot know which function will be selected, and therefore cannot optimize any of the calls made to it by the rest of the code. It also cannot propagate optimizations to other parts of your "computation code" like in the first case.
In general, there isn't much you can do. Dynamic function pointers inherently add a bit of runtime overhead and make optimizations harder (or impossible).
What you could try is duplicating the "computation code" inside different functions or different branches that you only enter after asserting that Phi is equal to a constant, like so:
void computationFunctionSlow(Context *userInputs) {
if (userInputs->funcEnum == A) {
double (*const Phi)(double) = funcA; /* const pointer, not a const return type */
// computation code
} else if (...) {
// ...
}
}
In the above piece of code, the compiler knows that inside each of those if blocks Phi can only have one value, and may therefore be able to perform the same optimizations discussed in point 1 above.
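One way to get that effect without copying the computation code by hand is to write the hot loop once as an inline helper and call it from each branch with a constant function. This is only a sketch under assumptions: `func_kind`, `sum_phi` and `compute` are hypothetical names, the summation stands in for the real computation code, and whether the compiler actually inlines and specializes each call site has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FUNC_A, FUNC_B, FUNC_C } func_kind; /* hypothetical stand-in for funcEnum */

static double funcA(double r) { (void)r; return 0; }
static double funcB(double r) { (void)r; return 1; }
static double funcC(double r) { return r; }

/* The hot loop, written once. When it is inlined into a call site that
   passes a constant function, the compiler can see which function phi is. */
static inline double sum_phi(double (*phi)(double), const double *xs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += phi(xs[i]);
    return sum;
}

/* The branch on the user's choice is taken once, outside the loop. */
double compute(func_kind kind, const double *xs, size_t n) {
    assert(xs != NULL || n == 0);
    switch (kind) {
    case FUNC_A: return sum_phi(funcA, xs, n);
    case FUNC_B: return sum_phi(funcB, xs, n);
    case FUNC_C: return sum_phi(funcC, xs, n);
    }
    return 0.0; /* unknown kind */
}
```

The trade-off is code size: each case can end up with its own specialized copy of the loop.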
There's no need to put an enum in your userInputs when all you do with it is use it to select a function pointer. Just add the function pointer in the structure directly and eliminate the branching done on every call.
Instead of
struct Context
{
.
.
.
enum funcType funcEnum;
};
use
struct Context
{
.
.
.
double (*phi)(double);
};
You'd wind up with something like this:
void computationFunctionSlow(Context *userInputs) {
/* computation codes */
double result = userInputs->phi( data );
}
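As a rough sketch of how the pointer could be filled in once while parsing the command line: the option strings "A", "B" and "C", the `phi_fn` typedef and the `select_phi` helper are assumptions made for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef double (*phi_fn)(double);

static double funcA(double r) { (void)r; return 0; }
static double funcB(double r) { (void)r; return 1; }
static double funcC(double r) { return r; }

typedef struct {
    phi_fn phi; /* chosen once at startup, then called directly */
} Context;

/* Map an option string to a function; returns NULL for an unknown name. */
phi_fn select_phi(const char *name) {
    assert(name != NULL);
    if (strcmp(name, "A") == 0) return funcA;
    if (strcmp(name, "B") == 0) return funcB;
    if (strcmp(name, "C") == 0) return funcC;
    return NULL;
}
```

A caller would set `ctx.phi = select_phi(argv[1])` once, reject a NULL result, and use `ctx.phi(x)` everywhere else. The call is still indirect, so this removes the repeated branching but not the inlining barrier discussed above.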

will gcc optimization remove for loop if it's only one iteration?

I'm writing a real-time DSP processing library.
My intention is to make the input block size (blockSize) configurable, while also getting the best possible performance for sample-by-sample processing, that is, a block size of a single sample.
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
This leads me to a question:
Will gcc compiler optimize this code
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo()
}
or
//.h
#define BLOCKSIZE 1
//.c
for (volatile int i=0; i<BLOCKSIZE; i++)
{
foo()
}
to be same as simply calling body of the loop:
foo()
?
Thx
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, ie volatile uint8_t*. There is no need to make the pointer itself volatile, ie uint8_t* volatile //wrong.
As things stand now, you force the compiler to create a variable i and increase it, which will likely block loop unrolling optimizations.
Trying your code on gcc x86 with -O3, this is exactly what happens. No matter the value of BLOCKSIZE, it still generates the loop because of volatile. If I drop volatile, it completely unrolls the loop up to BLOCKSIZE == 7 and replaces it with a series of function calls. Beyond 8 it creates a loop (but keeps the iterator in a register instead of RAM).
x86 example:
for (int i=0; i<5; i++)
{
foo();
}
gives
call foo
call foo
call foo
call foo
call foo
But
for (volatile int i=0; i<5; i++)
{
foo();
}
gives the far less efficient
mov DWORD PTR [rsp+12], 0
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jg .L2
.L3:
call foo
mov eax, DWORD PTR [rsp+12]
add eax, 1
mov DWORD PTR [rsp+12], eax
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jle .L3
.L2:
For further study of the correct use of volatile in embedded systems, please see:
How to access a hardware register from firmware?
Using volatile in embedded C development
Since the loop variable is volatile, the compiler shouldn't optimize the loop away. It cannot know whether i will be 1 when the condition is evaluated, so it has to keep the loop.
From the compiler's point of view, the loop can run an indeterminate number of times until the condition is satisfied.
If you access hardware registers somewhere, then those are what should be declared volatile. That makes more sense to the reader and also allows the compiler to apply appropriate optimizations where possible.
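A minimal sketch of that split, assuming the samples really do live in memory the compiler cannot see being changed: the data is accessed through pointers to volatile, while the loop counter stays an ordinary variable. The name `process_block` and the halving step are placeholders for the real per-sample work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* in and out stand in for hardware registers or DMA buffers (hypothetical). */
void process_block(const volatile int32_t *in, volatile int32_t *out, size_t block_size) {
    assert(in != NULL && out != NULL);
    /* i is an ordinary variable, so the compiler may keep it in a register,
       unroll the loop, or drop the loop when block_size is known to be 1. */
    for (size_t i = 0; i < block_size; i++) {
        out[i] = in[i] / 2; /* each volatile element is read once and written once */
    }
}
```

With a compile-time block size of 1 and the function visible for inlining, the volatile accesses remain but the loop machinery around them has no reason to.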
The volatile keyword tells the compiler that the variable is prone to side effects, i.e. it can be changed by something that is not visible to the compiler.
Because of that, volatile variables have to be read before every use and written back to their permanent storage location after every modification.
In your example the loop cannot be optimized, as the variable i can be changed during the loop (for example, some interrupt routine could set it to zero, so the loop would have to be executed again).
The answer to your question is: if the compiler can determine that every time you enter the loop it will execute exactly once, then it can eliminate the loop.
Normally, the optimization phase unrolls loops based on how the iterations relate to one another. This makes the loop body several times bigger in exchange for fewer backward jumps (which normally cause a bubble in the pipeline, depending on the CPU type), but not so much bigger that cache hits are lost. It is a bit complicated, but the gains can be large. If your loop always executes exactly once, that is normally because the test you wrote is always true (a tautology) or always false (an impossible condition) and can be eliminated; this makes the backward jump unnecessary, so there is no loop anymore.
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo(); // you missed a semicolon here.
}
In your case, the variable is assigned a value that is never changed afterwards, so the first thing the compiler will do is replace every use of the variable with the literal you assigned to it. (Lacking context, I assume blockSize is a local automatic variable that is not changed anywhere else.) Your code changes into:
for (volatile int i=0; i<1; i++)
{
foo();
}
Next, the volatile is not necessary, since the variable's scope is the loop, whose body does not use it, so the loop can be replaced by code like the following:
do {
foo();
} while (0);
which in turn can be replaced by:
foo();
The compiler analyses the graph of dependencies between data and variables. When a variable is not needed anymore (it is not used later in the program, or it goes out of scope), assigning a value to it is unnecessary, so that code is eliminated. If you compile a for loop that counts from 1 to 2^64 and then stops, with optimization enabled, you will see the loop being thrown away and may get the false idea that your processor can count from 1 to 2^64 in less than a second. That is not true: 2^64 is still far too big a number to count to in less than a second. That loop is not a single fixed pass like yours, but the calculations done in it are of no use, so the compiler eliminates them.
Just test the following program (in this case it is not a test of a just one pass loop, but 2^64-1 executions):
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
uint64_t low = 0UL;
uint64_t high = ~0UL;
uint64_t data = 0; // this data is updated in the loop body.
printf("counting from %lu to %lu\n", low, high);
alarm(10); /* security break after 10 seconds */
for (uint64_t i = low; i < high; i++) {
#if 0
printf("data = %lu\n", data = i ); // either here...
#else
data = i; // or here...
#endif
}
return 0;
}
(You can change the #if 0 to #if 1 to see how the optimizer doesn't eliminate the loop when you need to print the results, but you see that the program is essentially the same, except for the call to printf with the result of the assignment)
Just compile it with/without optimization:
$ cc -O0 pru.c -o pru_noopt
$ cc -O2 pru.c -o pru_optim
and then run it under time:
$ time pru_noopt
counting from 0 to 18446744073709551615
Alarm clock
real 0m10,005s
user 0m9,848s
sys 0m0,000s
while running the optimized version gives:
$ time pru_optim
counting from 0 to 18446744073709551615
real 0m0,002s
user 0m0,002s
sys 0m0,002s
(Impossible: not even the best computer can count, one by one, up to that number in less than 2 milliseconds.) So the loop must have gone somewhere. You can check in the assembler code. As the updated value of data is not used after the assignment, the loop body can be eliminated, and so can the 2^64-1 executions of it.
Now add the following line after the loop:
printf("data = %lu\n", data);
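You will see that then, even with the -O3 option, the loop can no longer be discarded as dead code, because the value after all the assignments is used after the loop. (Whether the compiler keeps the loop as written or computes the final value directly depends on the compiler.)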
(I preferred not to show the assembler code, and remain in high level, but you can have a look at the assembler code and see the actual generated code)

Benchmarking C struct comparison: XOR vs ==

Say we have a simple struct in C that has 4 fields:
typedef struct {
int a;
int b;
int c;
int d;
} value_st;
Let's take a look at these two short versions of C struct equal check.
The first one is straight-forward and does the following:
int compare1(const value_st *x1, const value_st *x2) {
return ( (x1->a == x2->a) && (x1->b == x2->b) &&
(x1->c == x2->c) && (x1->d == x2->d) );
}
The second one uses XOR:
int compare2(const value_st *x1, const value_st *x2) {
return ( (x1->a ^ x2->a) | (x1->b ^ x2->b) |
(x1->c ^ x2->c) | (x1->d ^ x2->d) );
}
The first version returns nonzero if both structs are equal,
and the second version returns zero iff the two structs are equal.
Compiler Output
Compiling with GCC -O2 and examining the assembly looks like what we expect.
The first version is 4 CMP instructions and JMPS:
xor %eax,%eax
mov (%rsi),%edx
cmp %edx,(%rdi)
je 0x9c0 <compare1+16>
repz retq
nopw 0x0(%rax,%rax,1)
mov 0x4(%rsi),%ecx
cmp %ecx,0x4(%rdi)
jne 0x9b8 <compare1+8>
mov 0x8(%rsi),%ecx
cmp %ecx,0x8(%rdi)
jne 0x9b8 <compare1+8>
mov 0xc(%rsi),%eax
cmp %eax,0xc(%rdi)
sete %al
movzbl %al,%eax
retq
The second version looks like this:
mov (%rdi),%eax
mov 0x4(%rdi),%edx
xor (%rsi),%eax
xor 0x4(%rsi),%edx
or %edx,%eax
mov 0x8(%rdi),%edx
xor 0x8(%rsi),%edx
or %edx,%eax
mov 0xc(%rdi),%edx
xor 0xc(%rsi),%edx
or %edx,%eax
retq
So the second version has:
no branches
fewer instructions
Benchmarking
static uint64_t
now_msec() {
struct timespec spec;
clock_gettime(CLOCK_MONOTONIC, &spec);
return ((uint64_t)spec.tv_sec * 1000) + (spec.tv_nsec / 1000000);
}
void benchmark() {
uint64_t start = now_msec();
uint64_t sum = 0;
for (uint64_t i = 0; i < 1e10; i++) {
if (compare1(&x1, &x2)) {
sum++;
}
}
uint64_t delta_ms = now_msec() - start;
// use sum and delta here
}
Enough iterations to filter out the time it takes to call clock_gettime()
But here is the thing I don't get...
When I benchmark equal structs where all the instructions need to be executed,
the first version is faster...
time took for compare == is 3114 [ms] [matches: 10000000000]
time took for compare XOR is 3177 [ms] [matches: 10000000000]
How is this possible ?
Even with branch prediction, XOR should be super fast instruction and
not lose to CMP/JMP
Update
Couple of important notes:
This question is mainly to understand the outcome. not to try to beat the compiler or create an obscure code - it is always better to write clean code and let the compiler optimize
We assume the structs are in the cache, otherwise the dominating factor will be obviously the memory lookup
Branch prediction will obviously play a part...but can it be better than branchless code (given that most of the time we execute all the code) ?
memcmp will require zero padding in the struct and also might need a loop / if in most standard implementations, as it supports variable size comparison
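For reference, a memcmp-based variant for this particular struct might look like the sketch below. The name `compare3` is hypothetical, and the static assertion encodes the assumption that `value_st` has no padding bytes; if it had any, memcmp would compare them too.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    int a;
    int b;
    int c;
    int d;
} value_st;

/* memcmp compares every byte, padding included, so refuse to compile
   if the struct is larger than its four int members. */
_Static_assert(sizeof(value_st) == 4 * sizeof(int),
               "value_st must not contain padding");

/* Returns nonzero if the two structs are equal, like compare1. */
int compare3(const value_st *x1, const value_st *x2) {
    assert(x1 != NULL && x2 != NULL);
    return memcmp(x1, x2, sizeof *x1) == 0;
}
```

Because the size is a compile-time constant, a compiler is free to expand the call inline instead of looping, but that is something to confirm in the assembly for the compiler at hand.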
Update 2
Many have stated that the difference is tiny per call...this is true but is consistent which means that this difference is in favor of the first version in many consecutive runs
Update 3
I've copied my test code to a lab server with an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
The XOR version runs almost two times faster on the server for GCC 8.
Tried with both clang and GCC 8:
For GCC 8:
time took for compare == is 7432 [ms] [matches: 3000000000]
time took for compare XOR is 4214 [ms] [matches: 3000000000]
for Clang:
time took for compare == is 4265 [ms] [matches: 3000000000]
time took for compare XOR is 5508 [ms] [matches: 3000000000]
So it seems like this is very compiler and CPU dependent.
Well, in the first case there are 4 mov's and 4 cmp's. In the second case there are 4 mov's, 4 xor's and 4 or's. As jmp's not taken take in effect no time, the first version is faster. (cmp and xor do basically the same thing and should execute in the same amount of time)
The moral of the story here is that you should never try to outsmart your compiler, it really knows better (at least in 99.99% of cases)
And never obscure the intent of your program in an effort to make it faster, unless you have hard evidence it is (1) needed and (2) effective.
time took for compare == is 3114 [ms] [matches: 10000000000]
time took for compare XOR is 3177 [ms] [matches: 10000000000]
How is this possible ?
Because actual execution time is affected by many factors out of your control, which is why you should never rely on a single run of a benchmarking program to make any decisions. Run it many times, under different load conditions, and average the results.
Secondly, this run shows a difference of 63 milliseconds out of a little over 3 seconds, or 2%, for ten billion comparisons between the two methods. As far as a person sitting in front of the screen is concerned, that's barely noticeable. If your results consistently showed a difference of a full second or more that would be worth investigating, but this is down in the noise.
And finally, what is going to be the more common operation in the real code - comparing identical structs or non-identical structs? If the second case is going to be more common, even if just by a bare majority of 51%, then the == method will be significantly faster on average due to short-circuiting.
When optimizing code, look at the big picture - don't hyperfocus on a single operation. You'll wind up writing code that's hard to read, harder to maintain, and probably not as optimized as you think it is.
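The advice about repeating the measurement can be sketched as a small harness that times the same work several times and reports one summary value. This is an assumption-laden illustration: `bench_median_nsec` and `median_u64` are hypothetical names, and the median is used here instead of the plain average suggested above, as a substitution that is less sensitive to a single noisy run.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint64_t now_nsec(void) {
    struct timespec spec;
    clock_gettime(CLOCK_MONOTONIC, &spec);
    return (uint64_t)spec.tv_sec * 1000000000u + (uint64_t)spec.tv_nsec;
}

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns the median
   (the upper middle element when count is even). */
uint64_t median_u64(uint64_t *samples, size_t count) {
    assert(samples != NULL && count > 0);
    qsort(samples, count, sizeof samples[0], cmp_u64);
    return samples[count / 2];
}

/* Runs work() `trials` times and returns the median duration in
   nanoseconds, or 0 if the sample buffer cannot be allocated. */
uint64_t bench_median_nsec(void (*work)(void), size_t trials) {
    assert(work != NULL && trials > 0);
    uint64_t *samples = malloc(trials * sizeof *samples);
    if (samples == NULL)
        return 0;
    for (size_t t = 0; t < trials; t++) {
        uint64_t start = now_nsec();
        work();
        samples[t] = now_nsec() - start;
    }
    uint64_t median = median_u64(samples, trials);
    free(samples);
    return median;
}
```

Each comparison function would be wrapped in its own `work` function and measured in a separate run of the program, so that one measurement does not warm up caches for the next.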

Faster approach to checking for an all-zero buffer in C?

I am searching for a faster method of accomplishing this:
int is_empty(char * buf, int size)
{
int i;
for(i = 0; i < size; i++) {
if(buf[i] != 0) return 0;
}
return 1;
}
I realize I'm searching for a micro optimization unnecessary except in extreme cases, but I know a faster method exists, and I'm curious what it is.
On many architectures, comparing 1 byte takes the same amount of time as 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to e.g., use a vector unit.
Also, branch mispredictions really hurt, so it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bitwise-OR them together and test the final result.
Expressing this is difficult in portable C: casting a char* to long* violates strict aliasing. But fortunately you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will optimize it to the asm you want.
For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) from loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2, because many CPUs will set flags according to an OR result, so this lets you check two words for the price of one.
Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
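The memcpy approach described above can be sketched roughly as follows. This is not the linked work-in-progress implementation: it assumes fixed 8-byte `uint64_t` words instead of `uint_fast32_t`, skips the manual alignment step, and falls back to a byte loop for the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all size bytes of buf are zero, 0 otherwise. */
int is_empty(const char *buf, size_t size) {
    assert(buf != NULL || size == 0);
    size_t i = 0;

    /* Main loop: load two 8-byte words, OR them together, test once. */
    for (; i + 2 * sizeof(uint64_t) <= size; i += 2 * sizeof(uint64_t)) {
        uint64_t w1, w2;
        memcpy(&w1, buf + i, sizeof w1);             /* alias-safe unaligned load */
        memcpy(&w2, buf + i + sizeof w1, sizeof w2);
        if ((w1 | w2) != 0)
            return 0;
    }

    /* Tail: fewer than 16 bytes remain, checked one at a time. */
    for (; i < size; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}
```

On targets with cheap unaligned loads the two memcpy calls are expected to compile to plain load instructions, but that should be confirmed in the compiler output for the target in question.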
One potential way, inspired by Kieveli's dismissed idea:
int is_empty(char *buf, size_t size)
{
static const char zero[999] = { 0 };
return !memcmp(zero, buf, size > 999 ? 999 : size);
}
Note that you can't make this solution work for arbitrary sizes. You could do this:
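int is_empty(char *buf, size_t size)
{
    char *zero = calloc(size, 1); /* calloc takes a count and an element size */
    if (zero == NULL) return -1;  /* allocation failed (or size was 0) */
    int i = memcmp(zero, buf, size);
    free(zero);
    return i == 0; /* 1 if the buffer is all zeros */
}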
But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution is faster is because it can use memcmp(), which is going to be hand-optimized in assembly language by the library writers and will be much faster than anything you could code in C.
EDIT: An optimization no one else has mentioned, based on earlier observations about the "likelyness" of the buffer to be in state X: If a buffer isn't empty, will it more likely not be empty at the beginning or the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.
EDIT 2: Thanks to Accipitridae in the comments:
int is_empty(char *buf, size_t size)
{
    if (size == 0) return 1; /* nothing to check, and size - 1 would wrap around */
    return buf[0] == 0 && !memcmp(buf, buf + 1, size - 1);
}
This basically compares the buffer to itself shifted by one byte, after an initial check that the first element is zero. That way, any non-zero element will cause memcmp() to report a difference. I don't know how this would compare to the other versions, but I do know that it will fail quickly (before we even loop) if the first element is nonzero. If you're more likely to have cruft at the end, change buf[0] to buf[size - 1] to get the same effect.
The benchmarks given above (https://stackoverflow.com/a/1494499/2154139) are not accurate. They imply that func3 is much faster than the other options.
However, if you change the order of the tests, so that func3 comes before func2, you'd see func2 is much faster.
Careful when running combination benchmarks within a single execution... the side effects are large, especially when reusing the same variables. Better to run the tests isolated!
For example, changing it to:
int main(){
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
}
gives me:
func3: zero 14243
func3: zero 1142
func3: zero 885
func3: zero 848
func3: zero 870
This was really bugging me as I couldn't see how func3 could perform so much faster than func2.
(apologize for the answer, and not as a comment, didn't have reputation)
Four functions for testing zeroness of a buffer with simple benchmarking:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <inttypes.h>
#define SIZE (8*1024)
char zero[SIZE] __attribute__(( aligned(8) ));
#define RDTSC(var) __asm__ __volatile__ ( "rdtsc" : "=A" (var));
#define MEASURE( func ) { \
uint64_t start, stop; \
RDTSC( start ); \
int ret = func( zero, SIZE ); \
RDTSC( stop ); \
printf( #func ": %s %12"PRIu64"\n", ret?"non zero": "zero", stop-start ); \
}
int func1( char *buff, size_t size ){
while(size--) if(*buff++) return 1;
return 0;
}
int func2( char *buff, size_t size ){
return *buff || memcmp(buff, buff+1, size-1);
}
int func3( char *buff, size_t size ){
return *(uint64_t*)buff || memcmp(buff, buff+sizeof(uint64_t), size-sizeof(uint64_t));
}
int func4( char *buff, size_t size ){
return *(wchar_t*)buff || wmemcmp((wchar_t*)buff, (wchar_t*)buff+1, size/sizeof(wchar_t)-1);
}
int main(){
MEASURE( func1 );
MEASURE( func2 );
MEASURE( func3 );
MEASURE( func4 );
}
Result on my old PC:
func1: zero 108668
func2: zero 38680
func3: zero 8504
func4: zero 24768
If your program is x86-only or x64-only, you can easily optimize using inline assembler. The REPE SCASD instruction will scan a buffer until a dword different from EAX is found.
Since there is no equivalent standard library function, probably no compiler/optimizer will be able to use these instructions (as confirmed by Sufian's code).
Off the top of my head, something like this would do if your buffer length is a multiple of 4 bytes (MASM syntax):
_asm {
CLD ; search forward
XOR EAX, EAX ; search for non-zero
LEA EDI, [buf] ; search in buf
MOV ECX, [buflen] ; search buflen bytes
SHR ECX, 2 ; using dwords so len/=4
REPE SCASD ; perform scan
JE bufferEmpty ; ZF still set? then every dword was 0 (ECX is also 0 when only the last dword differs)
}
Tomas
EDIT: updated with Tony D's fixes
For something so simple, you'll need to see what code the compiler is generating.
$ gcc -S -O3 -o empty.s empty.c
And the contents of the assembly:
.text
.align 4,0x90
.globl _is_empty
_is_empty:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %edx ; edx = size
movl 8(%ebp), %ecx ; ecx = pointer to buffer
testl %edx, %edx
jle L3
xorl %eax, %eax
cmpb $0, (%ecx)
jne L5
.align 4,0x90
L6:
incl %eax ; real guts of the loop are in here
cmpl %eax, %edx
je L3
cmpb $0, (%ecx,%eax) ; compare byte-by-byte of buffer
je L6
L5:
leave
xorl %eax, %eax
ret
.align 4,0x90
L3:
leave
movl $1, %eax
ret
.subsections_via_symbols
This is very optimized. The loop does three things:
Increase the offset
Compare the offset to the size
Compare the byte-data in memory at base+offset to 0
It could be optimized slightly more by comparing at a word-by-word basis, but then you'd need to worry about alignment and such.
When all else fails, measure first, don't guess.
Try checking the buffer using an int-sized variable where possible (it should be aligned).
Off the top of my head (uncompiled, untested code follows - there's almost certainly at least one bug here. This just gives the general idea):
/* check the start of the buf byte by byte while it's unaligned */
while (size && ((uintptr_t) buf % sizeof( int)) != 0) { /* uintptr_t: <stdint.h> */
    if (*buf != 0) {
        return 0;
    }
    ++buf;
    --size;
}
/* check the bulk of the buf int by int while it's aligned */
size_t n_ints = size / sizeof( int);
size_t rem = size % sizeof( int); /* bytes left after the last whole int */
int* pInts = (int*) buf;
while (n_ints) {
    if (*pInts != 0) {
        return 0;
    }
    ++pInts;
    --n_ints;
}
/* now wrap up the remaining part of the buf byte by byte */
buf = (char*) pInts;
while (rem) {
    if (*buf != 0) {
        return 0;
    }
    ++buf;
    --rem;
}
return 1;
With x86 you can use SSE to test 16 bytes at a time:
#include <smmintrin.h> // note: requires SSE 4.1
int is_empty(const char *buf, const size_t size)
{
size_t i;
for (i = 0; i + 16 <= size; i += 16)
{
__m128i v = _mm_loadu_si128((const __m128i *)&buf[i]);
if (!_mm_testz_si128(v, v))
return 0;
}
for ( ; i < size; ++i)
{
if (buf[i] != 0)
return 0;
}
return 1;
}
This can probably be further improved with loop unrolling.
On modern x86 CPUs with AVX you can even use 256 bit SIMD and test 32 bytes at a time.
The Hackers Delight book/site is all about optimized C/assembly. Lots of good references from that site also and is fairly up to date (AMD64, NUMA techniques also).
Look at fast memcpy - it can be adapted for memcmp (or memcmp against a constant value).
I see a lot of people saying that alignment issues prevent you from doing word-sized accesses, but that's not always true. If you're looking to write portable code, then this is certainly an issue; however, x86 will actually tolerate misaligned accesses. For example, this will only fail on x86 if alignment checking is turned on in EFLAGS (and, of course, buf is actually not word-aligned).
int is_empty(char * buf, int size) {
int i;
for(i = 0; i + 4 <= size; i += 4) { /* stop before a 4-byte read would run past the end */
if(*(int *)(buf + i) != 0) {
return 0;
}
}
for(; i < size; i++) {
if(buf[i] != 0)
return 0;
}
return 1;
}
Regardless the compiler CAN convert your original loop into a loop of word-based comparisons with extra jumps to handle alignment issues, however it will not do this at any normal optimization level because it lacks information. For cases when size is small, unrolling the loop in this way will make the code slower, and the compiler wants to be conservative.
A way to get around this is to make use of profile guided optimizations. If you let GCC get profile information on the is_empty function then re-compile it, it will be willing to unroll the loop into word-sized comparisons with an alignment check. You can also force this behavior with -funroll-all-loops
Did anyone mention unrolling the loop? In any of these loops, the loop overhead and indexing is going to be significant.
Also, what is the probability that the buffer will actually be empty? That's the only case where you have to check all of it.
If there typically is some garbage in the buffer, the loop should stop very early, so it doesn't matter.
If you plan to clear it to zero if it's not zero, it would probably be faster just to clear it with memset(buf, 0, sizeof(buf)), whether or not it's already zero.
What about looping from size to zero (cheaper checks):
int is_empty(char * buf, int size)
{
while(size --> 0) {
if(buf[size] != 0) return 0;
}
return 1;
}
It must be noted that we probably cannot outperform the compiler, so enable the most aggressive speed optimization in your compiler and assume that you're likely to not go any faster.
Or handling everything using pointers (not tested, but likely to perform quite good):
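int is_empty(char* buf, int size)
{
    char* last;
    if (size <= 0)
        return 1;
    last = buf + size - 1;
    if (*last != 0)
        return 0;
    *last = 1;           /* temporary sentinel so the scan below always stops */
    while (!*buf)
        buf++;
    *last = 0;           /* restore the original byte */
    return buf == last;  /* empty if only the sentinel was nonzero */
}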
You stated in your question that you are looking for a most likely unnecessary micro-optimization. In 'normal' cases the ASM approach by Thomas and others should give you the fastest results.
Still, this is forgetting the big picture. If your buffer is really large, then starting from the start and essentially doing a linear search is definitely not the fastest way to do this. Assume your cp replacement is quite good at finding large consecutive empty regions but has a few non-empty bytes at the end of the array. All linear searches would require reading the whole array. On the other hand, a quicksort-inspired algorithm could search for any non-zero elements and abort much faster for a large enough dataset.
So before doing any kind of micro-optimization I would look closely at the data in your buffer and see if that gives you any patterns. For a single '1', randomly distributed in the buffer a linear search (disregarding threading/parallelization) will be the fastest approach, in other cases not necessarily so.
Inline assembly version of the initial C code. There is no error checking: if uiSize == 0 and/or the array is not allocated, exceptions will be generated. Perhaps use try {} catch(), as this might be faster than adding a lot of checks to the code. Or do as I do and try not to call functions with invalid values (usually does not work). At least add a NULL pointer check and a size != 0 check; that is very easy.
unsigned int IsEmpty(char* pchBuffer, unsigned int uiSize)
{
asm {
push edi
push ecx
mov edi, [pchBuffer] // scasb reads from ES:[EDI], not ESI
mov ecx, [uiSize]
// add NULL ptr and size check here
xor eax, eax // AL = 0 is the value scasb compares against (also sets ZF)
repe scasb // compare AL with BYTE PTR es:[edi], advance EDI by 1, repeat while equal and ECX != 0
jne char_not_zero // ZF clear: the scan stopped on a byte != 0
mov eax, 1 // all bytes were zero; set return value (works in MASM)
jmp done
char_not_zero:
mov eax, 0 // Still not sure if this works in inline asm
done:
pop ecx
pop edi
}
}
That was written on the fly, but it looks correct enough; corrections are welcome. And if someone knows how to set the return value from inline asm, please do tell.
int is_empty(char * buf, int size)
{
int i, content=0;
for(i = 0; !content && i < size; i++)
{
content=content | buf[i]; // bitwise or
}
return (content==0);
}
int is_empty(char * buf, int size)
{
return buf[0] == '\0';
}
If your buffer is not a character string, I think that's the fastest way to check...
memcmp() would require you to create a buffer the same size and then use memset to set it all as 0. I doubt that would be faster...
Edit: Bad answer
A novel approach might be
int is_empty(char * buf, int size) {
char start = buf[0];
char end = buf[size-1];
buf[0] = 'x';
buf[size-1] = '\0';
int result = strlen(buf) == 0;
buf[0] = start;
buf[size-1] = end;
return result;
}
Why the craziness? because strlen is one of the library function that's more likely to be optimized.
Storing and replacing the first character is to prevent the false positive. Storing and replacing the last character is to make sure it terminates.
The initial C algorithm is pretty much as slow as it can be in VALID C.
If you insist on using C then try a "while" loop instead of "for":
int i = 0;
while (i< MAX)
{
// operate on the string
i++;
}
This is pretty much the fastest 1 dimensional string operation loop you can write in C, besides if you can force the compiler to put i in a register with the "register" keyword, but I am told that this is almost always ignored by modern compilers.
Also searching a constant sized array to check if it is empty is very wasteful and also 0 is not empty, it is value in the array.
A better solution for speed would to use a dynamic array (int* piBuffer) and a variable that stores the current size (unsigned int uiBufferSize), when the array is empty then the pointer is NULL, and uiBufferSize is 0. Make a class with these two as protected member variables. One could also easily write a template for dynamic arrays, which would store 32 bit values, either primitive types or pointers, for primitive types there is not really any way to test for "empty" (I interpret this as "undefined"), but you can of course define 0 to represent an available entry. For an array pointers you should initialize all entries to NULL, and set entry to NULL when you have just deallocated that memory. And NULL DOES mean "points at nothing" so this is very convenient way to represent empty. One should not use dynamically resized arrays in really complicated algorithms, at least not in the development phase, there are simply too many things that can go wrong. One should at least first implement the algorithm using an STL Container (or well tested alternative) and then when the code works one can swap the tested container for a simple dynamic array (and if you can avoid resizing the array too often the code will both be faster and more fail safe.
A better solution for complicated and cool code is to use either std::vector or a std::map (or any container class STL, homegrown or 3rd party) depending on your needs, but looking at your code I would say that the std::vector is enough. The STL Containers are templates so they should be pretty fast too. Use STL Container to store object pointers (always store object pointers and not the actual objects, copying entire objects for every entry will really mess up your execution speed) and dynamic arrays for more basic data (bitmap, sound etc.) ie primitive types. Generally.
I came up with the REPE SCASW solution independently by studying x86 assembly language manuals, and I agree that the example using this string operation instruction is the fastest. The other assembly example, which has separate compare, jump etc. instructions, is almost certainly slower (but still much faster than the initial C code, so still a good post), as the string operations are among the most highly optimized on all modern CPUs; they may even have their own logic circuitry (anyone know?).
The REPE SCASD does not need to fetch a new instruction nor increase the instruction pointer, and that is just the stuff an assembly novice like me can come up with; on top of that is the hardware optimization. String operations are critical for almost all kinds of modern software, in particular multimedia applications (copying PCM sound data, uncompressed bitmap data, etc.), so optimizing these instructions must have been a very high priority every time a new 80x86 chip was being designed.
I use it for a novel 2d sprite collision algorithm.
It says that I am not allowed to have an opinion, so consider the following an objective assessment: Modern compilers (UNMANAGED C/C++, pretty much everything else is managed code and is slow as hell) are pretty good at optimizing, but it cannot be avoided that for VERY specific tasks the compiler generates redundant code. One could look at the assembly that the compiler outputs so that one does not have to translate a complicated algorithm entirely from scratch, even though it is very fun to do (for some) and it is much more rewarding doing code the hard way, but anyway, algorithms using "for" loops, in particular with regards to string operations, can often be optimized very significantly as the for loop generates a lot of code, that is often not needed, example:
for (int i = 1000; i>0; i--) DoSomething(); This line generates about 6-10 lines of assembly if the compiler is not very clever (it might be), but the optimized assembly version CAN be:
mov cx, 1000
_DoSomething:
// loop code....or call Func, slower but more readable
loop _DoSomething
That was 2 lines, and it does exactly the same as the C line (it uses registers instead of memory addresses, which is MUCH faster, but arguably this is not EXACTLY the same as the C line, but that is semantics) , how much of an optimization this example is depends on how well modern compilers optimize, which I have no clue on, but the algorithm analysis based on the goal of implementing an algorithm with the fewest and faster assembly lines often works well, I have had very good results with first implementing the algorithm in C/C++ without caring about optimization and then translate and optimize it in assembly. The fact that each C line becomes many assembly lines often makes some optimizations very obvious, and also some instructions are faster than others:
INC DX ; is faster than:
ADD DX,1 ;if ADD DX,1 is not just replaced with INC DX by the assembler or the CPU
LOOP ; is shorter than manually decreasing, comparing and jumping, though on many modern x86 CPUs it is slower than dec/jnz, so measure before relying on it
REPxx STOSx/MOVSx/LODSx is faster than using cmp, je/jne/jae etc and loop
JMP or conditional jumping is faster than using CALL, so in a loop that is executed VERY frequently (like rendering), including functions in the code so it is accessible with "local" jumps can also boost performance.
The last bit is very relevant for this question, fast string operations.
So this post is not all rambling.
And lastly, design you assembly algorithm in the way that requires the least amount of jumps for a typical execution.
Also don't bother optimizing code that is not called that often. Use a profiler to see what code is called most often and start with that; anything that is called less than 20 times a second (and completes much faster than 1000 ms / 20) is not really worth optimizing. Look at code that is not synchronized to timers and the like and is executed again immediately after it has completed. On the other hand, if your rendering loop can do 100+ FPS on a modest machine, it does not make sense economically to optimize it, but real coders love to code and do not care about economics; they optimize the AppStart() method into 100% assembly even though it is only called once :) Or use a z rotation matrix to rotate Tetris pieces 90 degrees :P Anyone who does that is awesome!
If anyone has some constructive correction, which is not VERY hurtful, then I would love to hear it. I code almost entirely by myself, so I am not really exposed to any influences. I once paid a nice Canadian game developer to teach me Direct3D, and though I could just as easily have read a book, the interaction with another coder who was somewhat above my level in certain areas was fun.
Thanks for good content generally. I think I will go and answer some of the simpler questions, give a little back.

What is the fastest way to swap values in C?

I want to swap two integers, and I want to know which of these two implementations will be faster:
The obvious way with a temp variable:
void swap(int* a, int* b)
{
int temp = *a;
*a = *b;
*b = temp;
}
Or the xor version that I'm sure most people have seen:
void swap(int* a, int* b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
It seems like the first uses an extra register, but the second one is doing three loads and stores while the first only does two of each. Can someone tell me which is faster and why? The why being more important.
Number 2 is often quoted as being the "clever" way of doing it. It is in fact most likely slower as it obscures the explicit aim of the programmer - swapping two variables. This means that a compiler can't optimize it to use the actual assembler ops to swap. It also assumes the ability to do a bitwise xor on the objects.
Stick to number 1, it's the most generic and most understandable swap and can be easily templated/genericized.
This wikipedia section explains the issues quite well:
http://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice
The XOR method fails if a and b point to the same address. The first XOR will clear all of the bits at the memory address pointed to by both variables, so once the function returns, *a == 0 and *b == 0 regardless of the initial value.
More info on the Wiki page:
XOR swap algorithm
Although it's not likely that this issue would come up, I'd always prefer to use the method that's guaranteed to work, not the clever method that fails at unexpected moments.
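A minimal sketch of that failure mode, together with the usual guard against it. The function names swap_xor and swap_xor_guarded are made up for this example.

```c
/* Plain XOR swap: breaks when a and b point to the same int. */
void swap_xor(int *a, int *b)
{
    *a ^= *b;   /* if a == b this is x ^ x, which zeroes the value */
    *b ^= *a;
    *a ^= *b;
}

/* Same trick with an address check, so self-swap is a no-op. */
void swap_xor_guarded(int *a, int *b)
{
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```

Calling swap_xor(&x, &x) leaves x at 0 whatever it held before, while swap_xor_guarded(&x, &x) leaves x untouched. The extra branch is part of the price of the "clever" version.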
On a modern processor, you could use the following when sorting large arrays and see no difference in speed:
void swap (int *a, int *b)
{
for (int i = 1 ; i ; i <<= 1)
{
if ((*a & i) != (*b & i))
{
*a ^= i;
*b ^= i;
}
}
}
The really important part of your question is the 'why?' part. Now, going back 20 years to the 8086 days, the above would have been a real performance killer, but on the latest Pentium it would be a match speed wise to the two you posted.
The reason is purely down to memory and has nothing to do with the CPU.
CPU speeds compared to memory speeds have risen astronomically. Accessing memory has become the major bottleneck in application performance. All the swap algorithms will be spending most of their time waiting for data to be fetched from memory. A modern system can have up to 5 levels of memory:
Cache Level 1 - runs at the same speed as the CPU, has negligible access time, but is small
Cache Level 2 - runs a bit slower than L1 but is larger and has a bigger overhead to access (usually, data needs to be moved to L1 first)
Cache Level 3 - (not always present) Often external to the CPU, slower and bigger than L2
RAM - the main system memory, usually implements a pipeline so there's latency in read requests (CPU requests data, message sent to RAM, RAM gets data, RAM sends data to CPU)
Hard Disk - when there's not enough RAM, data is paged to HD which is really slow, not really under CPU control as such.
Sorting algorithms will make memory access worse since they usually access the memory in a very unordered way, thus incurring the inefficient overhead of fetching data from L2, RAM or HD.
So, optimising the swap method is really pointless - if it's only called a few times then any inefficiency is hidden due to the small number of calls, if it's called a lot then any inefficiency is hidden due to the number of cache misses (where the CPU needs to get data from L2 (1's of cycles), L3 (10's of cycles), RAM (100's of cycles), HD (!)).
What you really need to do is look at the algorithm that calls the swap method. This is not a trivial exercise. Although the Big-O notation is useful, an O(n) can be significantly faster than a O(log n) for small n. (I'm sure there's a CodingHorror article about this.) Also, many algorithms have degenerate cases where the code does more than is necessary (using qsort on nearly ordered data could be slower than a bubble sort with an early-out check). So, you need to analyse your algorithm and the data it's using.
Which leads to how to analyse the code. Profilers are useful but you do need to know how to interpret the results. Never use a single run to gather results, always average results over many executions - because your test application could have been paged to hard disk by the OS halfway through. Always profile release, optimised builds, profiling debug code is pointless.
As to the original question - which is faster? - it's like trying to figure out if a Ferrari is faster than a Lamborghini by looking at the size and shape of the wing mirror.
The first is faster because bitwise operations such as xor are usually very hard to visualize for the reader.
Faster to understand of course, which is the most important part ;)
Regarding @Harry:
Never implement functions as macros for the following reasons:
Type safety. There is none. The following only generates a warning when compiling but fails at run time:
float a=1.5f,b=4.2f;
swap (a,b);
A templated function will always be of the correct type (and why aren't you treating warnings as errors?).
EDIT: As there's no templates in C, you need to write a separate swap for each type or use some hacky memory access.
It's a text substitution. The following fails at run time (this time, without compiler warnings):
int a=1,temp=3;
swap (a,temp);
It's not a function. So, it can't be used as an argument to something like qsort.
Compilers are clever. I mean really clever. Made by really clever people. They can do inlining of functions. Even at link time (which is even more clever). Don't forget that inlining increases code size. Big code means more chance of cache miss when fetching instructions, which means slower code.
Side effects. Macros have side effects! Consider:
int &f1 ();
int &f2 ();
void func ()
{
swap (f1 (), f2 ());
}
Here, f1 and f2 will be called twice.
EDIT: A C version with nasty side effects:
int a[10], b[10], i=0, j=0;
swap (a[i++], b[j++]);
Macros: Just say no!
EDIT: This is why I prefer to define macro names in UPPERCASE so that they stand out in the code as a warning to use with care.
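The text-substitution problem from the list above can be reproduced with a small sketch. SWAP_NAIVE is a stand-in for the kind of macro being criticised, and the helper names are invented for this example.

```c
/* A naive swap macro that declares its own variable called temp. */
#define SWAP_NAIVE(a, b) do { int temp = a; a = b; b = temp; } while (0)

/* The function equivalent, for comparison. */
void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Returns 1 if the macro really swapped a and temp, 0 otherwise. */
int macro_swaps_temp(void)
{
    int a = 1, temp = 3;
    /* Expands to: int temp = a; a = temp; temp = temp;
       The macro's own temp hides the caller's variable. */
    SWAP_NAIVE(a, temp);
    return a == 3 && temp == 1;
}

/* Returns 1 if the function really swapped a and temp, 0 otherwise. */
int function_swaps_temp(void)
{
    int a = 1, temp = 3;
    swap_int(&a, &temp);
    return a == 3 && temp == 1;
}
```

Here macro_swaps_temp() returns 0 because both variables keep their original values, while function_swaps_temp() returns 1.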
EDIT2: To answer Leahn Novash's comment:
Suppose we have a non-inlined function, f, that is converted by the compiler into a sequence of bytes then we can define the number of bytes thus:
bytes = C(p) + C(f)
where C() gives the number of bytes produced, C(f) is the bytes for the function and C(p) is the bytes for the 'housekeeping' code, the preamble and post-amble the compiler adds to the function (creating and destroying the function's stack frame and so on). Now, to call function f requires C(c) bytes. If the function is called n times then the total code size is:
size = C(p) + C(f) + n.C(c)
Now let's inline the function. C(p), the function's 'housekeeping', becomes zero since the function can use the stack frame of the caller. C(c) is also zero since there is now no call opcode. But, f is replicated wherever there was a call. So, the total code size is now:
size = n.C(f)
Now, if C(f) is less than C(c) then the overall executable size will be reduced. But, if C(f) is greater than C(c) then the code size is going to increase. If C(f) and C(c) are similar then you need to consider C(p) as well.
So, how many bytes do C(f) and C(c) produce. Well, the simplest C++ function would be a getter:
int GetValue () { return m_value; }
which would probably generate the four byte instruction:
mov eax,[ecx + offsetof (m_value)]
which is four bytes. A call instruction is five bytes. So, there is an overall size saving. If the function is more complex, say an indexer ("return m_value [index];") or a calculation ("return m_value_a + m_value_b;") then the code will be bigger.
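The two size formulas can be tried out with a few lines of C. This is only a sketch of the arithmetic: the byte counts you pass in are whatever you measure or assume, and the function names are invented for this example.

```c
/* Total code bytes when f stays a real function called n times:
   size = C(p) + C(f) + n.C(c) */
long size_not_inlined(long p, long f, long c, long n)
{
    return p + f + n * c;
}

/* Total code bytes when f is inlined at all n call sites:
   size = n.C(f) */
long size_inlined(long f, long n)
{
    return n * f;
}
```

With the numbers from the getter example (C(f) = 4, C(c) = 5), an assumed C(p) of 3 bytes and n = 10 call sites, the non-inlined total is 3 + 4 + 50 = 57 bytes and the inlined total is 40 bytes. With a 20-byte function body instead, the totals become 73 and 200 bytes, so inlining grows the code.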
For those to stumble upon this question and decide to use the XOR method. You should consider inlining your function or using a macro to avoid the overhead of a function call:
#define swap(a, b) \
do { \
int temp = a; \
a = b; \
b = temp; \
} while(0)
Never understood the hate for macros. When used properly they can make code more compact and readable. I believe most programmers know macros should be used with care, what is important is making it clear that a particular call is a macro and not a function call (all caps). If SWAP(a++, b++); is a consistent source of problems, perhaps programming is not for you.
Admittedly, the xor trick is neat the first 5000 times you see it, but all it really does is save one temporary at the expense of reliability. Looking at the assembly generated above it saves a register but creates dependencies. Also I would not recommend xchg since it has an implied lock prefix.
Eventually we all come to the same place, after countless hours wasted on unproductive optimization and debugging caused by our most clever code - Keep it simple.
#define SWAP(type, a, b) \
do { type t=(a);(a)=(b);(b)=t; } while (0)
void swap(size_t esize, void* a, void* b)
{
char* x = (char*) a;
char* y = (char*) b;
char* z = x + esize;
for ( ; x < z; x++, y++ )
SWAP(char, *x, *y);
}
You are optimizing the wrong thing, both of those should be so fast that you'll have to run them billions of times just to get any measurable difference.
And just about anything will have a much greater effect on your performance. For example, if the values you are swapping are close in memory to the last value you touched, they are likely to be in the processor cache; otherwise you'll have to access the memory, and that is several orders of magnitude slower than any operation you do inside the processor.
Anyway, your bottleneck is much more likely to be an inefficient algorithm or inappropriate data structure (or communication overhead) than how you swap numbers.
The only way to really know is to test it, and the answer may even vary depending on what compiler and platform you are on. Modern compilers are really good at optimizing code these days, and you should never try to outsmart the compiler unless you can prove that your way is really faster.
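A very rough timing harness along those lines might look like this. It is a sketch, not a rigorous benchmark: time_swaps, swap_temp and swap_xor are names chosen for this example, clock() measures CPU time at coarse resolution, and calling through a function pointer will usually stop the compiler from inlining the swap, which itself changes what is being measured.

```c
#include <time.h>

void swap_temp(int *a, int *b)
{
    int temp = *a;
    *a = *b;
    *b = temp;
}

void swap_xor(int *a, int *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* Calls swap(x, y) n times and returns the CPU time used, in seconds. */
double time_swaps(void (*swap)(int *, int *), long n, int *x, int *y)
{
    clock_t start = clock();
    for (long i = 0; i < n; i++)
        swap(x, y);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Run each variant several times with a large n in an optimised build and compare the averages rather than a single run.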
With that said, you'd better have a damn good reason to choose #2 over #1. The code in #1 is far more readable and because of that should always be chosen first. Only switch to #2 if you can prove that you need to make that change, and if you do - comment it to explain what's happening and why you did it the non-obvious way.
As an anecdote, I work with a couple of people that love to optimize prematurely and it makes for really hideous, unmaintainable code. I'm also willing to bet that more often than not they're shooting themselves in the foot because they've hamstrung the ability of the compiler to optimize the code by writing it in a non-straightforward way.
For modern CPU architectures, method 1 will be faster, also with higher readability than method 2.
On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary variable swapping on the target architecture. Check out here for more info.
Edit: Method 2 is a way of in-place swapping (i.e. without using extra variables). To make this question complete, I will add another in-place swapping by using +/-.
void swap(int* a, int* b)
{
if (a != b) // important to handle a/b share the same reference
{
*a = *a+*b;
*b = *a-*b;
*a = *a-*b;
}
}
I would not do it with pointers unless you have to. The compiler cannot optimize them very well because of the possibility of pointer aliasing (although if you can GUARANTEE that the pointers point to non-overlapping locations, GCC at least has extensions to optimize this).
And I would not do it with functions at all, since it's a very simple operation and the function call overhead is significant.
The best way to do it is with macros if raw speed and the possibility of optimization is what you require. In GCC you can use the typeof() builtin to make a flexible version that works on any built-in type.
Something like this:
#define swap(a,b) \
do { \
typeof(a) temp; \
temp = a; \
a = b; \
b = temp; \
} while (0)
...
{
int a, b;
swap(a, b);
unsigned char x, y;
swap(x, y); /* works with any type */
}
With other compilers, or if you require strict compliance with standard C89/99, you would have to make a separate macro for each type.
A good compiler will optimize this as aggressively as possible, given the context, if called with local/global variables as arguments.
All the top rated answers are not actually definitive "facts"... they are people who are speculating!
You can definitively know for a fact which code takes fewer assembly instructions to execute because you can look at the output assembly generated by the compiler and see which executes in fewer assembly instructions!
Here is the c code I compiled with flags "gcc -std=c99 -S -O3 lookingAtAsmOutput.c":
#include <stdio.h>
#include <stdlib.h>
void swap_traditional(int * restrict a, int * restrict b)
{
int temp = *a;
*a = *b;
*b = temp;
}
void swap_xor(int * restrict a, int * restrict b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
int main() {
int a = 5;
int b = 6;
swap_traditional(&a,&b);
swap_xor(&a,&b);
}
ASM output for swap_traditional() takes >>> 11 <<< instructions ( not including "leave", "ret", "size"):
.globl swap_traditional
.type swap_traditional, @function
swap_traditional:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
pushl %ebx
movl (%edx), %ebx
movl (%ecx), %eax
movl %ebx, (%ecx)
movl %eax, (%edx)
popl %ebx
popl %ebp
ret
.size swap_traditional, .-swap_traditional
.p2align 4,,15
ASM output for swap_xor() takes >>> 11 <<< instructions not including "leave" and "ret":
.globl swap_xor
.type swap_xor, @function
swap_xor:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
movl 12(%ebp), %edx
movl (%ecx), %eax
xorl (%edx), %eax
movl %eax, (%ecx)
xorl (%edx), %eax
xorl %eax, (%ecx)
movl %eax, (%edx)
popl %ebp
ret
.size swap_xor, .-swap_xor
.p2align 4,,15
Summary of assembly output:
swap_traditional() takes 11 instructions
swap_xor() takes 11 instructions
Conclusion:
Both methods use the same amount of instructions to execute and therefore are approximately the same speed on this hardware platform.
Lesson learned:
When you have small code snippets, looking at the asm output is helpful to rapidly iterate your code and come up with the fastest ( i.e. least instructions ) code. You can even save time because you don't have to run the program for each code change. You only need to run the code change at the end with a profiler to show that your code changes are faster.
I use this method a lot for heavy DSP code that needs speed.
To answer your question as stated would require digging into the instruction timings of the particular CPU that this code will be running on which therefore require me to make a bunch of assumptions around the state of the caches in the system and the assembly code emitted by the compiler. It would be an interesting and useful exercise from the perspective of understanding how your processor of choice actually works but in the real world the difference will be negligible.
x=x+y-(y=x);
float x; cout << "X:"; cin >> x;
float y; cout << "Y:" ; cin >> y;
cout << "---------------------" << endl;
cout << "X=" << x << ", Y=" << y << endl;
x=x+y-(y=x);
cout << "X=" << x << ", Y=" << y << endl;
In my opinion local optimizations like this should only be considered tightly related to the platform. It makes a huge difference if you are compiling this on a 16 bit uC compiler or on gcc with x64 as target.
If you have a specific target in mind then just try both of them and look at the generated asm code or profile your application with both methods and see which is actually faster on your platform.
If you can use some inline assembler, do the following (pseudo assembler):
PUSH A
A=B
POP B
You will save a lot of parameter passing and stack fix up code etc.
I just placed both swaps (as macros) in a hand-written quicksort I've been playing with. The XOR version was much faster (0.1sec) than the one with the temporary variable (0.6sec). The XOR did however corrupt the data in the array (probably the same address thing Ant mentioned).
As it was a fat pivot quicksort, the XOR version's speed is probably from making large portions of the array the same. I tried a third version of swap which was the easiest to understand and it had the same time as the single temporary version.
acopy=a;
bcopy=b;
a=bcopy;
b=acopy;
[I just put an if statements around each swap, so it won't try to swap with itself, and the XOR now takes the same time as the others (0.6 sec)]
If your compiler supports inline assembler and your target is 32-bit x86 then the XCHG instruction is probably the best way to do this... if you really do care that much about performance.
Here is a method which works with MSVC++:
#include <stdio.h>
#define exchange(a,b) __asm mov eax, a \
__asm xchg eax, b \
__asm mov a, eax
int main(int arg, char** argv)
{
int a = 1, b = 2;
printf("%d %d --> ", a, b);
exchange(a,b)
printf("%d %d\r\n", a, b);
return 0;
}
void swap(int* a, int* b)
{
*a = (*b - *a) + (*b = *a);
}
// My C is a little rusty, so I hope I got the * right :)
The piece of code below will do the same. This snippet is an optimized way of programming, as it doesn't use any 3rd variable.
x = x ^ y;
y = x ^ y;
x = x ^ y;
Another beautiful way.
#define Swap( a, b ) (a)^=(b)^=(a)^=(b)
Advantage
No need of function call and handy.
Drawback:
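This fails when both inputs are the same variable, and it can be used only on integer variables. It also modifies a more than once without an intervening sequence point, which is undefined behavior in C, so the result is not guaranteed even for distinct variables.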
