Profile C Execution

Just for fun, and out of curiosity, I wanted to see which executes faster for an even/odd check: modulus or a bitwise comparison.
I whipped up the following, but I'm not sure that it's behaving correctly, as the difference is so small. I read somewhere online that bitwise should be an order of magnitude faster than modulus checking.
Is it possible that it's getting optimized away? I've only just started tinkering with assembly; otherwise I'd attempt to dissect the executable a bit.
EDIT 3: Here is a working test, thanks in large part to @phonetagger:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
// to reset the global
static const int SEED = 0x2A;
// 5B iterations, each
static const int64_t LOOPS = 5000000000;
int64_t globalVar;
// gotta call something
int64_t doSomething( int64_t input )
{
return 1 + input;
}
int main(int argc, char *argv[])
{
globalVar = SEED;
// mod
clock_t startMod = clock();
for( int64_t i=0; i<LOOPS; ++i )
{
if( ( i % globalVar ) == 0 )
{
globalVar = doSomething(globalVar);
}
}
clock_t endMod = clock();
double modTime = (double)(endMod - startMod) / CLOCKS_PER_SEC;
globalVar = SEED;
// bit
clock_t startBit = clock();
for( int64_t j=0; j<LOOPS; ++j )
{
if( ( j & globalVar ) == 0 )
{
globalVar = doSomething(globalVar);
}
}
clock_t endBit = clock();
double bitTime = (double)(endBit - startBit) / CLOCKS_PER_SEC;
printf("Mod: %lf\n", modTime);
printf("Bit: %lf\n", bitTime);
printf("Dif: %lf\n", ( modTime > bitTime ? modTime-bitTime : bitTime-modTime ));
}
Running 5 billion iterations of each loop, with a global variable keeping the compiler from optimizing the operations away, yields the following:
Mod: 93.099101
Bit: 16.701401
Dif: 76.397700

gcc foo.c -std=c99 -S -O0 (note, I specifically did -O0) for x86 gave me the same assembly for both loops. Operator strength reduction meant that both ifs used an andl to get the job done (which is faster than a modulo on Intel machines):
First Loop:
.L6:
movl 72(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L5
call doNothing
.L5:
addl $1, 72(%esp)
.L4:
movl LOOPS, %eax
cmpl %eax, 72(%esp)
jl .L6
Second Loop:
.L9:
movl 76(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L8
call doNothing
.L8:
addl $1, 76(%esp)
.L7:
movl LOOPS, %eax
cmpl %eax, 76(%esp)
jl .L9
The minuscule difference you see is probably because of the resolution/inaccuracy of clock().

Most compilers will compile both of the following to EXACTLY the same machine instruction(s):
if( ( i % 2 ) == 0 )
if( ( i & 1 ) == 0 )
...even without ANY "optimization" turned on. The reason is that you are MOD-ing and AND-ing with constant values, and a %2 operation is, as any compiler writer should know, functionally equivalent to an &1 operation. In fact, MOD by any power-of-2 has an equivalent AND operation. If you really want to test the difference, you'll need to make the right-hand-side of both operations be variable, and to be absolutely sure the compiler's cleverness isn't thwarting your efforts, you'll need to bury the variables' initializations somewhere that the compiler can't tell at that point what its runtime value will be; i.e. you'll need to pass the values into a GLOBALLY-DECLARED (i.e. not 'static') test function as parameters, in which case the compiler can't trace back to their definition & substitute the variables with constants, because theoretically any external caller could pass any values in for those parameters. Alternatively, you could leave the code in main() and define the variables globally, in which case the compiler can't substitute them with constants because it can't know for sure that another function may have altered the value of the global variables.
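To make the equivalence concrete, here is a minimal sketch (the helper names are made up for illustration and are not from the question). The mask form is shown for unsigned operands. For negative signed values, % in C truncates toward zero, so i % 2 can be -1 while i & 1 is 1 on two's complement machines, but comparing either result against 0 still gives the same even/odd answer.

```c
#include <stdint.h>

/* Remainder modulo 2^k computed with the % operator (0 <= k <= 31). */
uint32_t mod_pow2(uint32_t x, unsigned k)
{
    return x % ((uint32_t)1 << k);
}

/* Same remainder computed with a mask of the low k bits (0 <= k <= 31). */
uint32_t and_pow2(uint32_t x, unsigned k)
{
    return x & (((uint32_t)1 << k) - 1u);
}

/* The even/odd test from the question, written both ways.
   Assumes two's complement for the & form on negative values. */
int is_even_mod(int32_t i) { return (i % 2) == 0; }
int is_even_and(int32_t i) { return (i & 1) == 0; }
```

Because the right-hand side here is a runtime parameter k rather than a literal 2, a compiler cannot fold mod_pow2 into the mask form unless it can prove the divisor is a power of two, which is the situation the paragraph above describes.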
Incidentally, this same issue exists for divide operations. Divisions by constant powers of two can be substituted with an equivalent right-shift (>>) operation. The same trick works for multiplication (<<), but the benefits are smaller (or nonexistent) for multiplications. True division operations just take a long time in hardware. Although significant improvements have been made in most modern processors compared with even 15 years ago, division operations can still take maybe 80 clock cycles, while a >> operation takes only a single cycle. You're not going to see an "order of magnitude" improvement using bitwise tricks on modern processors, but most compilers will still use those tricks because there is still some noticeable improvement.
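A minimal sketch of the shift substitution, with hypothetical helper names. The unsigned case is a plain shift. The signed case needs a small correction, because C division truncates toward zero while an arithmetic right shift rounds toward negative infinity. The signed version assumes that >> on a negative int32_t is an arithmetic shift, which the C standard leaves implementation-defined but mainstream compilers provide.

```c
#include <stdint.h>

/* x / 2^k for unsigned x, 0 <= k <= 31. */
uint32_t udiv_pow2(uint32_t x, unsigned k)
{
    return x >> k;
}

/* x / 2^k for signed x, 0 <= k <= 30, matching C's truncating division.
   Assumption: >> on a negative value is an arithmetic (sign-extending) shift. */
int32_t sdiv_pow2(int32_t x, unsigned k)
{
    /* bias is 2^k - 1 when x is negative and 0 otherwise */
    int32_t bias = (x >> 31) & (int32_t)(((uint32_t)1 << k) - 1u);
    return (x + bias) >> k;
}
```

The extra add and mask in the signed version are typical of what compilers emit for a signed division by a constant power of two, which is part of why the saving is smaller than the raw "shift versus divide" comparison suggests.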
EDIT: On some embedded processors (and, unbelievable though it was, the original Sparc desktop/workstation processor versions before v8), there isn't even a divide instruction at all. All true divide & mod operations on such processors must be performed entirely in software, which can be a monstrously expensive operation. In that sort of environment, you surely would see an order of magnitude difference.
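To give a feel for why software division is expensive, here is a sketch of a textbook shift-and-subtract (restoring) division for 32-bit unsigned values. This is an illustration only: the function name is made up, and real runtime-library routines are usually more elaborate. It runs 32 loop iterations per division, each with a shift, a compare and a conditional subtract.

```c
#include <stdint.h>

/* Shift-and-subtract division. Returns n / d and stores n % d in *rem.
   The caller must ensure d != 0. */
uint32_t soft_udivmod(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64 bits wide so that r << 1 cannot overflow */

    for (int bit = 31; bit >= 0; --bit) {
        r = (r << 1) | ((n >> bit) & 1u);   /* bring down the next bit of n */
        if (r >= d) {                       /* does the divisor fit? */
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```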

Bitwise checking takes only a single machine instruction ("and ...,0x01"); that's pretty hard to beat.
Modulo check will absolutely be slower if you have a dumb compiler that actually computes modulo by taking remainders (sometimes including a subroutine call to modulo routine!). Smart compilers know about the modulo function and generate code for it directly; if they have any decent optimization they know that "modulo(x,2)" can be implemented with the same AND trick above.
Our PARLANSE compiler does this as a matter of course. I'd be surprised if widely available C and C++ compilers don't do this too.
With such "good" compilers, it won't matter which way you write odd/even (or even "is power of two") checks; it will be pretty damn fast.
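For the "is power of two" check mentioned above, a sketch of the usual bitwise form next to a deliberately naive modulo-based loop (both names are hypothetical). The two agree on every input; only the bitwise form is a handful of instructions regardless of the compiler.

```c
#include <stdint.h>

/* A power of two has exactly one bit set, so clearing the lowest set bit
   with x & (x - 1) must leave zero. Zero itself is excluded. */
int is_pow2_and(uint32_t x)
{
    return x != 0 && (x & (x - 1u)) == 0;
}

/* Reference version: keep halving while the value is even. */
int is_pow2_mod(uint32_t x)
{
    if (x == 0)
        return 0;
    while (x % 2u == 0)
        x /= 2u;
    return x == 1u;
}
```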

Related

Are there advantages of using `setp` instead of `setb`?

When compiling
bool is_nan(double x){
return x!=x;
}
both clang and gcc utilize the parity-flag PF:
_Z6is_nand: # #_Z6is_nand
ucomisd %xmm0, %xmm0
setp %al
retq
However, the two possible outcomes of the comparison are:
      NaN   Not-NaN
ZF     1       1
PF     1       0
CF     1       0
That means it would also be possible to use the CF flag as an alternative, i.e. setb instead of setp.
Are there any advantages of using setp over setb, or is it a coincidence that both compilers use the parity flag?
PS: This question is a follow-up to Understanding compilation result for std::isnan

The advantage is that the compiler emits this code naturally without needing a special case to recognize x!=x and transform it into !(x >= x).
Without -ffast-math, x != y has to check PF to see if the comparison is ordered, then check ZF for equality. In the special case where both inputs are the same, presumably normal optimization mechanisms like CSE can get rid of the ZF check, leaving only PF.
In this case, setb wouldn't be worse, but it has absolutely no advantage, and it's more confusing for humans, and it probably needs more special-case code for the compiler to emit it.
Your suggested transformation would only be useful when using the result with a special instruction that uses CF, like adc. For example, nan_counter += arr[i] != arr[i]. That auto-vectorizes trivially (cmp_unord_ps / psubd), but scalar cleanup (or a scalar use-case over non-array inputs) could use ucomiss / adc $0, %eax instead of ucomiss / setp / add.
That saves an instruction, and a uop on Broadwell and later, and on AMD. (Earlier Intel CPUs have 2 uop adc, unless they special-case $0, because they don't support 3-input uops)

Why would < be slower than <=? [C]

Naturally, I've assumed the < and <= operators run at the same speed (per Jonathon Reinhart's logic, here). Recently, I decided to test that assumption, and I was a little surprised by my results.
I know that, for most modern hardware, this question is purely academic, so I had to write test programs that looped about 1 billion times (to get any minuscule difference to add up to more acceptable levels). The programs were as basic as possible (to cut out all possible sources of interference).
lt.c:
int main() {
for (int i = 0; i < 1000000001; i++);
return 0;
}
le.c:
int main() {
for (int i = 0; i <= 1000000000; i++);
return 0;
}
They were compiled and run on a Linux VirtualBox 3.19.0-18-generic #18-Ubuntu x86_64 installation, using GCC with the -std=c11 flag set.
The average time for lt.c's binary was:
real 0m2.404s
user 0m2.389s
sys 0m0.000s
The average time for le.c was:
real 0m2.397s
user 0m2.384s
sys 0m0.000s
The difference is small, but I couldn't get it to go away or reverse no matter how many times I ran the binaries.
I made the comparison value in the for-loop of lt.c one larger than le.c (so they'd both loop the same number of times). Was this somehow a mistake?
According to the answer in Is < faster than <=?, < compiles to jge and <= compiles to jg. That was dealing with if statements rather than a for-loop, but could this still be the reason? Could the execution of jge take slightly longer than jg? (I think this would be ironic since that would mean moving from C to ASM inverts which one is the more complicated instruction, with lt in C translating to gte in ASM and lte to gt.)
Or, is this just so hardware specific that different x86 lines or individual chips may consistently show the reverse trend, the same trend, or no difference?

There were a few requests in the comments on my question to include the assembly being generated for me by GCC. After getting the compiler to pop out the assembly versions of each file, I checked it.
Result:
It turns out the default optimization setting turned both for-loops into the same assembly. Both files were identical in assembly-form, actually. (diff confirmed this.)
Possible reason for the previously observed time difference:
It seems the order in which I ran the binaries was the cause for the run time difference.
On a given runthrough, the programs generally were executed quicker with each successive execution, before plateauing after about 3 executions.
I alternated back and forth between time ./lt and time ./le, so the one run first would have a bias towards extra time in its average.
I usually ran lt first.
I did several separate runthroughs (increasing the averaged bias).
Code excerpt:
movl $0, -4(%rbp)
jmp .L2
.L3:
addl $1, -4(%rbp)
.L2:
cmpl $1000000000, -4(%rbp)
jle .L3
movl $0, %eax
pop %rbp
... * covers face * ...carry on....
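One way to reduce this kind of run-order bias is to discard a few warm-up runs before timing and then average several timed runs. The sketch below does that inside a single process, which is an assumption on my part (the measurements above timed separate processes with time), and the function names and the reduced loop count are illustrative only.

```c
#include <time.h>

/* Runs fn `warmups` times untimed, then returns the mean CPU time in
   seconds over `reps` timed runs. Assumes reps > 0. */
double mean_time_after_warmup(void (*fn)(void), int warmups, int reps)
{
    for (int i = 0; i < warmups; ++i)
        fn();

    double total = 0.0;
    for (int i = 0; i < reps; ++i) {
        clock_t start = clock();
        fn();
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / reps;
}

/* Example workload shaped like le.c, with a much smaller bound.
   volatile keeps the compiler from deleting the empty loop. */
void le_loop(void)
{
    for (volatile int i = 0; i <= 1000000; i++)
        ;
}
```

A call such as mean_time_after_warmup(le_loop, 3, 10) skips three runs, matching the observation that times plateaued after about three executions, and then averages ten.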

Let's speak in assembly. (depends on the architecture of course)
When comparing you'll use cmp or test instruction and then
- when you use < the equivalent instruction would be jl, which jumps if SF and OF are not the same (two special flags called sign and overflow)
- when you use <= the equivalent instruction is jle, which jumps if SF != OF or if ZF == 1 (zero flag)
and so on, more here
but honestly it's not even the whole cycle so...I think the difference is unmeasurable under normal circumstances
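The flag conditions can be modeled in C to check that they really do correspond to < and <=. This is only a sketch of the flag arithmetic for 32-bit operands, with made-up names; it says nothing about timing.

```c
#include <stdint.h>

/* The flags that x86 `cmp a, b` derives from the subtraction a - b. */
struct flags { int zf, sf, of; };

struct flags cmp32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t res = ua - ub;                        /* wraps like the hardware */
    struct flags f;
    f.zf = (res == 0);                             /* result is zero */
    f.sf = (int)(res >> 31);                       /* sign bit of the result */
    f.of = (int)(((ua ^ ub) & (ua ^ res)) >> 31);  /* signed overflow of a - b */
    return f;
}

/* jl is taken when SF != OF; jle is taken when ZF is set or SF != OF. */
int jl_taken(struct flags f)  { return f.sf != f.of; }
int jle_taken(struct flags f) { return f.zf || f.sf != f.of; }
```

Both conditions are evaluated from flags that the same cmp has already produced, which fits the point above that any difference between the two jumps is not measurable in normal code.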

number of clock cycles in if statement in c program?

Sorry, I was not specific about the problem. I am trying to use the intrinsic bit-parallelism of a system. A small part of the code is as follows:
int d;
char ch1;
char ch2;
cin>>ch1>>ch2;
if((d&1) == 0) {
//heavy computation
}
if(ch1 == ch2){
//heavy computation
}
The first if body executes when the LSB of d is clear, since the condition is (d&1) == 0.
How many clock cycles do the two 'if' conditions require to execute?
Please include the clock cycles required to convert the variable values into binary form.

On an i386 architecture with gcc, the assembly code produced for the above conditions would be:
for condition 1:
subl $16, %esp
movb $97, -2(%ebp)
movb $98, -1(%ebp)
movl -12(%ebp), %eax
andl $1, %eax
testl %eax, %eax
jne .L2
for condition 2:
movzbl -2(%ebp), %eax
cmpb -1(%ebp), %al
jne .L4
So for simplicity, we treat the i386 as a MIPS-like machine with a RISC core that follows the following table:
The number of clock cycles for the above statements would be 18.
Actually, when you compile with "gcc -S file.c", the assembly for the two conditions may not be produced, because the compiler may optimize away the empty conditions (ineffective conditions or dead code). Put some useful statements inside the conditions and compile again to get the instructions shown above.

With any good compiler, the if statements shown in this question would not consume any processor cycles in an executing program. This is because the compiler would recognize that neither of the if statements does anything, regardless of whether the condition is true or false, so they would be removed during optimization.
In general, optimization can dramatically transform a program. Even if the if statements had statements in their then-clauses, the compiler could determine at compile-time that ch1 does not equal ch2, so there is no need to perform the comparison during program execution.
Beyond that, if a condition is tested during program execution, there is often not a clear correlation between evaluating the test and how many processor cycles it takes. Modern processors are quite complicated, and a test and branch might be executed speculatively in advance while other instructions are also executing, so that the if statement does not cost the program any time at all. On the other hand, executing a branch might cause the processor to discard many instructions it had been preparing to execute and to reload new instructions from the new branch destination, thus costing the program many cycles.
In fact, both of these effects might occur for the same if statement in the same program. When the if statement is used in a loop with many executions, the processor may cache information about the branch decision and use that to speed up execution. At another time, when the if statement happens to be executed just once (because the loop conditions are different), the cached information may mislead the processor and cost cycles.

You can probably compile your complete code and disassemble it using GDB. Once it is disassembled, find the number and type (Load (5 cycles), Store (4 cycles), Branch (3 cycles), Jump (3 cycles), etc.) of the instructions your statements took. The sum of those cycles gives the clock cycles consumed. However, this depends on which processor you are on.

Looking at your question, I think you need to count the instructions executed for your statement and then add up the cycles for every instruction in your if-else.
Code:
if(x == 0)
{
x = 1;
}
x++;
This will turn into something like the following instructions:
mov eax, $x
cmp eax, 0
jne end
mov eax, 1
end:
inc eax
mov $x, eax
so the first if statement will consume 2 CPU cycles
Adding to your particular code
cin>>ch1>>ch2;
if((d&1) == 0) {
//heavy computation
}
if(ch1 == ch2){
//heavy computation
}
You need to get the instructions generated for those two if statements, from which you can calculate the cycles.
You also need to put something inside the body of each if statement ( if(){body} ); otherwise modern compilers are smart enough to remove your code as dead code.
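One hedged way to give the bodies an effect the compiler must keep is to write to a volatile object. The sketch below reuses the two conditions from the question. The names sink and run_checks are made up, and the stores merely stand in for the "heavy computation".

```c
/* A store to a volatile object is an observable side effect, so the
   compiler has to keep the branches that lead to it. */
volatile int sink;

/* Returns how many of the two conditions from the question were true. */
int run_checks(int d, char ch1, char ch2)
{
    int taken = 0;

    if ((d & 1) == 0) {   /* true when the low bit of d is clear */
        sink = d;         /* stand-in for the heavy computation */
        ++taken;
    }
    if (ch1 == ch2) {
        sink = ch1;       /* stand-in for the heavy computation */
        ++taken;
    }
    return taken;
}
```

Compiling this with gcc -S should leave both comparisons visible in the listing, as long as the inputs are not compile-time constants at the call site.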

It depends on your "IF".
Take this to the simplest case that you want to compare two bytes, you probably only need 2 clock cycles in an instruction, ie. 1111 0001 which means (1st) activating ALU-CMP and setting data from R0 to TMP; (2nd) carrying R1 onto the bus and setting the output to ACC.
Otherwise, you will need at least other 3 clocks for fetching, 1 clock for checking I/O interrupt, and 1 final clock to reset the instruction register.
Therefore, on the circuit scale, you only need 7 clock cycles to execute an "IF" for 2 bytes. However, you would never write an "IF" just to compare two numbers (represented by two bytes), would you? 😅

Divide without losing remainder

In C, is it possible to divide a dividend by a constant and get the result and the remainder at the same time?
I want to avoid execution of 2 division instructions, as in this example:
val=num / 10;
mod=num % 10;

I wouldn't worry about the instruction count, because the x86 instruction set provides an idivl instruction that computes the quotient and remainder in one instruction. Any decent compiler will make use of this instruction. The documentation here http://programminggroundup.blogspot.com/2007/01/appendix-b-common-x86-instructions.html describes the instruction as follows:
Performs unsigned division. Divides the contents of the double-word
contained in the combined %edx:%eax registers by the value in the
register or memory location specified. The %eax register contains the
resulting quotient, and the %edx register contains the resulting
remainder. If the quotient is too large to fit in %eax, it triggers a
type 0 interrupt.
For example, compiling this sample program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
int x = 39;
int divisor = 1;
int div = 0;
int rem = 0;
printf("Enter the divisor: ");
scanf("%d", &divisor);
div = x/divisor;
rem = x%divisor;
printf("div = %d, rem = %d\n", div, rem);
}
Compiling with gcc -S -O2 (-S makes gcc stop after compilation and write out the assembly listing) shows that the division and mod in the following lines
div = x/divisor;
rem = x%divisor;
is effectively reduced to the following instruction:
idivl 28(%esp)
As you can see, there's one instruction to perform both the division and the mod calculation. The idivl instruction remains even if the mod calculation in the C program is removed. After the idivl there are mov instructions:
movl $.LC2, (%esp)
movl %edx, 8(%esp)
movl %eax, 4(%esp)
call printf
These instructions copy the quotient and the remainder onto the stack for the call to printf.
Update
Interestingly the function div doesn't do anything special other than wrap the / and % operators in a function call. Therefore, from a performance perspective, it will not improve the performance by replacing the lines
val=num / 10;
mod=num % 10;
with a single call to div.

There's div():
div_t result = div(num, 10);
// quotient is result.quot
// remainder is result.rem
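A slightly fuller sketch of the same idea, next to the plain-operator version for comparison. The wrapper names are hypothetical; div and div_t come from <stdlib.h>.

```c
#include <stdlib.h>

/* Splits num into quotient and remainder by 10 using the standard div(). */
div_t split_by_ten(int num)
{
    return div(num, 10);
}

/* Same result with the plain operators. Compilers typically compute the
   pair together (for a constant divisor, often with a multiply-based
   sequence rather than a divide instruction). */
div_t split_by_ten_ops(int num)
{
    div_t r;
    r.quot = num / 10;
    r.rem  = num % 10;
    return r;
}
```

Since C99 both forms truncate toward zero, so they agree for negative inputs as well: -39 gives a quotient of -3 and a remainder of -9 either way.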

Don't waste your time with div(). Like Nemo said, the compiler will easily optimize a division followed by a modulus operation into one. Write code that makes optimal sense, and let the computer remove the cruft.

You could always use the div function.

What is the fastest way to swap values in C?

I want to swap two integers, and I want to know which of these two implementations will be faster:
The obvious way with a temp variable:
void swap(int* a, int* b)
{
int temp = *a;
*a = *b;
*b = temp;
}
Or the xor version that I'm sure most people have seen:
void swap(int* a, int* b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
It seems like the first uses an extra register, but the second one is doing three loads and stores while the first only does two of each. Can someone tell me which is faster and why? The why being more important.

Number 2 is often quoted as being the "clever" way of doing it. It is in fact most likely slower as it obscures the explicit aim of the programmer - swapping two variables. This means that a compiler can't optimize it to use the actual assembler ops to swap. It also assumes the ability to do a bitwise xor on the objects.
Stick to number 1, it's the most generic and most understandable swap and can be easily templated/genericized.
This wikipedia section explains the issues quite well:
http://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice

The XOR method fails if a and b point to the same address. The first XOR will clear all of the bits at the memory address pointed to by both variables, so once the function returns, *a and *b are both 0, regardless of the initial value.
More info on the Wiki page:
XOR swap algorithm
Although it's not likely that this issue would come up, I'd always prefer to use the method that's guaranteed to work, not the clever method that fails at unexpected moments.
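The failure is easy to reproduce. The sketch below puts the XOR swap from the question next to a guarded variant; the function names are made up for illustration.

```c
/* XOR swap exactly as in the question: the first XOR zeroes the value
   when a and b point to the same object. */
void swap_xor_unsafe(int *a, int *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* Guarded variant: do nothing when both pointers name the same object. */
void swap_xor_guarded(int *a, int *b)
{
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```

Calling swap_xor_unsafe(&v, &v) leaves v at 0 whatever it held before, while the guarded version leaves it untouched. The guard costs a compare and a branch, which eats into whatever the XOR trick was supposed to save.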

On a modern processor, you could use the following when sorting large arrays and see no difference in speed:
void swap (int *a, int *b)
{
for (int i = 1 ; i ; i <<= 1)
{
if ((*a & i) != (*b & i))
{
*a ^= i;
*b ^= i;
}
}
}
The really important part of your question is the 'why?' part. Now, going back 20 years to the 8086 days, the above would have been a real performance killer, but on the latest Pentium it would be a match speed wise to the two you posted.
The reason is purely down to memory and has nothing to do with the CPU.
CPU speeds compared to memory speeds have risen astronomically. Accessing memory has become the major bottleneck in application performance. All the swap algorithms will be spending most of their time waiting for data to be fetched from memory. Modern systems can have up to 5 levels of memory:
Cache Level 1 - runs at the same speed as the CPU, has negligible access time, but is small
Cache Level 2 - runs a bit slower than L1 but is larger and has a bigger overhead to access (usually, data needs to be moved to L1 first)
Cache Level 3 - (not always present) Often external to the CPU, slower and bigger than L2
RAM - the main system memory, usually implements a pipeline so there's latency in read requests (CPU requests data, message sent to RAM, RAM gets data, RAM sends data to CPU)
Hard Disk - when there's not enough RAM, data is paged to HD which is really slow, not really under CPU control as such.
Sorting algorithms will make memory access worse since they usually access the memory in a very unordered way, thus incurring the inefficient overhead of fetching data from L2, RAM or HD.
So, optimising the swap method is really pointless - if it's only called a few times then any inefficiency is hidden due to the small number of calls, if it's called a lot then any inefficiency is hidden due to the number of cache misses (where the CPU needs to get data from L2 (1's of cycles), L3 (10's of cycles), RAM (100's of cycles), HD (!)).
What you really need to do is look at the algorithm that calls the swap method. This is not a trivial exercise. Although the Big-O notation is useful, an O(n) can be significantly faster than a O(log n) for small n. (I'm sure there's a CodingHorror article about this.) Also, many algorithms have degenerate cases where the code does more than is necessary (using qsort on nearly ordered data could be slower than a bubble sort with an early-out check). So, you need to analyse your algorithm and the data it's using.
Which leads to how to analyse the code. Profilers are useful but you do need to know how to interpret the results. Never use a single run to gather results, always average results over many executions - because your test application could have been paged to hard disk by the OS halfway through. Always profile release, optimised builds, profiling debug code is pointless.
As to the original question - which is faster? - it's like trying to figure out if a Ferrari is faster than a Lamborghini by looking at the size and shape of the wing mirror.

The first is faster because bitwise operations such as xor are usually very hard to visualize for the reader.
Faster to understand of course, which is the most important part ;)

Regarding @Harry:
Never implement functions as macros for the following reasons:
Type safety. There is none. The following only generates a warning when compiling but fails at run time:
float a=1.5f,b=4.2f;
swap (a,b);
A templated function will always be of the correct type (and why aren't you treating warnings as errors?).
EDIT: As there's no templates in C, you need to write a separate swap for each type or use some hacky memory access.
It's a text substitution. The following fails at run time (this time, without compiler warnings):
int a=1,temp=3;
swap (a,temp);
It's not a function. So, it can't be used as an argument to something like qsort.
Compilers are clever. I mean really clever. Made by really clever people. They can do inlining of functions. Even at link time (which is even more clever). Don't forget that inlining increases code size. Big code means more chance of cache miss when fetching instructions, which means slower code.
Side effects. Macros have side effects! Consider:
int &f1 ();
int &f2 ();
void func ()
{
swap (f1 (), f2 ());
}
Here, f1 and f2 will be called twice.
EDIT: A C version with nasty side effects:
int a[10], b[10], i=0, j=0;
swap (a[i++], b[j++]);
Macros: Just say no!
EDIT: This is why I prefer to define macro names in UPPERCASE so that they stand out in the code as a warning to use with care.
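The text-substitution problem with a caller variable named temp can be shown in plain C. In this sketch SWAP_NAIVE is a hypothetical macro written the obvious way, and the first function reports whether the caller's values were actually exchanged.

```c
/* Naive swap macro: its internal name `temp` can collide with a caller's. */
#define SWAP_NAIVE(a, b) do { int temp = a; a = b; b = temp; } while (0)

/* Returns 1 if the swap worked, 0 if the values were left unchanged. */
int naive_swap_with_temp_works(void)
{
    int a = 1, temp = 3;
    /* Expands to: int temp = a; a = temp; temp = temp;
       so every later use of `temp` refers to the macro's inner variable. */
    SWAP_NAIVE(a, temp);
    return a == 3 && temp == 1;
}

/* The function version has no name to collide with. */
void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}
```

Here naive_swap_with_temp_works() returns 0: the code compiles, both variables keep their original values, and nothing at the call site hints at why.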
EDIT2: To answer Leahn Novash's comment:
Suppose we have a non-inlined function, f, that is converted by the compiler into a sequence of bytes then we can define the number of bytes thus:
bytes = C(p) + C(f)
where C() gives the number of bytes produced, C(f) is the bytes for the function and C(p) is the bytes for the 'housekeeping' code, the preamble and post-amble the compiler adds to the function (creating and destroying the function's stack frame and so on). Now, to call function f requires C(c) bytes. If the function is called n times then the total code size is:
size = C(p) + C(f) + n.C(c)
Now let's inline the function. C(p), the function's 'housekeeping', becomes zero since the function can use the stack frame of the caller. C(c) is also zero since there is now no call opcode. But, f is replicated wherever there was a call. So, the total code size is now:
size = n.C(f)
Now, if C(f) is less than C(c) then the overall executable size will be reduced. But, if C(f) is greater than C(c) then the code size is going to increase. If C(f) and C(c) are similar then you need to consider C(p) as well.
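The two size formulas are simple enough to evaluate directly. In this sketch the parameters cp, cf and cc stand for C(p), C(f) and C(c) in bytes, and any byte counts you pass in are your own estimates, not measurements.

```c
#include <stddef.h>

/* One out-of-line copy plus n call sites: C(p) + C(f) + n * C(c). */
size_t size_called(size_t cp, size_t cf, size_t cc, size_t n)
{
    return cp + cf + n * cc;
}

/* Body replicated at all n call sites: n * C(f). */
size_t size_inlined(size_t cf, size_t n)
{
    return n * cf;
}
```

For example, with an assumed 6-byte prologue/epilogue, a 5-byte call and 10 call sites, a 4-byte body gives 60 bytes when called and 40 when inlined, while a 40-byte body gives 96 bytes when called and 400 when inlined.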
So, how many bytes do C(f) and C(c) produce. Well, the simplest C++ function would be a getter:
int GetValue () { return m_value; }
which would probably generate the instruction:
mov eax,[ecx + offsetof (m_value)]
which is four bytes. A call instruction is five bytes. So, there is an overall size saving. If the function is more complex, say an indexer ("return m_value [index];") or a calculation ("return m_value_a + m_value_b;") then the code will be bigger.

For those who stumble upon this question and decide to use the XOR method: you should consider inlining your function or using a macro to avoid the overhead of a function call:
#define swap(a, b) \
do { \
int temp = a; \
a = b; \
b = temp; \
} while(0)

Never understood the hate for macros. When used properly they can make code more compact and readable. I believe most programmers know macros should be used with care, what is important is making it clear that a particular call is a macro and not a function call (all caps). If SWAP(a++, b++); is a consistent source of problems, perhaps programming is not for you.
Admittedly, the xor trick is neat the first 5000 times you see it, but all it really does is save one temporary at the expense of reliability. Looking at the assembly generated above, it saves a register but creates dependencies. Also, I would not recommend xchg, since it has an implied lock prefix when one of its operands is in memory.
Eventually we all come to the same place, after countless hours wasted on unproductive optimization and debugging caused by our most clever code - Keep it simple.
#define SWAP(type, a, b) \
do { type t=(a);(a)=(b);(b)=t; } while (0)
void swap(size_t esize, void* a, void* b)
{
char* x = (char*) a;
char* y = (char*) b;
char* z = x + esize;
for ( ; x < z; x++, y++ )
SWAP(char, *x, *y);
}

You are optimizing the wrong thing. Both of those should be so fast that you'll have to run them billions of times just to get any measurable difference.
And just about anything else will have a much greater effect on your performance. For example, if the values you are swapping are close in memory to the last value you touched, they are likely to be in the processor cache; otherwise you'll have to access memory, and that is several orders of magnitude slower than any operation you do inside the processor.
Anyway, your bottleneck is much more likely to be an inefficient algorithm or inappropriate data structure (or communication overhead) than how you swap numbers.

The only way to really know is to test it, and the answer may even vary depending on what compiler and platform you are on. Modern compilers are really good at optimizing code these days, and you should never try to outsmart the compiler unless you can prove that your way is really faster.
With that said, you'd better have a damn good reason to choose #2 over #1. The code in #1 is far more readable and because of that should always be chosen first. Only switch to #2 if you can prove that you need to make that change, and if you do - comment it to explain what's happening and why you did it the non-obvious way.
As an anecdote, I work with a couple of people that love to optimize prematurely and it makes for really hideous, unmaintainable code. I'm also willing to bet that more often than not they're shooting themselves in the foot because they've hamstrung the ability of the compiler to optimize the code by writing it in a non-straightforward way.

For modern CPU architectures, method 1 will be faster, also with higher readability than method 2.
On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary variable swapping on the target architecture. Check out here for more info.
Edit: Method 2 is a way of in-place swapping (i.e. without using extra variables). To make this question complete, I will add another in-place swapping by using +/-.
void swap(int* a, int* b)
{
if (a != b) // important to handle a/b share the same reference
{
*a = *a+*b;
*b = *a-*b;
*a = *a-*b;
}
}

I would not do it with pointers unless you have to. The compiler cannot optimize them very well because of the possibility of pointer aliasing (although if you can GUARANTEE that the pointers point to non-overlapping locations, GCC at least has extensions to optimize this).
And I would not do it with functions at all, since it's a very simple operation and the function call overhead is significant.
The best way to do it is with macros if raw speed and the possibility of optimization is what you require. In GCC you can use the typeof() builtin to make a flexible version that works on any built-in type.
Something like this:
#define swap(a,b) \
do { \
typeof(a) temp; \
temp = a; \
a = b; \
b = temp; \
} while (0)
...
{
int a, b;
swap(a, b);
unsigned char x, y;
swap(x, y); /* works with any type */
}
With other compilers, or if you require strict compliance with standard C89/99, you would have to make a separate macro for each type.
A good compiler will optimize this as aggressively as possible, given the context, if called with local/global variables as arguments.

All the top-rated answers are not actually definitive "facts"... they are people who are speculating!
You can know for a fact which code takes fewer assembly instructions to execute, because you can look at the assembly generated by the compiler and see which executes in fewer instructions!
Here is the c code I compiled with flags "gcc -std=c99 -S -O3 lookingAtAsmOutput.c":
#include <stdio.h>
#include <stdlib.h>
void swap_traditional(int * restrict a, int * restrict b)
{
int temp = *a;
*a = *b;
*b = temp;
}
void swap_xor(int * restrict a, int * restrict b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
int main() {
int a = 5;
int b = 6;
swap_traditional(&a,&b);
swap_xor(&a,&b);
}
ASM output for swap_traditional() takes >>> 11 <<< instructions ( not including "leave", "ret", "size"):
.globl swap_traditional
.type swap_traditional, @function
swap_traditional:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
pushl %ebx
movl (%edx), %ebx
movl (%ecx), %eax
movl %ebx, (%ecx)
movl %eax, (%edx)
popl %ebx
popl %ebp
ret
.size swap_traditional, .-swap_traditional
.p2align 4,,15
ASM output for swap_xor() takes >>> 11 <<< instructions not including "leave" and "ret":
.globl swap_xor
.type swap_xor, @function
swap_xor:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
movl 12(%ebp), %edx
movl (%ecx), %eax
xorl (%edx), %eax
movl %eax, (%ecx)
xorl (%edx), %eax
xorl %eax, (%ecx)
movl %eax, (%edx)
popl %ebp
ret
.size swap_xor, .-swap_xor
.p2align 4,,15
Summary of assembly output:
swap_traditional() takes 11 instructions
swap_xor() takes 11 instructions
Conclusion:
Both methods use the same number of instructions to execute and therefore are approximately the same speed on this hardware platform.
Lesson learned:
When you have small code snippets, looking at the asm output is helpful to rapidly iterate your code and come up with the fastest ( i.e. least instructions ) code. You can even save time, because you don't have to run the program for each code change. You only need to run the final version with a profiler to show that your code changes are faster.
I use this method a lot for heavy DSP code that needs speed.

To answer your question as stated would require digging into the instruction timings of the particular CPU that this code will be running on, which would in turn require me to make a bunch of assumptions about the state of the caches in the system and the assembly code emitted by the compiler. It would be an interesting and useful exercise from the perspective of understanding how your processor of choice actually works, but in the real world the difference will be negligible.

x=x+y-(y=x);
float x; cout << "X:"; cin >> x;
float y; cout << "Y:" ; cin >> y;
cout << "---------------------" << endl;
cout << "X=" << x << ", Y=" << y << endl;
x=x+y-(y=x);
cout << "X=" << x << ", Y=" << y << endl;

In my opinion, local optimizations like this should only be considered in close relation to the platform. It makes a huge difference whether you are compiling this with a 16-bit uC compiler or with gcc targeting x64.
If you have a specific target in mind, then just try both of them and look at the generated asm code, or profile your application with both methods and see which is actually faster on your platform.

If you can use some inline assembler, do the following (pseudo assembler):
PUSH A
A=B
POP B
You will save a lot of parameter passing and stack fix up code etc.

I just placed both swaps (as macros) in a hand-written quicksort I've been playing with. The XOR version was much faster (0.1 sec) than the one with the temporary variable (0.6 sec). The XOR did however corrupt the data in the array (probably the same address thing Ant mentioned).
As it was a fat pivot quicksort, the XOR version's speed is probably from making large portions of the array the same. I tried a third version of swap which was the easiest to understand and it had the same time as the single temporary version.
acopy=a;
bcopy=b;
a=bcopy;
b=acopy;
[I just put an if statement around each swap, so it won't try to swap with itself, and the XOR now takes the same time as the others (0.6 sec)]

If your compiler supports inline assembler and your target is 32-bit x86 then the XCHG instruction is probably the best way to do this... if you really do care that much about performance.
Here is a method which works with MSVC++:
#include <stdio.h>
#define exchange(a,b) __asm mov eax, a \
__asm xchg eax, b \
__asm mov a, eax
int main(int arg, char** argv)
{
int a = 1, b = 2;
printf("%d %d --> ", a, b);
exchange(a,b)
printf("%d %d\r\n", a, b);
return 0;
}

void swap(int* a, int* b)
{
*a = (*b - *a) + (*b = *a);
}
// My C is a little rusty, so I hope I got the * right :)

The piece of code below will do the same. This snippet is an optimized way of programming, as it doesn't use a third variable.
x = x ^ y;
y = x ^ y;
x = x ^ y;

Another beautiful way.
#define Swap( a, b ) (a)^=(b)^=(a)^=(b)
Advantage
No need of function call and handy.
Drawback:
This fails when both inputs are the same variable. It can be used only on integer variables.
