What is the fastest way to swap values in C?

I want to swap two integers, and I want to know which of these two implementations will be faster:
The obvious way with a temp variable:
void swap(int* a, int* b)
{
int temp = *a;
*a = *b;
*b = temp;
}
Or the xor version that I'm sure most people have seen:
void swap(int* a, int* b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
It seems like the first uses an extra register, but the second one is doing three loads and stores while the first only does two of each. Can someone tell me which is faster and why? The why being more important.

Number 2 is often quoted as being the "clever" way of doing it. It is in fact most likely slower as it obscures the explicit aim of the programmer - swapping two variables. This means that a compiler can't optimize it to use the actual assembler ops to swap. It also assumes the ability to do a bitwise xor on the objects.
Stick to number 1, it's the most generic and most understandable swap and can be easily templated/genericized.
This wikipedia section explains the issues quite well:
http://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice

The XOR method fails if a and b point to the same address. The first XOR will clear all of the bits at the memory address pointed to by both variables, so once the function returns, *a == *b == 0, regardless of the initial values.
More info on the Wiki page:
XOR swap algorithm
Although it's not likely that this issue would come up, I'd always prefer to use the method that's guaranteed to work, not the clever method that fails at unexpected moments.
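A minimal demonstration of that failure mode (hypothetical driver code, not from the answer above):
#include <stdio.h>

void swap_xor(int* a, int* b)
{
    *a ^= *b;    /* if a == b, this zeroes the value... */
    *b ^= *a;    /* ...and the remaining XORs keep it at zero */
    *a ^= *b;
}

int main(void)
{
    int x = 42;
    swap_xor(&x, &x);   /* both arguments alias the same int */
    printf("%d\n", x);  /* prints 0, not 42 */
    return 0;
}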

On a modern processor, you could use the following when sorting large arrays and see no difference in speed:
void swap (int *a, int *b)
{
for (int i = 1 ; i ; i <<= 1)
{
if ((*a & i) != (*b & i))
{
*a ^= i;
*b ^= i;
}
}
}
The really important part of your question is the 'why?' part. Now, going back 20 years to the 8086 days, the above would have been a real performance killer, but on the latest Pentium it would be a match, speed-wise, for the two you posted.
The reason is purely down to memory and has nothing to do with the CPU.
CPU speeds compared to memory speeds have risen astronomically. Accessing memory has become the major bottleneck in application performance. All the swap algorithms will be spending most of their time waiting for data to be fetched from memory. Modern systems can have up to 5 levels of memory:
Cache Level 1 - runs at the same speed as the CPU, has negligible access time, but is small
Cache Level 2 - runs a bit slower than L1 but is larger and has a bigger overhead to access (usually, data needs to be moved to L1 first)
Cache Level 3 - (not always present) Often external to the CPU, slower and bigger than L2
RAM - the main system memory, usually implements a pipeline so there's latency in read requests (CPU requests data, message sent to RAM, RAM gets data, RAM sends data to CPU)
Hard Disk - when there's not enough RAM, data is paged to HD which is really slow, not really under CPU control as such.
Sorting algorithms will make memory access worse since they usually access the memory in a very unordered way, thus incurring the inefficient overhead of fetching data from L2, RAM or HD.
So, optimising the swap method is really pointless - if it's only called a few times, then any inefficiency is hidden by the small number of calls; if it's called a lot, then any inefficiency is hidden by the number of cache misses (where the CPU needs to get data from L2 (1's of cycles), L3 (10's of cycles), RAM (100's of cycles), HD (!)).
What you really need to do is look at the algorithm that calls the swap method. This is not a trivial exercise. Although the Big-O notation is useful, an O(n) can be significantly faster than a O(log n) for small n. (I'm sure there's a CodingHorror article about this.) Also, many algorithms have degenerate cases where the code does more than is necessary (using qsort on nearly ordered data could be slower than a bubble sort with an early-out check). So, you need to analyse your algorithm and the data it's using.
Which leads to how to analyse the code. Profilers are useful but you do need to know how to interpret the results. Never use a single run to gather results; always average results over many executions, because your test application could have been paged to hard disk by the OS halfway through. Always profile release, optimised builds; profiling debug code is pointless.
As to the original question - which is faster? - it's like trying to figure out if a Ferrari is faster than a Lamborghini by looking at the size and shape of the wing mirror.

The first is faster because bitwise operations such as xor are usually very hard to visualize for the reader.
Faster to understand of course, which is the most important part ;)

Regarding @Harry:
Never implement functions as macros for the following reasons:
Type safety. There is none. The following only generates a warning when compiling but fails at run time:
float a=1.5f,b=4.2f;
swap (a,b);
A templated function will always be of the correct type (and why aren't you treating warnings as errors?).
EDIT: As there are no templates in C, you need to write a separate swap for each type or use some hacky memory access.
It's a text substitution. The following fails at run time (this time, without compiler warnings):
int a=1,temp=3;
swap (a,temp);
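(Harry's macro isn't shown on this page, so the definition below assumes the usual temp-based form purely for illustration.) The failure comes from the macro's internal variable shadowing the caller's:
#include <stdio.h>

#define swap(a, b) { int temp = (a); (a) = (b); (b) = temp; }

int main(void)
{
    int a = 1, temp = 3;
    swap(a, temp);
    /* Expands to { int temp = (a); (a) = (temp); (temp) = temp; }.
       The macro's local temp shadows the caller's temp, so nothing
       is swapped: a is still 1 and temp is still 3. */
    printf("a=%d temp=%d\n", a, temp);
    return 0;
}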
It's not a function. So, it can't be used as an argument to something like qsort.
Compilers are clever. I mean really clever. Made by really clever people. They can do inlining of functions. Even at link time (which is even more clever). Don't forget that inlining increases code size. Big code means more chance of cache miss when fetching instructions, which means slower code.
Side effects. Macros have side effects! Consider:
int &f1 ();
int &f2 ();
void func ()
{
swap (f1 (), f2 ());
}
Here, f1 and f2 will be called twice.
EDIT: A C version with nasty side effects:
int a[10], b[10], i=0, j=0;
swap (a[i++], b[j++]);
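With the same assumed macro, writing out the expansion makes the double evaluation explicit:
#include <stdio.h>

#define swap(a, b) { int temp = (a); (a) = (b); (b) = temp; }

int main(void)
{
    int a[10] = {0}, b[10] = {0}, i = 0, j = 0;
    swap(a[i++], b[j++]);
    /* Expands to { int temp = (a[i++]); (a[i++]) = (b[j++]); (b[j++]) = temp; }
       so i and j are each incremented twice, and the array accesses touch
       different elements instead of performing one clean swap. */
    printf("i=%d j=%d\n", i, j);
    return 0;
}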
Macros: Just say no!
EDIT: This is why I prefer to define macro names in UPPERCASE so that they stand out in the code as a warning to use with care.
EDIT2: To answer Leahn Novash's comment:
Suppose we have a non-inlined function, f, that is converted by the compiler into a sequence of bytes; then we can define the number of bytes thus:
bytes = C(p) + C(f)
where C() gives the number of bytes produced, C(f) is the bytes for the function and C(p) is the bytes for the 'housekeeping' code, the preamble and post-amble the compiler adds to the function (creating and destroying the function's stack frame and so on). Now, to call function f requires C(c) bytes. If the function is called n times then the total code size is:
size = C(p) + C(f) + n.C(c)
Now let's inline the function. C(p), the function's 'housekeeping', becomes zero since the function can use the stack frame of the caller. C(c) is also zero since there is now no call opcode. But, f is replicated wherever there was a call. So, the total code size is now:
size = n.C(f)
Now, if C(f) is less than C(c) then the overall executable size will be reduced. But, if C(f) is greater than C(c) then the code size is going to increase. If C(f) and C(c) are similar then you need to consider C(p) as well.
So, how many bytes do C(f) and C(c) come to? Well, the simplest C++ function would be a getter:
int GetValue () { return m_value; }
which would probably generate a single instruction:
mov eax,[ecx + offsetof (m_value)]
which is four bytes. A call instruction is five bytes. So, there is an overall size saving. If the function is more complex, say an indexer ("return m_value [index];") or a calculation ("return m_value_a + m_value_b;") then the code will be bigger.
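To put rough numbers on that, take the four-byte getter and five-byte call above and assume, purely for illustration, ten bytes of preamble/post-amble: not inlining costs 10 + 4 + 5n bytes in total, while inlining costs 4n bytes, so inlining shrinks the executable for every n. Swap in a hypothetical 24-byte function body instead and the comparison becomes 10 + 24 + 5n versus 24n, which tips in favour of the out-of-line call as soon as the function is called more than once.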

For those who stumble upon this question and decide to use the XOR method: you should consider inlining your function or using a macro to avoid the overhead of a function call:
#define swap(a, b) \
do { \
int temp = a; \
a = b; \
b = temp; \
} while(0)
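For the "inlining your function" half of that advice, a static inline wrapper is one possible sketch (not from the answer above) that avoids the macro pitfalls discussed elsewhere on this page while still letting the compiler remove the call overhead:
static inline void swap_xor(int* a, int* b)
{
    if (a != b)      /* guard against the aliasing failure described earlier */
    {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}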

Never understood the hate for macros. When used properly they can make code more compact and readable. I believe most programmers know macros should be used with care; what is important is making it clear that a particular call is a macro and not a function call (all caps). If SWAP(a++, b++); is a consistent source of problems, perhaps programming is not for you.
Admittedly, the xor trick is neat the first 5000 times you see it, but all it really does is save one temporary at the expense of reliability. Looking at the assembly generated above, it saves a register but creates dependencies. Also, I would not recommend xchg with a memory operand, since that form has an implied lock prefix.
Eventually we all come to the same place, after countless hours wasted on unproductive optimization and debugging caused by our most clever code - Keep it simple.
#define SWAP(type, a, b) \
do { type t=(a);(a)=(b);(b)=t; } while (0)
void swap(size_t esize, void* a, void* b)
{
char* x = (char*) a;
char* y = (char*) b;
char* z = x + esize;
for ( ; x < z; x++, y++ )
SWAP(char, *x, *y);
}
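A hypothetical usage of the byte-wise swap above, e.g. swapping two structs (the struct and the printed fields are made up for the example):
#include <stddef.h>
#include <stdio.h>

#define SWAP(type, a, b) \
do { type t=(a);(a)=(b);(b)=t; } while (0)

void swap(size_t esize, void* a, void* b)
{
    char* x = (char*) a;
    char* y = (char*) b;
    char* z = x + esize;
    for ( ; x < z; x++, y++ )
        SWAP(char, *x, *y);
}

struct point { double x, y; };

int main(void)
{
    struct point p = {1.0, 2.0}, q = {3.0, 4.0};
    swap(sizeof(struct point), &p, &q);  /* works for any plain-old-data type */
    printf("%.1f %.1f\n", p.x, q.x);     /* prints 3.0 1.0 */
    return 0;
}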

You are optimizing the wrong thing; both of those should be so fast that you'll have to run them billions of times just to get any measurable difference.
And just about anything will have a much greater effect on your performance. For example, if the values you are swapping are close in memory to the last value you touched, they are likely to be in the processor cache; otherwise you'll have to access memory - and that is several orders of magnitude slower than any operation you do inside the processor.
Anyway, your bottleneck is much more likely to be an inefficient algorithm or inappropriate data structure (or communication overhead) than how you swap numbers.

The only way to really know is to test it, and the answer may even vary depending on what compiler and platform you are on. Modern compilers are really good at optimizing code these days, and you should never try to outsmart the compiler unless you can prove that your way is really faster.
With that said, you'd better have a damn good reason to choose #2 over #1. The code in #1 is far more readable and because of that should always be chosen first. Only switch to #2 if you can prove that you need to make that change, and if you do - comment it to explain what's happening and why you did it the non-obvious way.
As an anecdote, I work with a couple of people that love to optimize prematurely and it makes for really hideous, unmaintainable code. I'm also willing to bet that more often than not they're shooting themselves in the foot because they've hamstrung the ability of the compiler to optimize the code by writing it in a non-straightforward way.

On modern CPU architectures, method 1 will be faster, and it is also more readable than method 2.
On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary variable swapping on the target architecture. Check out here for more info.
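To make the dependency argument concrete, here is an illustrative expansion (a sketch, not compiler output) showing why the XOR version forms one serial chain while the temporary-variable version does not:
void swap_xor_expanded(int* a, int* b)
{
    int t1 = *a ^ *b;   /* needs the loads of *a and *b       */
    *a = t1;
    int t2 = *b ^ t1;   /* needs t1, so it must wait          */
    *b = t2;
    *a = t1 ^ t2;       /* needs t1 and t2, so it waits again */
}

void swap_temp_expanded(int* a, int* b)
{
    int t1 = *a;        /* the two loads are independent...   */
    int t2 = *b;        /* ...and can be issued in parallel   */
    *a = t2;            /* so are the two stores              */
    *b = t1;
}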
Edit: Method 2 is a way of in-place swapping (i.e. without using extra variables). To make this question complete, I will add another in-place swapping by using +/-.
void swap(int* a, int* b)
{
if (a != b) // important: handles the case where a and b point to the same address
{
*a = *a+*b;
*b = *a-*b;
*a = *a-*b;
}
}
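One caveat the answer above doesn't mention: *a + *b can overflow, which is undefined behaviour for signed int. A sketch of the same idea on unsigned values, where wraparound is well defined:
#include <stdio.h>

void swap_u(unsigned* a, unsigned* b)
{
    if (a != b)            /* same aliasing guard as above         */
    {
        *a = *a + *b;      /* unsigned arithmetic wraps modulo 2^N */
        *b = *a - *b;
        *a = *a - *b;
    }
}

int main(void)
{
    unsigned x = 7, y = 9;
    swap_u(&x, &y);
    printf("%u %u\n", x, y);   /* prints 9 7 */
    return 0;
}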

I would not do it with pointers unless you have to. The compiler cannot optimize them very well because of the possibility of pointer aliasing (although if you can GUARANTEE that the pointers point to non-overlapping locations, GCC at least has extensions to optimize this).
And I would not do it with functions at all, since it's a very simple operation and the function call overhead is significant.
The best way to do it is with macros if raw speed and the possibility of optimization is what you require. In GCC you can use the typeof() builtin to make a flexible version that works on any built-in type.
Something like this:
#define swap(a,b) \
do { \
typeof(a) temp; \
temp = a; \
a = b; \
b = temp; \
} while (0)
...
{
int a, b;
swap(a, b);
unsigned char x, y;
swap(x, y); /* works with any type */
}
With other compilers, or if you require strict compliance with standard C89/99, you would have to make a separate macro for each type.
A good compiler will optimize this as aggressively as possible, given the context, if called with local/global variables as arguments.
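For the strict-C89/99 case mentioned above, one per-type approach (a sketch; the names are illustrative) is to generate each typed swap from a single template macro:
#define DEFINE_SWAP(T, name)      \
static void name(T* a, T* b)      \
{                                 \
    T temp = *a;                  \
    *a = *b;                      \
    *b = temp;                    \
}

DEFINE_SWAP(int, swap_int)        /* defines swap_int(int*, int*)           */
DEFINE_SWAP(double, swap_double)  /* defines swap_double(double*, double*)  */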

All the top rated answers are not actually definitive "facts"... they are people who are speculating!
You can definitively know for a fact which code takes fewer assembly instructions to execute, because you can look at the output assembly generated by the compiler and see which one compiles to fewer instructions!
Here is the C code I compiled with flags "gcc -std=c99 -S -O3 lookingAtAsmOutput.c":
#include <stdio.h>
#include <stdlib.h>
void swap_traditional(int * restrict a, int * restrict b)
{
int temp = *a;
*a = *b;
*b = temp;
}
void swap_xor(int * restrict a, int * restrict b)
{
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
int main() {
int a = 5;
int b = 6;
swap_traditional(&a,&b);
swap_xor(&a,&b);
}
ASM output for swap_traditional() takes >>> 11 <<< instructions ( not including "leave", "ret", "size"):
.globl swap_traditional
.type swap_traditional, #function
swap_traditional:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
pushl %ebx
movl (%edx), %ebx
movl (%ecx), %eax
movl %ebx, (%ecx)
movl %eax, (%edx)
popl %ebx
popl %ebp
ret
.size swap_traditional, .-swap_traditional
.p2align 4,,15
ASM output for swap_xor() takes >>> 11 <<< instructions not including "leave" and "ret":
.globl swap_xor
.type swap_xor, #function
swap_xor:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
movl 12(%ebp), %edx
movl (%ecx), %eax
xorl (%edx), %eax
movl %eax, (%ecx)
xorl (%edx), %eax
xorl %eax, (%ecx)
movl %eax, (%edx)
popl %ebp
ret
.size swap_xor, .-swap_xor
.p2align 4,,15
Summary of assembly output:
swap_traditional() takes 11 instructions
swap_xor() takes 11 instructions
Conclusion:
Both methods use the same number of instructions and are therefore approximately the same speed on this hardware platform.
Lesson learned:
When you have small code snippets, looking at the asm output is helpful to rapidly iterate your code and come up with the fastest (i.e. fewest instructions) code. You can even save time because you don't have to run the program for each code change. You only need to run the final version with a profiler to show that your changes are faster.
I use this method a lot for heavy DSP code that needs speed.

To answer your question as stated would require digging into the instruction timings of the particular CPU that this code will be running on, which would therefore require me to make a bunch of assumptions about the state of the caches in the system and the assembly code emitted by the compiler. It would be an interesting and useful exercise from the perspective of understanding how your processor of choice actually works, but in the real world the difference will be negligible.

x=x+y-(y=x);
float x; cout << "X:"; cin >> x;
float y; cout << "Y:" ; cin >> y;
cout << "---------------------" << endl;
cout << "X=" << x << ", Y=" << y << endl;
x=x+y-(y=x);
cout << "X=" << x << ", Y=" << y << endl;

In my opinion local optimizations like this should only be considered tightly related to the platform. It makes a huge difference if you are compiling this on a 16 bit uC compiler or on gcc with x64 as target.
If you have a specific target in mind then just try both of them and look at the generated asm code, or profile your application with both methods and see which is actually faster on your platform.

If you can use some inline assembler, you can do the following (pseudo-assembler):
PUSH A
A=B
POP B
You will save a lot of parameter passing and stack fix up code etc.

I just placed both swaps (as macros) in a hand-written quicksort I've been playing with. The XOR version was much faster (0.1 sec) than the one with the temporary variable (0.6 sec). The XOR did, however, corrupt the data in the array (probably the same-address thing Ant mentioned).
As it was a fat pivot quicksort, the XOR version's speed is probably from making large portions of the array the same. I tried a third version of swap which was the easiest to understand and it had the same time as the single temporary version.
acopy=a;
bcopy=b;
a=bcopy;
b=acopy;
[I just put an if statement around each swap, so it won't try to swap with itself, and the XOR now takes the same time as the others (0.6 sec)]

If your compiler supports inline assembler and your target is 32-bit x86 then the XCHG instruction is probably the best way to do this... if you really do care that much about performance.
Here is a method which works with MSVC++:
#include <stdio.h>
#define exchange(a,b) __asm mov eax, a \
__asm xchg eax, b \
__asm mov a, eax
int main(int arg, char** argv)
{
int a = 1, b = 2;
printf("%d %d --> ", a, b);
exchange(a,b)
printf("%d %d\r\n", a, b);
return 0;
}
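For GCC or Clang, a rough equivalent using extended asm might look like the sketch below (register-register xchg, so no implicit lock prefix is involved):
static inline void exchange(int* a, int* b)
{
    __asm__("xchg %0, %1"
            : "+r"(*a), "+r"(*b));  /* both values are read and written */
}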

void swap(int* a, int* b)
{
*a = (*b - *a) + (*b = *a);
}
// My C is a little rusty, so I hope I got the * right :)

The piece of code below will do the same. This snippet is an "optimized" way of programming only in the sense that it doesn't use any third variable.
x = x ^ y;
y = x ^ y;
x = x ^ y;

Another beautiful way.
#define Swap( a, b ) (a)^=(b)^=(a)^=(b)
Advantage:
No function call needed, and it's handy.
Drawback:
It fails when both inputs are the same variable, and it can be used only on integer variables. (Strictly speaking, the expression also modifies a variable twice without an intervening sequence point, so the behaviour is undefined in C.)

Related

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, negation is done using two's complement. With the other version, you're storing the value 0 into the int variable, and I'm not entirely sure what happens underneath.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that using the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason is that I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering if that's too slow; I could make a separate boolean 2D array to check whether the value was checked or not, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be: doing the -1 operation itself a billion times will already total up to around 5 seconds, not counting the rest of my program. But making a separate billion-square int array or creating a billion-square struct array doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
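As a concrete illustration of the "use volatile or an asm statement" advice, a benchmark loop along these lines (a sketch for GCC/Clang) survives -O2 without being optimized away:
#include <stdio.h>

int main(void)
{
    int n = 5;
    for (int i = 0; i < 999999999; i++)
    {
        n *= -1;
        n++;
        __asm__ volatile("" : "+r"(n));  /* opaque to the optimizer: n must exist each iteration */
    }
    printf("%d\n", n);
    return 0;
}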
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

Working inline assembly in C for bit parity?

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.
Currently I'm using the following function (by @Troyseph, found here):
uint parity64(uint64 n){
n ^= n >> 1;
n ^= n >> 2;
n = (n & 0x1111111111111111) * 0x1111111111111111;
return (n >> 60) & 1;
}
The same SO page has the following assembly routine (by @papadp):
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).
Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)
In case it's of any help, the parity64() function is called inside the following routine:
uint bindot(uint64* a, uint64* b, uint64 entries){
uint parity = 0;
for(uint i=0; i<entries; ++i)
parity ^= parity64(a[i] & b[i]); // Running sum!
return parity;
}
(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)
This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:
xor eax, eax
test ecx, ecx ; logically, should use RCX, but see below for behavior of PF
jnp jmp_over
mov eax, 1 ; or possibly even "inc eax"; would need to verify
jmp_over:
ret
Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:
xor eax, eax
test ecx, ecx
setp al
ret
Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):
xor eax, eax
test ecx, ecx
mov ecx, 1
cmovp eax, ecx
ret
And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.
There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what us humans are capable of producing, given a finite amount of time and resources, much less limited expertise.
In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universalizable. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not at all surprise me that an optimizer would emit code that used PF. You'd have to look at the disassembly to see for sure.
As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:
uint32 SwapBytes(uint32 x)
{
return ((x << 24) & 0xff000000 ) |
((x << 8) & 0x00ff0000 ) |
((x >> 8) & 0x0000ff00 ) |
((x >> 24) & 0x000000ff );
}
The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:
Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?
In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:
Have I ensured that my "optimized" code is returning the correct results?
This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.
…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.
Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):
Benchmark                        Time     CPU      Iterations
--------------------------------------------------------------
Naive                             36 ns    36 ns     19478261
OriginalCCode                      4 ns     4 ns    194782609
Ajay_Brahmakshatriya_Tweaked       4 ns     4 ns    194782609
Shreyas_Shivalkar                 37 ns    37 ns     17920000
TypeIA                             5 ns     5 ns    154482759
TypeIA_Tweaked                     4 ns     4 ns    160000000
has_even_parity                  227 ns   229 ns      3200000
has_even_parity_Tweaked           36 ns    36 ns     19478261
GCC_builtin_parityll               4 ns     4 ns    186666667
PopCount                           3 ns     3 ns    248888889
PopCount_Downlevel                 5 ns     5 ns    100000000
Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)
Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)
The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it to validate even your original C code, and also to provide a baseline for the benchmarks. Since it loops through each bit, one-by-one, it is relatively slow, as expected:
unsigned int Naive(uint64 n)
{
bool parity = false;
while (n)
{
parity = !parity;
n &= (n - 1);
}
return parity;
}
OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:
Ajay_Brahmakshatriya_Tweaked PROC
mov rax, rcx ; Windows x64 calling convention passes the parameter in RCX (System V uses RDI)
shr rax, 32
xor rcx, rax
mov rax, rcx
shr rax, 16
xor rcx, rax
mov rax, rcx
shr rax, 8
xor eax, ecx ; Ped7g's TEST is redundant; XOR already sets PF
setnp al
movzx eax, al
ret
Ajay_Brahmakshatriya_Tweaked ENDP
The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:
Shreyas_Shivalkar PROC
; unsigned int parity = 0;
; while (x != 0)
; {
; parity ^= x;
; x >>= 1;
; }
; return (parity & 0x1);
xor eax, eax
test rcx, rcx
je SHORT Finished
Process:
xor eax, ecx
shr rcx, 1
jne SHORT Process
Finished:
and eax, 1
ret
Shreyas_Shivalkar ENDP
TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.
unsigned int TypeIA(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n ^= n >> 2;
n ^= n >> 1;
return !((~n) & 1);
}
unsigned int TypeIA_Tweaked(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n &= 0xf;
return ((0x6996 >> n) & 1);
}
has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.
unsigned int has_even_parity(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
if (n & (b << i)) { ++count; }
}
return (count % 2);
}
has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:
unsigned int has_even_parity_Tweaked(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
count += static_cast<int>(static_cast<bool>(n & (b << i)));
}
return (count % 2);
}
Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.
GCC_builtin_parityll PROC ; GCC's __builtin_parityll
mov edx, ecx
shr rcx, 32
xor edx, ecx
mov eax, edx
shr edx, 16
xor eax, edx
xor al, ah
setnp al
movzx eax, al
ret
GCC_builtin_parityll ENDP
PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The definition of "parity" is an even number of set bits. Therefore, it can be calculated simply by counting the number of set bits and testing to see if that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.
(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this a very cheap operation, one that can generally be handled for "free" by register renaming.)
PopCount PROC
xor eax, eax ; break false dependency
popcnt rax, rcx
and eax, 1
ret
PopCount ENDP
In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown below). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.
PopCount_Downlevel PROC
mov rax, rcx
shr rax, 1
mov rdx, 5555555555555555h
and rax, rdx
sub rcx, rax
mov rax, 3333333333333333h
mov rdx, rcx
and rcx, rax
shr rdx, 2
and rdx, rax
add rdx, rcx
mov rcx, 0F0F0F0F0F0F0F0Fh
mov rax, rdx
shr rax, 4
add rax, rdx
mov rdx, 0101010101010101h
and rax, rcx
imul rax, rdx
shr rax, 56
and eax, 1
ret
PopCount_Downlevel ENDP
As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:
unsigned int PopCount_Downlevel(uint64 n)
{
uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
temp = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
temp = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
temp = (temp * 0x0101010101010101ULL) >> 56;
return (temp & 1);
}
But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.
Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.
Because C sucks when handling bit operations, I suggest using gcc built in functions, in this case __builtin_parityl(). See:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
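A minimal sketch of that suggestion for the 64-bit case (note __builtin_parityll for unsigned long long operands; uint/uint64 in the question are the OP's typedefs):
#include <stdint.h>

static unsigned parity64_builtin(uint64_t n)
{
    return (unsigned) __builtin_parityll(n);  /* GCC/Clang built-in: 1 if an odd number of bits are set */
}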
You will have to use extended inline assembly (which is a gcc extension) to get the similar effect.
Your parity64 function can be changed as follows -
uint parity64_unsafe_and_broken(uint64 n){
uint result = 0;
__asm__("addq $0, %0" : : "r"(n) :);
// editor's note: compiler-generated instructions here can destroy EFLAGS
// Don't depend on FLAGS / regs surviving between asm statements
// also, jumping out of an asm statement safely requires asm goto
__asm__("jnp 1f");
__asm__("movl $1, %0" : "=r"(result) : : );
__asm__("1:");
return result;
}
But as commented by @MichaelPetch, the parity flag is computed only on the lower 8 bits. So this will work for you if your n fits in those 8 bits (i.e. is less than 256). For bigger numbers you will have to use the code you mentioned in your question.
To get it working for 64 bits you can collapse the parity of the 64-bit integer into a single byte by doing
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
This code will have to be just at the start of the function before the assembly.
You will have to check how it affects the performance.
The most optimized I could get it is
uint parity64(uint64 n){
unsigned char result = 0;
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
__asm__("test %1, %1 \n\t"
"setp %0"
: "+r"(result)
: "r"(n)
:
);
return result;
}
How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.
You shouldn't include assembly into your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately and use the linker to link them.
In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
You can compile this to an object file like so: gcc -c parity64.c -o parity64.o
... and then link the object code generated from assembly, with the C code: gcc bindot.c parity64.o -o bindot
In parity64_x86.s you might have the following assembly code from your question:
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
You can compile this to an alternative parity64.o object file using gcc with this command: gcc -c parity64_x86.s -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdante's answer), you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:
bool CheckParity(size_t result) {
return __builtin_parityl(result);
}
To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c
Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:
extern bool CheckParity(size_t Result);
uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
uint64_t parity = 0;
for(size_t i = 0; i < entries; ++i)
parity ^= a[i] & b[i]; // Running sum!
return CheckParity(parity);
}
You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...
I highly recommend reading the manual for your compiler, when you have the time...

*str and *str++

I have this code (my strlen function)
size_t slen(const char *str)
{
size_t len = 0;
while (*str)
{
len++;
str++;
}
return len;
}
With while (*str++), as shown below, the program's execution time is much longer:
while (*str++)
{
len++;
}
I'm doing this to exercise the code:
int main()
{
double i = 11002110;
const char str[] = "long string here blablablablablablablabla";
while (i--)
slen(str);
return 0;
}
In the first case the execution time is around 6.7 seconds, while in the second (using *str++) the time is around 10 seconds!
Why so much difference?
Probably because the post-increment operator (used in the condition of the while statement) involves keeping a temporary copy of the variable with its old value.
What while (*str++) really means is:
while (tmp = *str, ++str, tmp)
...
By contrast, when you write str++; as a single statement in the body of the while loop, it is in a void context, hence the old value isn't fetched because it's not needed.
To summarise, in the *str++ case you have an assignment, 2 increments, and a jump in each iteration of the loop. In the other case you only have 2 increments and a jump.
Trying this out on ideone.com, I get about 0.5s execution with *str++ here. Without, it takes just over a second (here). Using *str++ was faster. Perhaps with optimisation on *str++ can be done more efficiently.
This depends on your compiler, compiler flags, and your architecture. With Apple's LLVM gcc 4.2.1, I don't get a noticeable change in performance between the two versions, and there really shouldn't be. A good compiler would turn the *str version into something like
IA-32 (AT&T Syntax):
slen:
pushl %ebp # Save old frame pointer
movl %esp, %ebp # Initialize new frame pointer
movl 8(%ebp), %ecx # Load str into %ecx
xor %eax, %eax # Zero out %eax to hold len
loop:
cmpb $0, (%ecx) # Compare *str to 0
je done # If *str is NUL, finish
incl %eax # len++
incl %ecx # str++
jmp loop # Goto next iteration
done:
popl %ebp # Restore old frame pointer
ret # Return
The *str++ version could be compiled exactly the same (since changes to str aren't visible outside slen, when the increment actually occurs isn't important), or the body of the loop could be:
loop:
incl %ecx # str++
cmpb $0, -1(%ecx) # Compare *str to 0
je done # If *str is NUL, finish
incl %eax # len++
jmp loop # Goto next iteration
Others have already provided some excellent commentary, including analysis for the generated assembly code. I strongly recommend that you read them carefully. As they have pointed out, this sort of question can't really be answered without some quantification, so let's play with it a bit.
First, we're going to need a program. Our plan is this: we will generate strings whose lengths are powers of two, and try all functions in turn. We run through once to prime the cache and then separately time 4096 iterations using the highest-resolution clock available to us. Once we are done, we will calculate some basic statistics: min, max, and the simple moving average, and dump them. We can then do some rudimentary analysis.
In addition to the two algorithms you've already shown, I will show a third option which doesn't involve the use of a counter at all, relying instead on a subtraction, and I'll mix things up by throwing in std::strlen, just to see what happens. It'll be an interesting throwdown.
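The subtraction-based variant isn't reproduced in the answer, but it presumably looks something like this sketch:
#include <stddef.h>

size_t slen3(const char *str)
{
    const char *s = str;
    while (*s)
        s++;
    return (size_t)(s - str);   /* length = end pointer minus start pointer */
}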
Through the magic of television our little program is already written, so we compile it with gcc -std=c++11 -O3 speed.c and we get cranking producing some data. I've done two separate graphs, one for strings whose size is from 32 to 8192 bytes and another for strings whose size is from 16384 all the way to 1048576 bytes long. In the following graphs, the Y axis is the time consumed in nanoseconds and the X axis shows the length of the string in bytes.
Without further ado, let's look at performance for "small" strings from 32 to 8192 bytes:
Now this is interesting. Not only is the std::strlen function outperforming everything across the board, it's doing it with gusto too, since its performance is a lot more stable.
Will the situation change if we look at larger strings, from 16384 all the way to 1048576 bytes long?
Sort of. The difference is becoming even more pronounced. As our custom-written functions huff-and-puff, std::strlen continues to perform admirably.
An interesting observation to make is that you can't necessarily translate number of C++ instructions (or even, number of assembly instructions) to performance, since functions whose bodies consist of fewer instructions sometimes take longer to execute.
An even more interesting (and important) observation is just how well the std::strlen function performs.
So what does all this get us?
First conclusion: don't reinvent the wheel. Use the standard functions available to you. Not only are they already written, but they are very very heavily optimized and will almost certainly outperform anything you can write unless you're Agner Fog.
Second conclusion: unless you have hard data from a profiler that a particular section of code or function is a hot-spot in your application, don't bother optimizing code. Programmers are notoriously bad at detecting hot-spots by looking at high-level code.
Third conclusion: prefer algorithmic optimizations in order to improve your code's performance. Put your mind to work and let the compiler shuffle bits around.
Your original question was: "why is function slen2 slower than slen1?" I could say that it isn't easy to answer without a lot more information, and even then it might be a lot longer and more involved than you care for. Instead what I'll say is this:
Who cares why? Why are you even bothering with this? Use std::strlen - which is better than anything that you can rig up - and move on to solving more important problems - because I'm sure that this isn't the biggest problem in your application.

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us by our instructor so we could measure some algorithms' performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code sometimes works: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
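To make the idea concrete, here is a minimal sketch of a CPUID-serialized TSC read in GNU C inline asm (the serialized_rdtsc name is mine, and the constraint choices follow the usual Intel benchmarking recipe; treat it as an illustration, not a drop-in solution):
#include <stdint.h>

/* Sketch: CPUID is a serializing instruction, so issuing it immediately
   before RDTSC keeps the timestamp read from drifting ahead of earlier
   instructions. The CPUID results themselves are discarded. */
static inline uint64_t serialized_rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ (
        "cpuid\n\t"                 /* serialize */
        "rdtsc"                     /* EDX:EAX = timestamp counter */
        : "=a" (lo), "=d" (hi)
        : "a" (0)                   /* CPUID leaf 0 */
        : "%ebx", "%ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
You would call this around the region being timed, much like the original start_counter/get_counter pair.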
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit math carries yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
#if __WORDSIZE == 64
asm volatile ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax"
: "=a" (ret)   /* the full 64-bit result ends up in RAX */
: /* no input */
: "%rdx"
);
#else
asm volatile ("rdtsc"
: "=A" (ret)
);
#endif
return ret;
}
Also, you should consider printing out the values you're getting from this, so you can see whether you're getting zeros or something else.
As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
Passed directly to the real hardware (PV guest)
Count cycles while the VM is executing on the host processor (Windows / etc)
Note the "while the VM is executing on the host processor" in #2. The same goes for Xen as well, if I recall correctly. In essence, you can expect the code to work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and #Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
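If you'd rather avoid inline asm entirely, a minimal usage sketch of the __rdtsc() intrinsic looks like this (header shown for GCC/Clang; MSVC provides the same intrinsic via <intrin.h>):
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

int main(void)
{
    unsigned long long start = __rdtsc();
    /* ... code to be timed ... */
    unsigned long long end = __rdtsc();
    printf("elapsed reference cycles: %llu\n", end - start);
    return 0;
}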
hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious about whether you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? The fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" on top of the 2^30 already hints at this possibility... you might need to convert each sub-component, hi and lo, to a double (instead of a single cast at the very end) and do the multiplication using the two doubles.
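If you want to follow that suggestion, the combination step might look something like this (a sketch only; the helper name is mine, and 4294967296.0 is simply 2^32 written out as a double, so no integer shift or multiply is involved):
#include <stdint.h>

/* Sketch: convert each 32-bit half to double before combining them,
   so no 32-bit integer arithmetic is performed on the high word. */
static double combine_tsc_halves(uint32_t hi, uint32_t lo)
{
    return (double) hi * 4294967296.0 + (double) lo;   /* 4294967296 = 2^32 */
}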

Faster approach to checking for an all-zero buffer in C?

I am searching for a faster method of accomplishing this:
int is_empty(char * buf, int size)
{
int i;
for(i = 0; i < size; i++) {
if(buf[i] != 0) return 0;
}
return 1;
}
I realize I'm searching for a micro-optimization that's unnecessary except in extreme cases, but I know a faster method exists, and I'm curious what it is.
On many architectures, comparing 1 byte takes the same amount of time as 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to, e.g., use a vector unit.
Also, branch mispredictions really hurt, so it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bit-OR them together and test the final result.
Expressing this is difficult in portable C: casting a char* to long* violates strict aliasing. But fortunately you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will optimize it to the asm you want.
For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) from loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2, because many CPUs will set flags according to an OR result, so this lets you check two words for the price of one.
Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
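A rough, portable sketch of that approach (the is_all_zero name is mine, and it uses plain uint64_t chunks rather than uint_fast32_t; the Godbolt link above has a more careful version with explicit alignment handling):
#include <stdint.h>
#include <string.h>

/* Sketch: use memcpy as an aliasing-safe unaligned load, OR two 64-bit
   chunks together, and branch only once per pair of words. Tail bytes
   are checked one at a time. */
int is_all_zero(const char *buf, size_t size)
{
    size_t i = 0;
    for (; i + 2 * sizeof(uint64_t) <= size; i += 2 * sizeof(uint64_t)) {
        uint64_t a, b;
        memcpy(&a, buf + i, sizeof a);
        memcpy(&b, buf + i + sizeof a, sizeof b);
        if (a | b)
            return 0;
    }
    for (; i < size; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}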
One potential way, inspired by Kieveli's dismissed idea:
int is_empty(char *buf, size_t size)
{
static const char zero[999] = { 0 };
return !memcmp(zero, buf, size > 999 ? 999 : size);
}
Note that you can't make this solution work for arbitrary sizes. You could do this:
int is_empty(char *buf, size_t size)
{
char *zero = calloc(size, 1);
int i = memcmp(zero, buf, size);
free(zero);
return !i;
}
But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution is faster is because it can use memcmp(), which is going to be hand-optimized in assembly language by the library writers and will be much faster than anything you could code in C.
EDIT: An optimization no one else has mentioned, based on earlier observations about the likelihood of the buffer being in state X: If a buffer isn't empty, is it more likely to be non-empty at the beginning or at the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.
EDIT 2: Thanks to Accipitridae in the comments:
int is_empty(char *buf, size_t size)
{
return buf[0] == 0 && !memcmp(buf, buf + 1, size - 1);
}
This basically compares the buffer to itself, offset by one byte, with an initial check to see whether the first element is zero. That way, any non-zero element will cause memcmp() to fail. I don't know how this would compare to the other versions, but I do know that it will fail quickly (before we even loop) if the first element is non-zero. If you're more likely to have cruft at the end, change buf[0] to buf[size-1] to get the same effect.
The benchmarks given in the answer below (https://stackoverflow.com/a/1494499/2154139) are not accurate. They imply that func3 is much faster than the other options.
However, if you change the order of the tests, so that func3 comes before func2, you'd see that func2 is much faster.
Be careful when running combined benchmarks within a single execution... the side effects are large, especially when reusing the same variables. Better to run the tests in isolation!
For example, changing it to:
int main(){
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
MEASURE( func3 );
}
gives me:
func3: zero 14243
func3: zero 1142
func3: zero 885
func3: zero 848
func3: zero 870
This was really bugging me as I couldn't see how func3 could perform so much faster than func2.
(Apologies for writing this as an answer and not as a comment; I didn't have the reputation.)
Four functions for testing zeroness of a buffer with simple benchmarking:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <inttypes.h>
#define SIZE (8*1024)
char zero[SIZE] __attribute__(( aligned(8) ));
#define RDTSC(var) __asm__ __volatile__ ( "rdtsc" : "=A" (var));
#define MEASURE( func ) { \
uint64_t start, stop; \
RDTSC( start ); \
int ret = func( zero, SIZE ); \
RDTSC( stop ); \
printf( #func ": %s %12"PRIu64"\n", ret?"non zero": "zero", stop-start ); \
}
int func1( char *buff, size_t size ){
while(size--) if(*buff++) return 1;
return 0;
}
int func2( char *buff, size_t size ){
return *buff || memcmp(buff, buff+1, size-1);
}
int func3( char *buff, size_t size ){
return *(uint64_t*)buff || memcmp(buff, buff+sizeof(uint64_t), size-sizeof(uint64_t));
}
int func4( char *buff, size_t size ){
return *(wchar_t*)buff || wmemcmp((wchar_t*)buff, (wchar_t*)buff+1, size/sizeof(wchar_t)-1);
}
int main(){
MEASURE( func1 );
MEASURE( func2 );
MEASURE( func3 );
MEASURE( func4 );
}
Result on my old PC:
func1: zero 108668
func2: zero 38680
func3: zero 8504
func4: zero 24768
If your program is x86-only or x64-only, you can easily optimize using inline assembler. The REPE SCASD instruction will scan a buffer until a dword that doesn't match EAX is found.
Since there is no equivalent standard library function, probably no compiler/optimizer will be able to use these instructions (as confirmed by Sufian's code).
Off the top of my head, something like this would do if your buffer length is 4-byte aligned (MASM syntax):
_asm {
CLD ; search forward
XOR EAX, EAX ; search for non-zero
LEA EDI, [buf] ; search in buf
MOV ECX, [buflen] ; search buflen bytes
SHR ECX, 2 ; using dwords so len/=4
REPE SCASD ; perform scan
JZ bufferEmpty ; ZF still set? then every dword was 0
}
Tomas
EDIT: updated with Tony D's fixes
For something so simple, you'll need to see what code the compiler is generating.
$ gcc -S -O3 -o empty.s empty.c
And the contents of the assembly:
.text
.align 4,0x90
.globl _is_empty
_is_empty:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %edx ; edx = size
movl 8(%ebp), %ecx ; ecx = pointer to buffer
testl %edx, %edx
jle L3
xorl %eax, %eax
cmpb $0, (%ecx)
jne L5
.align 4,0x90
L6:
incl %eax ; real guts of the loop are in here
cmpl %eax, %edx
je L3
cmpb $0, (%ecx,%eax) ; compare byte-by-byte of buffer
je L6
L5:
leave
xorl %eax, %eax
ret
.align 4,0x90
L3:
leave
movl $1, %eax
ret
.subsections_via_symbols
This is very optimized. The loop does three things:
Increase the offset
Compare the offset to the size
Compare the byte-data in memory at base+offset to 0
It could be optimized slightly more by comparing on a word-by-word basis, but then you'd need to worry about alignment and such.
When all else fails, measure first, don't guess.
Try checking the buffer using an int-sized variable where possible (it should be aligned).
Off the top of my head (uncompiled, untested code follows - there's almost certainly at least one bug here. This just gives the general idea):
/* check the start of the buf byte by byte while it's unaligned */
while (size && !int_aligned( buf)) {
if (*buf != 0) {
return 0;
}
++buf;
--size;
}
/* check the bulk of the buf int by int while it's aligned */
size_t n_ints = size / sizeof( int);
size_t rem = size % sizeof( int);
int* pInts = (int*) buf;
while (n_ints) {
if (*pInts != 0) {
return 0;
}
++pInts;
--n_ints;
}
/* now wrap up the remaining unaligned part of the buf byte by byte */
buf = (char*) pInts;
while (rem) {
if (*buf != 0) {
return 0;
}
++buf;
--rem;
}
return 1;
With x86 you can use SSE to test 16 bytes at a time:
#include "smmintrin.h" // note: requires SSE 4.1
int is_empty(const char *buf, const size_t size)
{
size_t i;
for (i = 0; i + 16 <= size; i += 16)
{
__m128i v = _mm_loadu_si128((__m128i *)&buf[i]);
if (!_mm_testz_si128(v, v))
return 0;
}
for ( ; i < size; ++i)
{
if (buf[i] != 0)
return 0;
}
return 1;
}
This can probably be further improved with loop unrolling.
On modern x86 CPUs with AVX you can even use 256 bit SIMD and test 32 bytes at a time.
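For example, a 32-bytes-at-a-time variant might look roughly like this (a sketch, not a tuned implementation; it requires AVX and a flag such as -mavx when compiling):
#include <stddef.h>
#include <immintrin.h>   // AVX

int is_empty_avx(const char *buf, const size_t size)
{
    size_t i;
    for (i = 0; i + 32 <= size; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)&buf[i]);
        if (!_mm256_testz_si256(v, v))   // ZF is set only if all 256 bits are zero
            return 0;
    }
    for ( ; i < size; ++i)
    {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}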
The Hackers Delight book/site is all about optimized C/assembly. Lots of good references from that site also and is fairly up to date (AMD64, NUMA techniques also).
Look at fast memcpy - it can be adapted for memcmp (or memcmp against a constant value).
I see a lot of people saying things about alignment issues preventing you from doing word-sized accesses, but that's not always true. If you're looking to make portable code, then this is certainly an issue; however, x86 will actually tolerate misaligned accesses. For example, this will only fail on x86 if alignment checking is turned on in EFLAGS (and of course buf is actually not word aligned).
int is_empty(char * buf, int size) {
int i;
for(i = 0; i + 4 <= size; i+= 4) {
if(*(int *)(buf + i) != 0) {
return 0;
}
}
for(; i < size; i++) {
if(buf[i] != 0)
return 0;
}
return 1;
}
Regardless, the compiler CAN convert your original loop into a loop of word-based comparisons with extra jumps to handle alignment issues; however, it will not do this at any normal optimization level because it lacks information. For cases where size is small, unrolling the loop in this way would make the code slower, and the compiler wants to be conservative.
A way to get around this is to make use of profile-guided optimization. If you let GCC gather profile information on the is_empty function and then recompile it, it will be willing to unroll the loop into word-sized comparisons with an alignment check. You can also force this behavior with -funroll-all-loops.
Did anyone mention unrolling the loop? In any of these loops, the loop overhead and indexing are going to be significant.
Also, what is the probability that the buffer will actually be empty? That's the only case where you have to check all of it.
If there typically is some garbage in the buffer, the loop should stop very early, so it doesn't matter.
If you plan to clear it to zero if it's not zero, it would probably be faster just to clear it with memset(buf, 0, size), whether or not it's already zero.
What about looping from size to zero (cheaper checks):
int is_empty(char * buf, int size)
{
while(size --> 0) {
if(buf[size] != 0) return 0;
}
return 1;
}
It must be noted that we probably cannot outperform the compiler, so enable the most aggressive speed optimization in your compiler and assume that you're unlikely to go any faster.
Or handling everything using pointers (not tested, but likely to perform quite well):
int is_empty(char* buf, int size)
{
char* org = buf;
if (buf[size-1] != 0)
return 0;
buf[size-1] = 1; // sentinel: guarantees the scan below terminates
while(! *buf++);
buf--; // buf now points at the first non-zero byte
org[size-1] = 0; // restore the byte we overwrote
return buf == org + size - 1;
}
You stated in your question that you are looking for a most likely unnecessary micro-optimization. In 'normal' cases the ASM approach by Thomas and others should give you the fastest results.
Still, this is forgetting the big picture. If your buffer is really large, then starting at the beginning and essentially doing a linear search is definitely not the fastest way to do this. Assume the buffers your cp replacement sees consist mostly of large consecutive empty regions, with a few non-empty bytes near the end of the array. All linear searches would require reading the whole array. On the other hand, a quicksort-inspired (divide-and-conquer) algorithm could find any non-zero elements and abort much sooner for a large enough dataset.
So before doing any kind of micro-optimization I would look closely at the data in your buffer and see if that gives you any patterns. For a single '1' randomly placed in the buffer, a linear search (disregarding threading/parallelization) will be the fastest approach; in other cases, not necessarily so.
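As a rough illustration of that divide-and-conquer idea (a sketch only, with a hypothetical is_zero_recursive helper; whether it beats a plain linear scan depends entirely on where the non-zero bytes tend to sit):
#include <stddef.h>

/* Sketch: probe the middle byte first, then recurse into both halves.
   If non-zero bytes tend to sit far from the start of the buffer, this
   can bail out much earlier than a strict left-to-right scan. */
static int is_zero_recursive(const char *buf, size_t size)
{
    if (size == 0)
        return 1;
    if (buf[size / 2] != 0)
        return 0;
    return is_zero_recursive(buf, size / 2) &&
           is_zero_recursive(buf + size / 2 + 1, size - size / 2 - 1);
}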
Inline assembly version of the initial C code (no error checking: if uiSize == 0 and/or the array is not allocated, exceptions will be generated). Perhaps use try {} catch(), as this might be faster than adding a lot of checks to the code. Or do as I do: try not to call functions with invalid values (that usually does not work). At least add a NULL pointer check and a size != 0 check; that is very easy.
unsigned int IsEmpty(char* pchBuffer, unsigned int uiSize)
{
asm {
push edi
push ecx
mov edi, [pchBuffer] // SCASB compares AL with ES:[EDI], so the buffer pointer goes in EDI
mov ecx, [uiSize]
// add NULL ptr and size check here
cld // scan forward
xor eax, eax // AL = 0, the value each byte is compared against
repe scasb // repeat while BYTE PTR es:[EDI] == AL and ECX != 0; EDI is incremented by 1 per byte
jne char_not_zero // ZF clear? scan stopped on a non-zero byte
all_chars_zero:
mov eax, 1 // Set return value (works in MASM)
jmp done
char_not_zero:
mov eax, 0 // Still not sure if this works in inline asm
done:
pop ecx
pop edi
}
}
That is written on the fly, but it looks correct enough; corrections are welcome. And if someone knows how to set the return value from inline asm, please do tell.
int is_empty(char * buf, int size)
{
int i, content=0;
for(i = 0; !content && i < size; i++)
{
content = content | buf[i]; // bitwise or
}
return (content==0);
}
int is_empty(char * buf, int size)
{
return buf[0] == '\0';
}
If your buffer is a character string, I think that's the fastest way to check...
memcmp() would require you to create a buffer of the same size and then use memset to set it all to 0. I doubt that would be faster...
Edit: Bad answer
A novel approach might be
int is_empty(char * buf, int size) {
char start = buf[0];
char end = buf[size-1];
buf[0] = 'x';
buf[size-1] = '\0';
int result = strlen(buf) == 0;
buf[0] = start;
buf[size-1] = end;
return result;
}
Why the craziness? Because strlen is one of the library functions that's more likely to be optimized.
Storing and replacing the first character is to prevent the false positive. Storing and replacing the last character is to make sure it terminates.
The initial C algorithm is pretty much as slow as it can be in VALID C.
If you insist on using C then try a "while" loop instead of "for":
int i = 0;
while (i< MAX)
{
// operate on the string
i++;
}
This is pretty much the fastest one-dimensional string-operation loop you can write in C, unless you can force the compiler to put i in a register with the "register" keyword (but I am told that this is almost always ignored by modern compilers).
Also, searching a constant-sized array to check whether it is empty is very wasteful, and 0 is not "empty"; it is a value in the array.
A better solution for speed would be to use a dynamic array (int* piBuffer) and a variable that stores the current size (unsigned int uiBufferSize); when the array is empty, the pointer is NULL and uiBufferSize is 0. Make a class with these two as protected member variables. One could also easily write a template for dynamic arrays, which would store 32-bit values, either primitive types or pointers. For primitive types there is not really any way to test for "empty" (I interpret this as "undefined"), but you can of course define 0 to represent an available entry. For an array of pointers you should initialize all entries to NULL, and set an entry back to NULL when you have just deallocated its memory; since NULL means "points at nothing", this is a very convenient way to represent empty.
One should not use dynamically resized arrays in really complicated algorithms, at least not in the development phase; there are simply too many things that can go wrong. One should at least first implement the algorithm using an STL container (or a well-tested alternative) and then, when the code works, swap the tested container for a simple dynamic array (and if you can avoid resizing the array too often, the code will be both faster and more fail-safe).
A better solution for complicated and cool code is to use either std::vector or std::map (or any container class: STL, homegrown or 3rd party), depending on your needs, but looking at your code I would say that std::vector is enough. The STL containers are templates, so they should be pretty fast too. Use an STL container to store object pointers (always store object pointers and not the actual objects; copying entire objects for every entry will really mess up your execution speed), and dynamic arrays for more basic data (bitmaps, sound, etc.), i.e. primitive types. Generally.
I came up with the REPE SCASW solution independently by studying x86 assembly language manuals, and I agree that the example using this string-operation instruction is the fastest. The other assembly example, which has separate compare, jump, etc. instructions, is almost certainly slower (but still much faster than the initial C code, so still a good post), as the string operations are among the most highly optimized on all modern CPUs; they may even have their own logic circuitry (does anyone know?).
The REPE SCASD does not need to fetch a new instruction nor increase the instruction pointer, and that is just the stuff an assembly novice like me can come up with; on top of that there is the hardware optimization. String operations are critical for almost all kinds of modern software, in particular multimedia applications (copying PCM sound data, uncompressed bitmap data, etc.), so optimizing these instructions must have been a very high priority every time a new 80x86 chip was being designed.
I use it for a novel 2d sprite collision algorithm.
It says that I am not allowed to have an opinion, so consider the following an objective assessment: modern compilers (unmanaged C/C++; pretty much everything else is managed code and is slow as hell) are pretty good at optimizing, but it cannot be avoided that for VERY specific tasks the compiler generates redundant code. One could look at the assembly that the compiler outputs so that one does not have to translate a complicated algorithm entirely from scratch, even though it is very fun to do (for some) and much more rewarding to do code the hard way. Anyway, algorithms using "for" loops, in particular with regards to string operations, can often be optimized very significantly, as the for loop generates a lot of code that is often not needed. Example:
for (int i = 1000; i>0; i--) DoSomething(); This line generates 6-10 lines of assembly if the compiler is not very clever (it might be), but the optimized assembly version CAN be:
mov cx, 1000
_DoSomething:
// loop code....or call Func, slower but more readable
loop _DoSomething
That was 2 lines, and it does exactly the same as the C line (it uses registers instead of memory addresses, which is MUCH faster; arguably this is not EXACTLY the same as the C line, but that is semantics). How much of an optimization this example is depends on how well modern compilers optimize, which I have no clue about, but algorithm analysis based on the goal of implementing an algorithm with the fewest and fastest assembly lines often works well. I have had very good results with first implementing the algorithm in C/C++ without caring about optimization and then translating and optimizing it in assembly. The fact that each C line becomes many assembly lines often makes some optimizations very obvious, and also some instructions are faster than others:
INC DX ; is faster than:
ADD DX,1 ;if ADD DX,1 is not just replaced with INC DX by the assembler or the CPU
LOOP ; is faster than manually decreasing, comparing and jumping
REPxx STOSx/MOVSx/LODSx is faster than using cmp, je/jne/ja, etc. and a loop
JMP or conditional jumping is faster than using CALL, so in a loop that is executed VERY frequently (like rendering), inlining functions in the code so they are reachable with "local" jumps can also boost performance.
The last bit is very relevant for this question, fast string operations.
So this post is not all rambling.
And lastly, design your assembly algorithm in the way that requires the fewest jumps for a typical execution.
Also, don't bother optimizing code that is not called that often; use a profiler, see what code is called most often, and start with that. Anything that is called less than 20 times a second (and completes much faster than 1000 ms / 20) is not really worth optimizing. Look at code that is not synchronized to timers and the like and is executed again immediately after it has completed. On the other hand, if your rendering loop can do 100+ FPS on a modest machine, it does not make sense economically to optimize it; but real coders love to code and do not care about economics, so they optimize the AppStart() method into 100% assembly even though it is only called once :) Or use a z rotation matrix to rotate Tetris pieces 90 degrees :P Anyone who does that is awesome!
If anyone has some constructive correction, which is not VERY hurtful, then I would love to hear it; I code almost entirely by myself, so I am not really exposed to any influences. I once paid a nice Canadian game developer to teach me Direct3D, and though I could just as easily have read a book, the interaction with another coder who was somewhat above my level in certain areas was fun.
Thanks for good content generally. I think I will go and answer some of the simpler questions, give a little back.
