Related
I have several similar functions, say A, B, and C, and I want to choose one of them with a command line option. I'm calling that function a billion times, so instead of checking a variable inside the function a billion times, I define a function pointer Phi and set it to the desired function just once. But when I set Phi = A directly (so no user input considered), my code runs in ~24 secs; when I add an if-else and set Phi to the desired function, my code runs in ~30 secs with the exact same parameters (and of course the command line option sets Phi to A). What is the efficient way to handle this case?
My functions:
double funcA(double r)
{
    return 0;
}
double funcB(double r)
{
    return 1;
}
double funcC(double r)
{
    return r;
}
void computationFunctionFast(Context *userInputs) {
    double (*Phi)(double) = funcA;
    /* computation codes */
}
void computationFunctionSlow(Context *userInputs) {
    double (*Phi)(double);
    switch (userInputs->funcEnum) {
    case A:
        Phi = funcA;
        break;
    case B:
        Phi = funcB;
        break;
    case C:
        Phi = funcC;
    }
    /* computation codes */
}
I've tried gcc, clang, and icx with -O2 and -O3 optimizations. (gcc shows no performance difference between the mentioned cases, but has the worst performance overall.) Although I'm using C, I've tried std::function too. I've tried defining the Phi function in different scopes, etc.
Generally, there are a few things here that are slightly bad for performance:
Branches/comparisons lead to inefficient use of branch prediction/instruction cache and might affect pipelining too.
Function pointers are notoriously inefficient since they generally block inlining, and the compiler usually can't do much about them.
Here's an example based on your code:
double computationFunctionSlow (int input, double val) {
    double (*Phi)(double);

    switch (input) {
    case 0: Phi = funcA; break;
    case 1: Phi = funcB; break;
    case 2: Phi = funcC; break;
    }

    double res = Phi(val);
    return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
cmp edi, 2
ja .LBB3_1
movsxd rax, edi
lea rcx, [rip + .Lswitch.table.computationFunctionSlow]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.LBB3_1:
xorps xmm0, xmm0
ret
.Lswitch.table.computationFunctionSlow:
.quad funcA
.quad funcB
.quad funcC
Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp. Even when I include a default: return 0; it is still there. You can quite easily manually optimize any switch with contiguous indices like this into a function pointer jump table:
double computationFunctionSlow (int input, double val) {
    double (*Phi[3])(double) = {funcA, funcB, funcC};
    double res = Phi[input](val);
    return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
movsxd rax, edi
lea rcx, [rip + .L__const.computationFunctionSlow.Phi]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.L__const.computationFunctionSlow.Phi:
.quad funcA
.quad funcB
.quad funcC
This leads to slightly better code here, as the comparison instruction/branch is now removed. However, this is really a micro-optimization that shouldn't have that much impact on performance. You have to benchmark it to see if there's any improvement at all.
(Also, gcc 12.2 didn't optimize this code as well, which is why I went with clang for this example.)
Godbolt link: https://godbolt.org/z/ja4zerj7o
There isn't a more "efficient" way to handle this case, you are already doing what you should.
The difference in timing you observe is because:
In the first case (Phi = funcA) the compiler knows the function will always be the same and is therefore able to optimize its calls. Depending on what your "computation code" does, this could mean inlining the function and simplifying a lot of calculations for you.
In the second case (Phi = <choice from user>) the compiler cannot know which function will be selected, and therefore cannot optimize any of the calls made to it by the rest of the code. It also cannot propagate optimizations to other parts of your "computation code" like in the first case.
In general, there isn't much you can do. Dynamic function pointers inherently add a bit of runtime overhead and make optimizations harder (or impossible).
What you could try is duplicating the "computation code" inside different functions or different branches that you only enter after establishing that Phi is equal to a constant, like so:
void computationFunctionSlow(Context *userInputs) {
    if (userInputs->funcEnum == A) {
        double (*const Phi)(double) = funcA;
        // computation code
    } else if (...) {
        // ...
    }
}
In the above piece of code, the compiler knows that inside each of those if blocks Phi can only have one value, and should therefore be able to perform the same optimizations discussed in point 1 above. A sketch of how such duplication could be kept manageable follows.
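For instance, one way to avoid literally copy-pasting the computation code is to put it in a static inline helper and call it once per constant; the helper name computeWith below is mine, a sketch of the idea rather than the asker's actual code:

static inline void computeWith(Context *userInputs, double (*Phi)(double))
{
    /* computation codes, calling Phi(...) in the hot loop */
}

void computationFunction(Context *userInputs)
{
    switch (userInputs->funcEnum) {
    case A: computeWith(userInputs, funcA); break;  /* Phi is a compile-time constant at each call site */
    case B: computeWith(userInputs, funcB); break;
    case C: computeWith(userInputs, funcC); break;
    }
}

If the helper is inlined, each call site gets its own specialized copy of the computation code with a known Phi.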
There's no need to put an enum in your userInputs when all you do with it is use it to select a function pointer. Just add the function pointer in the structure directly and eliminate the branching done on every call.
Instead of
struct Context
{
.
.
.
enum funcType funcEnum;
};
use
struct Context
{
.
.
.
double (*phi)(double);
};
You'd wind up with something like this:
void computationFunctionSlow(Context *userInputs) {
/* computation codes */
double result = userInputs->phi( data );
}
It is said that returning an oversized struct by value (as opposed to returning a pointer to the struct) from a function incurs an unnecessary copy on the stack. By "oversized", I mean a struct that cannot fit in the return registers.
However, to quote Wikipedia
When an oversized struct return is needed, another pointer to a caller-provided space is prepended as the first argument, shifting all other arguments to the right by one place.
and
When returning struct/class, the calling code allocates space and passes a pointer to this space via a hidden parameter on the stack. The called function writes the return value to this address.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
If the function inlines, the copying through the return-value object can be fully optimized away. Otherwise, maybe not, and arg copying definitely can't be.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
It depends on what the caller does with the return value; if it's assigned to a provably private object (escape analysis), that object can be the return-value object, passed as the hidden pointer.
But if the caller actually wants to assign the return value to other memory, then it does need a temporary.
struct large retval = some_func(); // no extra copying at all
*p = some_func(); // caller will make space for a local return-value object & copy.
(Unless the compiler knows that p is just pointing to a local struct large tmp;, and escape analysis can prove that there's no way some global variable could have a pointer to that same tmp var.)
long version, same thing with more details:
In the C abstract machine, there's a "return value object", and return foo copies the named variable foo to that object, even if it's a large struct. Or return (struct lg){1,2}; copies an anonymous struct. The return-value object itself is anonymous; nothing can take its address. (You can't int *p = &foo(123);). This makes it easier to optimize away.
In the caller, that anonymous return-value object can be assigned to whatever you want, which would be another copy if compilers didn't optimize anything. (All of this applies for any type, even int). Of course, compilers that aren't total garbage will avoid some, ideally all, of that copying, when doing so can't possibly change the observable results. And that depends on the design of the calling convention. As you say, most conventions, including all the mainstream x86 and x86-64 conventions, pass a "hidden pointer" arg for return values they choose not to return in register(s) for whatever reason (size, C++ having a non-trivial constructor).
struct large retval = foo(...);
For such calling conventions, the above code is effectively transformed to
struct large retval;
foo(&retval, ...);
So its C return-value object actually is a local in the stack-frame of its caller. foo() is allowed to store into that return-value object whenever it wants during execution, including before reading some other objects. This allows optimization within the callee (foo) as well, so a struct large tmp = ... / return tmp can be optimized away to just store into the return-value object.
So there's zero extra copying when the caller does just want to assign the function return value to a newly declared local var. (Or to a local var which it can prove is still private, via escape analysis. i.e. not pointed-to by any global vars).
But what if the caller wants to store the return value somewhere else?
void caller2(struct large *lgp) {
*lgp = foo();
}
Can *lgp be the return-value object, or do we need to introduce a local temporary?
void caller2(struct large *lgp) {
// foo_asm(lgp); // nope, possibly unsafe
struct large retval; foo(&retval); *lgp = retval; // safe
}
If you want functions to be able to write large structs to arbitrary locations, you have to "sign off" on it by making that effect visible in your source.
See What prevents the usage of a function argument as hidden pointer? for more details about why *lgp can't be the return-value object / hidden pointer, and another example. "A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else". It also covers whether struct large *restrict lgp would make it safe: probably yes if the function doesn't longjmp (otherwise stores to the supposedly anonymous retval object might become visible side effects without the return having been reached), but GCC doesn't look for that optimization.
Why is tailcall optimization not performed for types of class MEMORY? - return bar() where bar returns the same struct should be possible as an optimized tailcall, but isn't. This can even introduce extra copying of the whole struct, as well as failing to optimize call bar / ret into jmp bar.
how c compiler treats a struct return value from a function, in ASM - thresholds for returning in registers. e.g. i386 System V always returns structs in memory, even struct {int x;};.
Is it possible within a function to get the memory address of the variable initialized by the return value?
C/C++ returning struct by value under the hood an actual example (but unfortunately using debug-mode compiler-generated asm, so it contains copying that isn't necessary).
How do objects work in x86 at the assembly level? example at the bottom of how x86-64 System V packs the bytes of a struct into RDX:RAX, or just RAX if less than 8 bytes.
An example showing early stores to the return-value object (instead of copying)
(all source + asm on the Godbolt compiler explorer)
// more or less extra size will get compilers to copy it around with SSE2 or not
struct large { int first, second; char pad[0];};
int *global_ptr;
extern int a;
NOINLINE // __attribute__((noinline))
struct large foo() {
    struct large tmp = {1,2};
    if (a)
        tmp.second = *global_ptr;
    return tmp;
}
(targeting GNU/Linux) clang -m32 -O3 -mregparm=1 creates an implementation that writes its return-value object before it's done reading everything else, exactly the case that would make it unsafe for the caller to pass a pointer to some globally-reachable memory.
The asm makes it clear that tmp is fully optimized away, or is the retval object.
# clang -O3 -m32 -mregparm=1
foo:
mov dword ptr [eax + 4], 2
mov dword ptr [eax], 1 # store tmp into the retval object
cmp dword ptr [a], 0
je .LBB0_2 # if (a == 0) goto ret
mov ecx, dword ptr [global_ptr] # load the global
mov ecx, dword ptr [ecx] # deref it
mov dword ptr [eax + 4], ecx # and store to the retval object
.LBB0_2:
ret
(-mregparm=1 means pass the first arg in EAX, less noisy and easier to quickly visually distinguish from stack space than passing on the stack. Fun fact: i386 Linux compiles the kernel with -mregparm=3. But fun fact #2: if a hidden pointer is passed on the stack (i.e. no regparm), that arg is callee pops, unlike the rest. The function will use ret 4 to do ESP+=4 after popping the return address into EIP.)
In a simple caller, the compiler just reserves some stack space, passes a pointer to it, and then can load member variables from that space.
int caller() {
    struct large lg = {4, 5}; // initializer is dead, foo can't read its retval object
    lg = foo();
    return lg.second;
}
caller:
sub esp, 12
mov eax, esp
call foo
mov eax, dword ptr [esp + 4]
add esp, 12
ret
But with a less trivial caller:
int caller() {
    struct large lg = {4, 5};
    global_ptr = &lg.first;
    // unknown(&lg); // or this: as a side effect, might set global_ptr = &tmp->first;
    lg = foo(); // (except by inlining) the compiler can't know if foo() looks at global_ptr
    return lg.second;
}
caller:
sub esp, 28 # reserve space for 2 structs, and alignment
mov dword ptr [esp + 12], 5
mov dword ptr [esp + 8], 4 # materialize lg
lea eax, [esp + 8]
mov dword ptr [global_ptr], eax # point global_ptr at it
lea eax, [esp + 16] # hidden first arg *not* pointing to lg
call foo
mov eax, dword ptr [esp + 20] # reload from the retval object
add esp, 28
ret
Extra copying with *lgp = foo();
int caller2(struct large *lgp) {
    global_ptr = &lgp->first;
    *lgp = foo();
    return lgp->second;
}
# with GCC11.1 this time, SSE2 8-byte copying unlike clang
caller2: # incoming arg: struct large *lgp in EAX
push ebx #
mov ebx, eax # lgp, tmp89 # lgp needed after foo returns
sub esp, 24 # reserve space for a retval object (and waste 16 bytes)
mov DWORD PTR global_ptr, eax # global_ptr, lgp
lea eax, [esp+8] # hidden pointer to the retval object
call foo #
movq xmm0, QWORD PTR [esp+8] # 8-byte copy of both halves
movq QWORD PTR [ebx], xmm0 # *lgp_2(D), tmp86
mov eax, DWORD PTR [ebx+4] # lgp_2(D)->second, lgp_2(D)->second # reload int return value
add esp, 24
pop ebx
ret
The copy to *lgp needs to happen, but it's somewhat of a missed optimization to reload from there, instead of from [esp+12]. (Saves a byte of code size at the cost of more latency.)
Clang does the copy with two 4-byte integer register mov loads/stores, but one of them is into EAX so it already has the return value ready.
You might also want to look at the result of assigning to memory freshly allocated with malloc. Compilers know that nothing else can (legally) be pointing to the newly allocated memory: that would be use-after-free undefined behaviour. So they may allow passing on a pointer from malloc as the return-value object if it hasn't been passed to anything else yet.
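A minimal illustration of that case, reusing struct large and foo() from above (the caller3/p names are mine, not from the answer):

#include <stdlib.h>

void caller3(void)
{
    struct large *p = malloc(sizeof *p);  // freshly allocated: nothing else can legally point here yet
    if (!p) return;
    *p = foo();   // the compiler may pass p itself as the hidden return-value pointer
    /* ... use *p ... */
    free(p);
}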
Related fun fact: passing large structs by value always requires a copy (if the function doesn't inline). But as discussed in comments, the details depend on the calling convention. Windows differs from i386 / x86-64 System V calling conventions (all non-Windows OSes) on this:
SysV calling conventions copy the whole struct to the stack. (if they're too large to fit in a pair of registers for x86-64)
Windows x64 makes a copy and passes (like a normal arg) a pointer to that copy. The callee "owns" the arg and can modify it, so a tmp copy is still needed. (And no, const struct large foo has no effect.)
https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows.
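Relatedly, if that argument copy matters in your own code, the usual workaround (my example, not something the ABI discussion above requires) is to pass a pointer to const instead of the struct itself:

struct big { double vals[64]; };

/* by value: the whole 512-byte struct is copied for the call (exact mechanism depends on the ABI) */
double sum_byval(struct big b)
{
    double s = 0;
    for (int i = 0; i < 64; i++) s += b.vals[i];
    return s;
}

/* by pointer to const: only a pointer is passed; no struct copy */
double sum_byptr(const struct big *b)
{
    double s = 0;
    for (int i = 0; i < 64; i++) s += b->vals[i];
    return s;
}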
This really depends on your compiler, but in general the way this works is that the caller allocates the memory for the struct return value, but the callee also allocates stack space for any intermediate value of that structure. This intermediate allocation is used when the function is running, and then the struct is copied onto the caller's memory when the function returns.
For reference as to why your solution won't always work, consider a program which has two of the same struct and returns one based on some condition:
large_t returntype(int condition) {
    large_t var1 = {5};
    large_t var2 = {6};
    // More intermediate code here
    if(condition) return var1;
    else return var2;
}
In this case, both may be required by the intermediate code, but the return value is not known at compile time, so the compiler doesn't know which to initialize on the caller's stack space. It's easier to just keep it local and copy on return.
EDIT: What you describe may happen in simple functions, but it really depends on the optimizations performed by each individual compiler. If you're really interested in this, check out https://godbolt.org/
I'm writing C code for an embedded system. In this system, there are memory mapped registers at some fixed address in the memory map, and of course some RAM where my data segment / heap is.
I'm finding problems generating optimal code when my code is intermixing accesses to global variables in the data segment and accesses to hardware registers. This is a simplified snippet:
#include <stdint.h>
uint32_t * const restrict HWREGS = 0x20000;

struct {
    uint32_t a, b;
} Context;

void example(void) {
    Context.a = 123;
    HWREGS[0x1234] = 5;
    Context.b = Context.a;
}
This is the code generated on x86 (see also on godbolt):
example:
mov DWORD PTR Context[rip], 123
mov DWORD PTR ds:149712, 5
mov eax, DWORD PTR Context[rip]
mov DWORD PTR Context[rip+4], eax
ret
As you can see, after having written the hardware register, Context.a is reloaded from RAM before being stored into Context.b. This doesn't make sense because Context is at a different memory address than HWREGS. In other words, the memory pointed to by HWREGS and the memory pointed to by &Context do not alias, but it looks like there is no way to tell that to the compiler.
If I change HWREGS definition as this:
extern uint32_t * const restrict HWREGS;
that is, I hide the fixed memory address to the compiler, I get this:
example:
mov rax, QWORD PTR HWREGS[rip]
mov DWORD PTR [rax+18640], 5
movabs rax, 528280977531
mov QWORD PTR Context[rip], rax
ret
Context:
.zero 8
Now the two writes to Context are optimized (even coalesced into a single write), but on the other hand the access to the hardware register no longer happens with a direct memory access; it goes through a pointer indirection.
Is there a way to obtain optimal code here? I would like GCC to know that HWREGS is at a fixed memory address and at the same time to tell it that it does not alias Context.
If you want to avoid compilers regularly reloading values from a memory region (possibly due to aliasing), then the best approach is not to use global variables, or at least not to access them directly. The register keyword seems to be ignored for global variables (especially here on HWREGS) by both GCC and Clang. Using the restrict keyword on function parameters solves this problem:
#include <stdint.h>
uint32_t * const HWREGS = 0x20000;

struct Context {
    uint32_t a, b;
} context;

static inline void exampleWithLocals(uint32_t* restrict localRegs, struct Context* restrict localContext) {
    localContext->a = 123;
    localRegs[0x1234] = 5;
    localContext->b = localContext->a;
}

void example() {
    exampleWithLocals(HWREGS, &context);
}
Here is the result (see also on godbolt):
example:
movabs rax, 528280977531
mov DWORD PTR ds:149712, 5
mov QWORD PTR context[rip], rax
ret
context:
.zero 8
Please note that the strict aliasing rule does not help in this case since the type of the read/written variables/fields is always uint32_t.
Besides this, based on its name, the variable HWREGS looks like it points to hardware registers. Please note that the pointed-to data should be declared volatile so that the compiler does not keep it in registers or perform any similar optimization (like assuming the pointed-to value is left unchanged if the code does not change it).
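A sketch of what that might look like combined with the fixed address; the cast and the exact qualifier placement here are my additions, not code from the question:

#include <stdint.h>

/* volatile: every access really happens, in program order, and nothing is cached in a register */
volatile uint32_t * const HWREGS = (volatile uint32_t *)0x20000;

void write_reg(void)
{
    HWREGS[0x1234] = 5;   /* this store cannot be removed or reordered with other volatile accesses */
}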
This is for C, if the language matters. At the assembly level, negating a value is done using two's complement, while the other version just stores the value 0 into the int variable; I'm not entirely sure what happens under the hood.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the beneath code and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;

i = 0;
while (i < 999999999)
{
    n = 0;
    i++;
    n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the beneath code.
int i;
int n;

i = 0;
n = 5;
while (i < 999999999)
{
    n *= -1;
    i++;
    n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that using the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice.
If it seems pointless, the reason for it is because I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked(But I need to keep the value, so when I do the check, I can just -1 whatever I'm comparing it to). But I was wondering if that's too slow, I could make a separate boolean 2D array to check if the value was checked or not, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be-- doing the -1 operation itself a billion times will already total up to around 5 seconds not counting the rest of my program. But making a separate 1 billion square int array or creating a billion square struct doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
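If you do go the bitmap route, a minimal sketch of the helpers could look like this (the function names and the choice of uint64_t words are mine, not from the question):

#include <stdint.h>
#include <stdlib.h>

/* one bit per entry, packed into 64-bit words */
static uint64_t *make_bitmap(size_t nbits)
{
    return calloc((nbits + 63) / 64, sizeof(uint64_t));  /* starts all "unchecked" */
}

static void set_checked(uint64_t *bm, size_t i)
{
    bm[i / 64] |= (uint64_t)1 << (i % 64);
}

static int is_checked(const uint64_t *bm, size_t i)
{
    return (bm[i / 64] >> (i % 64)) & 1;
}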
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
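If you do want a toy loop like this to survive optimization so that you can time it at -O2, one minimal option (my own sketch, not from the question) is to make n volatile and keep the result observable:

#include <stdio.h>

int main(void)
{
    volatile int n = 5;              /* volatile forces a real store and reload each iteration */
    for (long i = 0; i < 999999999; i++) {
        n *= -1;
        n++;
    }
    printf("%d\n", n);               /* keeps the final value observable */
    return 0;
}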
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.
I am searching for a faster method of accomplishing this:
int is_empty(char * buf, int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}
I realize I'm searching for a micro optimization unnecessary except in extreme cases, but I know a faster method exists, and I'm curious what it is.
On many architectures, comparing 1 byte takes the same amount of time as 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to e.g., use a vector unit.
Also, branch mispredictions really hurt, so it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bitwise-OR them together and test the final result.
Expressing this is difficult in portable C: casting a char* to long* violates strict aliasing. But fortunately you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will optimize it to the asm you want.
For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) from loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2, because many CPUs will set flags according to an OR result, so this lets you check two words for the price of one.
Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
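Here is a minimal sketch of that idea, simpler than the work-in-progress Godbolt version above: it ORs one word per iteration (no early exit, which is fine if the buffer is almost always empty) and uses memcpy for the unaligned loads; the function name and word type are my choices:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

int is_all_zero(const char *buf, size_t size)
{
    uint_fast32_t acc = 0;
    size_t i = 0;

    /* word at a time: memcpy expresses an unaligned load without violating strict aliasing */
    for (; i + sizeof(uint_fast32_t) <= size; i += sizeof(uint_fast32_t)) {
        uint_fast32_t w;
        memcpy(&w, buf + i, sizeof w);
        acc |= w;                       /* OR instead of branching on every word */
    }
    for (; i < size; i++)               /* leftover tail bytes */
        acc |= (unsigned char)buf[i];

    return acc == 0;
}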
One potential way, inspired by Kieveli's dismissed idea:
int is_empty(char *buf, size_t size)
{
    static const char zero[999] = { 0 };
    return !memcmp(zero, buf, size > 999 ? 999 : size);
}
Note that you can't make this solution work for arbitrary sizes. You could do this:
int is_empty(char *buf, size_t size)
{
    char *zero = calloc(size, 1);
    int i = memcmp(zero, buf, size);
    free(zero);
    return !i;
}
But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution is faster is because it can use memcmp(), which is going to be hand-optimized in assembly language by the library writers and will be much faster than anything you could code in C.
EDIT: An optimization no one else has mentioned, based on earlier observations about the likelihood of the buffer being in state X: if a buffer isn't empty, is it more likely to be non-empty at the beginning or at the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.
EDIT 2: Thanks to Accipitridae in the comments:
int is_empty(char *buf, size_t size)
{
    return buf[0] == 0 && !memcmp(buf, buf + 1, size - 1);
}
This basically compares the buffer to itself, shifted by one element, with an initial check to see if the first element is zero. That way, any non-zero elements will cause memcmp() to fail. I don't know how this would compare to the other versions, but I do know that it will fail quickly (before we even loop) if the first element is nonzero. If you're more likely to have cruft at the end, change buf[0] to buf[size-1] to get the same effect.
The benchmarks given above (https://stackoverflow.com/a/1494499/2154139) are not accurate. They imply that func3 is much faster than the other options.
However, if you change the order of the tests, so that func3 comes before func2, you'd see func2 is much faster.
Careful when running combination benchmarks within a single execution... the side effects are large, especially when reusing the same variables. Better to run the tests isolated!
For example, changing it to:
int main(){
    MEASURE( func3 );
    MEASURE( func3 );
    MEASURE( func3 );
    MEASURE( func3 );
    MEASURE( func3 );
}
gives me:
func3: zero 14243
func3: zero 1142
func3: zero 885
func3: zero 848
func3: zero 870
This was really bugging me as I couldn't see how func3 could perform so much faster than func2.
(Apologies for posting this as an answer rather than a comment; I didn't have the reputation.)
Four functions for testing zeroness of a buffer with simple benchmarking:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <inttypes.h>
#define SIZE (8*1024)
char zero[SIZE] __attribute__(( aligned(8) ));
#define RDTSC(var) __asm__ __volatile__ ( "rdtsc" : "=A" (var));
#define MEASURE( func ) { \
uint64_t start, stop; \
RDTSC( start ); \
int ret = func( zero, SIZE ); \
RDTSC( stop ); \
printf( #func ": %s %12"PRIu64"\n", ret?"non zero": "zero", stop-start ); \
}
int func1( char *buff, size_t size ){
    while(size--) if(*buff++) return 1;
    return 0;
}

int func2( char *buff, size_t size ){
    return *buff || memcmp(buff, buff+1, size-1);
}

int func3( char *buff, size_t size ){
    return *(uint64_t*)buff || memcmp(buff, buff+sizeof(uint64_t), size-sizeof(uint64_t));
}

int func4( char *buff, size_t size ){
    return *(wchar_t*)buff || wmemcmp((wchar_t*)buff, (wchar_t*)buff+1, size/sizeof(wchar_t)-1);
}

int main(){
    MEASURE( func1 );
    MEASURE( func2 );
    MEASURE( func3 );
    MEASURE( func4 );
}
Result on my old PC:
func1: zero 108668
func2: zero 38680
func3: zero 8504
func4: zero 24768
If your program is x86-only or x64-only, you can easily optimize using inline assembler. The REPE SCASD instruction will scan a buffer until a non-EAX dword is found.
Since there is no equivalent standard library function, no compiler/optimizer will probably be able to use these instructions (as confirmed by Sufian's code).
Off the top of my head, something like this would do if your buffer length is 4-byte aligned (MASM syntax):
_asm {
    CLD               ; search forward
    XOR EAX, EAX      ; search for non-zero
    LEA EDI, [buf]    ; search in buf
    MOV ECX, [buflen] ; search buflen bytes
    SHR ECX, 2        ; using dwords so len/=4
    REPE SCASD        ; perform scan
    JCXZ bufferEmpty  ; completed? then buffer is 0
}
Tomas
EDIT: updated with Tony D's fixes
For something so simple, you'll need to see what code the compiler is generating.
$ gcc -S -O3 -o empty.s empty.c
And the contents of the assembly:
.text
.align 4,0x90
.globl _is_empty
_is_empty:
pushl %ebp
movl %esp, %ebp
	movl	12(%ebp), %edx	; edx = size
	movl	8(%ebp), %ecx	; ecx = pointer to buffer
testl %edx, %edx
jle L3
xorl %eax, %eax
cmpb $0, (%ecx)
jne L5
.align 4,0x90
L6:
incl %eax ; real guts of the loop are in here
cmpl %eax, %edx
je L3
cmpb $0, (%ecx,%eax) ; compare byte-by-byte of buffer
je L6
L5:
leave
xorl %eax, %eax
ret
.align 4,0x90
L3:
leave
movl $1, %eax
ret
.subsections_via_symbols
This is very optimized. The loop does three things:
Increase the offset
Compare the offset to the size
Compare the byte-data in memory at base+offset to 0
It could be optimized slightly more by comparing at a word-by-word basis, but then you'd need to worry about alignment and such.
When all else fails, measure first, don't guess.
Try checking the buffer using an int-sized variable where possible (it should be aligned).
Off the top of my head (uncompiled, untested code follows - there's almost certainly at least one bug here. This just gives the general idea):
/* check the start of the buf byte by byte while it's unaligned */
while (size && !int_aligned( buf)) {
    if (*buf != 0) {
        return 0;
    }
    ++buf;
    --size;
}

/* check the bulk of the buf int by int while it's aligned */
size_t n_ints = size / sizeof( int);
size_t rem = size % sizeof( int);
int* pInts = (int*) buf;
while (n_ints) {
    if (*pInts != 0) {
        return 0;
    }
    ++pInts;
    --n_ints;
}

/* now wrap up the remaining unaligned part of the buf byte by byte */
buf = (char*) pInts;
while (rem) {
    if (*buf != 0) {
        return 0;
    }
    ++buf;
    --rem;
}

return 1;
With x86 you can use SSE to test 16 bytes at a time:
#include "smmintrin.h" // note: requires SSE 4.1
int is_empty(const char *buf, const size_t size)
{
size_t i;
for (i = 0; i + 16 <= size; i += 16)
{
__m128i v = _mm_loadu_si128((m128i *)&buf[i]);
if (!_mm_testz_si128(v, v))
return 0;
}
for ( ; i < size; ++i)
{
if (buf[i] != 0)
return 0;
}
return 1;
}
This can probably be further improved with loop unrolling.
On modern x86 CPUs with AVX you can even use 256 bit SIMD and test 32 bytes at a time.
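For example, an AVX version along the same lines as the SSE code above might look like this (an untested sketch; it requires AVX support and keeps the scalar tail loop):

#include <immintrin.h>   // AVX intrinsics
#include <stddef.h>

int is_empty_avx(const char *buf, size_t size)
{
    size_t i = 0;

    for (; i + 32 <= size; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)&buf[i]);
        if (!_mm256_testz_si256(v, v))   /* vptest: ZF is set only if all 256 bits are zero */
            return 0;
    }

    for (; i < size; ++i)
    {
        if (buf[i] != 0)
            return 0;
    }

    return 1;
}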
The Hacker's Delight book/site is all about optimized C/assembly. There are lots of good references from that site as well, and it is fairly up to date (AMD64, NUMA techniques also).
Look at fast memcpy - it can be adapted for memcmp (or memcmp against a constant value).
I see a lot of people saying things about alignment issues preventing you from doing word-sized accesses, but that's not always true. If you're looking to write portable code, then this is certainly an issue; however, x86 will actually tolerate misaligned accesses. For example, this will only fail on x86 if alignment checking is turned on in EFLAGS (and of course buf is actually not word aligned).
int is_empty(char * buf, int size) {
    int i;
    for(i = 0; i + 4 <= size; i += 4) {
        if(*(int *)(buf + i) != 0) {
            return 0;
        }
    }
    for(; i < size; i++) {
        if(buf[i] != 0)
            return 0;
    }
    return 1;
}
Regardless, the compiler CAN convert your original loop into a loop of word-based comparisons with extra jumps to handle alignment issues; however, it will not do this at any normal optimization level because it lacks information. For cases where size is small, unrolling the loop in this way would make the code slower, and the compiler wants to be conservative.
A way to get around this is to make use of profile-guided optimization. If you let GCC gather profile information on the is_empty function and then re-compile it, it will be willing to unroll the loop into word-sized comparisons with an alignment check. You can also force this behavior with -funroll-all-loops.
Did anyone mention unrolling the loop? In any of these loops, the loop overhead and indexing is going to be significant.
Also, what is the probability that the buffer will actually be empty? That's the only case where you have to check all of it.
If there typically is some garbage in the buffer, the loop should stop very early, so it doesn't matter.
If you plan to clear it to zero if it's not zero, it would probably be faster just to clear it with memset(buf, 0, size), whether or not it's already zero.
What about looping from size to zero (cheaper checks):
int is_empty(char * buf, int size)
{
    while(size --> 0) {
        if(buf[size] != 0) return 0;
    }
    return 1;
}
It must be noted that we probably cannot outperform the compiler, so enable the most aggressive speed optimization in your compiler and assume that you're likely to not go any faster.
Or handling everything using pointers (not tested, but likely to perform quite well):
int is_empty(char* buf, int size)
{
    char* org = buf;

    if (buf[size-1] != 0)
        return 0;

    buf[size-1] = 1;       /* sentinel so the scan below always terminates */
    while(! *buf++);
    buf--;
    org[size-1] = 0;       /* restore the byte used as a sentinel */

    return buf == &org[size-1];
}
You stated in your question that you are looking for a most likely unnecessary micro-optimization. In 'normal' cases the ASM approach by Thomas and others should give you the fastest results.
Still, this is forgetting the big picture. If your buffer is really large, then starting from the beginning and essentially doing a linear search is definitely not the fastest way to do this. Assume your cp replacement is quite good at finding large consecutive empty regions but has a few non-empty bytes at the end of the array. All linear searches would require reading the whole array. On the other hand, a divide-and-conquer algorithm inspired by quicksort could probe for non-zero elements and abort much faster for a large enough dataset.
So before doing any kind of micro-optimization I would look closely at the data in your buffer and see if that gives you any patterns. For a single '1' randomly distributed in the buffer, a linear search (disregarding threading/parallelization) will be the fastest approach; in other cases not necessarily so.
Inline assembly version of the initial C code (no error checking: if uiSize == 0 and/or the array is not allocated, exceptions will be generated). Perhaps use try {} catch(), as this might be faster than adding a lot of checks to the code, or do as I do and try not to call functions with invalid values (which usually does not work). At the very least add a NULL pointer check and a size != 0 check; that is very easy.
unsigned int IsEmpty(char* pchBuffer, unsigned int uiSize)
{
    __asm {
        push edi
        push ecx
        mov edi, [pchBuffer]
        mov ecx, [uiSize]
        // add NULL ptr and size check here
        xor eax, eax          // scan for bytes equal to AL == 0
        cld                   // scan forward
        repe scasb            // repeat as long as BYTE PTR es:[EDI] == AL
                              // scasb does pointer arithmetic for BYTES (chars), ie it compares a byte with al and increments EDI by 1
        je all_chars_zero     // ZF still set? then every byte compared equal to 0
        mov eax, 0            // a non-zero byte interrupted the scan
        jmp done
    all_chars_zero:
        mov eax, 1            // Set return value (works in MASM)
    done:
        pop ecx
        pop edi
    }
}
That was written on the fly, but it looks correct enough; corrections are welcome. And if someone knows how to set the return value from inline asm, please do tell.
int is_empty(char * buf, int size)
{
    int i, content = 0;

    for(i = 0; !content && i < size; i++)
    {
        content = content | buf[i]; // bitwise or
    }

    return (content == 0);
}
int is_empty(char * buf, int size)
{
    return buf[0] == '\0';
}
If your buffer is a character string (so "empty" means a zero-length string), I think that's the fastest way to check...
memcmp() would require you to create a buffer the same size and then use memset to set it all as 0. I doubt that would be faster...
Edit: Bad answer
A novel approach might be
int is_empty(char * buf, int size) {
    char start = buf[0];
    char end = buf[size-1];

    buf[0] = 'x';
    buf[size-1] = '\0';

    int result = strlen(buf) == 0;

    buf[0] = start;
    buf[size-1] = end;

    return result;
}
Why the craziness? Because strlen is one of the library functions that's more likely to be optimized.
Storing and replacing the first character is to prevent the false positive. Storing and replacing the last character is to make sure it terminates.
The initial C algorithm is pretty much as slow as it can be in VALID C.
If you insist on using C then try a "while" loop instead of "for":
int i = 0;

while (i < MAX)
{
    // operate on the string
    i++;
}
This is pretty much the fastest one-dimensional string operation loop you can write in C. You might also try to force the compiler to put i in a register with the "register" keyword, but I am told that this is almost always ignored by modern compilers.
Also, searching a constant-sized array to check whether it is empty is very wasteful, and 0 is not "empty"; it is a value in the array.
A better solution for speed would be to use a dynamic array (int* piBuffer) and a variable that stores the current size (unsigned int uiBufferSize); when the array is empty, the pointer is NULL and uiBufferSize is 0. Make a class with these two as protected member variables. One could also easily write a template for dynamic arrays, which would store 32-bit values, either primitive types or pointers. For primitive types there is not really any way to test for "empty" (I interpret this as "undefined"), but you can of course define 0 to represent an available entry. For an array of pointers you should initialize all entries to NULL, and set an entry to NULL when you have just deallocated that memory; NULL means "points at nothing", so this is a very convenient way to represent empty. One should not use dynamically resized arrays in really complicated algorithms, at least not in the development phase; there are simply too many things that can go wrong. One should at least first implement the algorithm using an STL container (or a well tested alternative) and then, when the code works, swap the tested container for a simple dynamic array (and if you can avoid resizing the array too often, the code will be both faster and more fail-safe).
A better solution for complicated and cool code is to use either std::vector or std::map (or any container class, STL, homegrown or 3rd party) depending on your needs, but looking at your code I would say that std::vector is enough. The STL containers are templates, so they should be pretty fast too. Use an STL container to store object pointers (always store object pointers and not the actual objects; copying entire objects for every entry will really mess up your execution speed) and dynamic arrays for more basic data (bitmaps, sound etc.), i.e. primitive types, generally.
I came up with the REPE SCASW solution independently by studying x86 assembly language manuals, and I agree that the example using this string operation instruction is the fastest. The other assembly example, which has separate compare, jump etc. instructions, is almost certainly slower (but still much faster than the initial C code, so still a good post), as the string operations are among the most highly optimized on all modern CPUs; they may even have their own logic circuitry (anyone know?).
The REPE SCASD does not need to fetch a new instruction nor increase the instruction pointer, and that is just the stuff an assembly novice like me can come up with; on top of that is the hardware optimization. String operations are critical for almost all kinds of modern software, in particular multimedia applications (copying PCM sound data, uncompressed bitmap data, etc.), so optimizing these instructions must have been a very high priority every time a new 80x86 chip was being designed.
I use it for a novel 2d sprite collision algorithm.
It says that I am not allowed to have an opinion, so consider the following an objective assessment: modern compilers (unmanaged C/C++; pretty much everything else is managed code and is slow as hell) are pretty good at optimizing, but it cannot be avoided that for very specific tasks the compiler generates redundant code. One could look at the assembly that the compiler outputs so that one does not have to translate a complicated algorithm entirely from scratch, even though it is very fun to do (for some) and it is much more rewarding doing code the hard way. Anyway, algorithms using "for" loops, in particular with regard to string operations, can often be optimized very significantly, as the for loop generates a lot of code that is often not needed. Example:
for (int i = 1000; i > 0; i--) DoSomething(); This line generates 6-10 lines of assembly if the compiler is not very clever (it might be), but the optimized assembly version CAN be:
mov cx, 1000
_DoSomething:
// loop code....or call Func, slower but more readable
loop _DoSomething
That was 2 lines, and it does exactly the same as the C line (it uses registers instead of memory addresses, which is MUCH faster; arguably this is not EXACTLY the same as the C line, but that is semantics). How much of an optimization this example is depends on how well modern compilers optimize, which I have no clue about, but analyzing an algorithm with the goal of implementing it in the fewest and fastest assembly lines often works well. I have had very good results with first implementing the algorithm in C/C++ without caring about optimization and then translating and optimizing it in assembly. The fact that each C line becomes many assembly lines often makes some optimizations very obvious, and also some instructions are faster than others:
INC DX ; is faster than:
ADD DX,1 ;if ADD DX,1 is not just replaced with INC DX by the assembler or the CPU
LOOP ; is faster than manually decreasing, comparing and jumping
REPxx STOSx/MOVSx/LODSx is faster than using cmp, je/jne/jea etc and loop
JMP or conditional jumping is faster than using CALL, so in a loop that is executed VERY frequently (like rendering), including functions in the code so it is accessible with "local" jumps can also boost performance.
The last bit is very relevant for this question, fast string operations.
So this post is not all rambling.
And lastly, design you assembly algorithm in the way that requires the least amount of jumps for a typical execution.
Also, don't bother optimizing code that is not called that often; use a profiler, see what code is called most often, and start with that. Anything that is called less than 20 times a second (and completes much faster than 1000 ms / 20) is not really worth optimizing. Look at code that is not synchronized to timers and the like and is executed again immediately after it has completed. On the other hand, if your rendering loop can do 100+ FPS on a modest machine, it does not make sense economically to optimize it, but real coders love to code and do not care about economics; they optimize the AppStart() method into 100% assembly even though it is only called once :) Or use a z rotation matrix to rotate Tetris pieces 90 degrees :P Anyone who does that is awesome!
If anyone has some constructive correction, which is not VERY hurtful, then I would love to hear it; I code almost entirely by myself, so I am not really exposed to any influences. I once paid a nice Canadian game developer to teach me Direct3D, and though I could just as easily have read a book, the interaction with another coder who was somewhat above my level in certain areas was fun.
Thanks for good content generally. I think I will go and answer some of the simpler questions, give a little back.