possible to do if (!boolvar) { ... in 1 asm instruction? - c

This question is more out of curiosity than necessity:
Is it possible to rewrite the C code if ( !boolvar ) { ... in a way so that it compiles to one CPU instruction?
I've tried thinking about this on a theoretical level and this is what I've come up with:
if ( !boolvar ) { ...
would need to first negate the variable and then branch depending on that -> 2 instructions (negate + branch)
if ( boolvar == false ) { ...
would need to load the value of false into a register and then branch depending on that -> 2 instructions (load + branch)
if ( boolvar != true ) { ...
would need to load the value of true into a register and then branch ("branch-if-not-equal") depending on that -> 2 instructions (load + "branch-if-not-equal")
Am I wrong with my assumptions? Is there something I'm overlooking?
I know I can produce intermediate asm versions of programs, but I wouldn't know how to do that with compiler optimization turned on while also keeping an empty if statement from being optimized away (or from being optimized together with its contents, which would give a non-generic answer)
P.S.: Of course I also searched google and SO for this, but with such short search terms I couldn't really find anything useful
P.P.S.: I'd be fine with a semantically equivalent version which is not syntactically equivalent, e.g. not using if.
Edit: feel free to correct me if my assumptions about the emitted asm instructions are wrong.
Edit2: I've actually learned asm about 15yrs ago, and relearned it about 5yrs ago for the alpha architecture, but I hope my question is still clear enough to figure out what I'm asking. Also, you're free to assume any kind of processor extension common in consumer cpus up to AVX2 (current haswell cpu as of the time of writing this) if it helps in finding a good answer.

At the end of my post I will explain why you should not aim for this behaviour (on x86).
As Jerry Coffin has written, most jumps in x86 depend on the flags register.
There is one exception though: The j*cxz set of instructions which jump if the ecx/rcx register is zero. To achieve this you need to make sure that your boolvar uses the ecx register. You can achieve that by specifically assigning it to that register
register int boolvar asm ("ecx");
Not all compilers use the j*cxz set of instructions, though. There is a flag for icc to make it do that, but it is generally not advisable. The Intel manual states that the two instructions
test ecx, ecx
jz ...
are faster on the processor.
The reason for this is that x86 is a CISC (complex) instruction set. In the actual hardware, though, the processor will split up complex instructions that appear as one instruction in the asm into multiple microinstructions which are then executed in a RISC style. This is the reason why not all instructions require the same execution time and sometimes multiple small ones are faster than one big one.
test and jz are single microinstructions, but jecxz will be decomposed into those two anyway.
The only reason why the j*cxz set of instructions exist is if you want to make a conditional jump without modifying the flags register.
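For illustration, a minimal sketch of that register pinning in context (GCC syntax; compute() and do_something() are hypothetical helpers, and even with the pinning the compiler is free to emit test/jz rather than jecxz):
extern int compute(void);
extern void do_something(void);
void check(void)
{
    register int boolvar asm ("ecx") = compute(); /* pin boolvar to ECX */
    if (!boolvar)
        do_something();
}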

Yes, it's possible -- but doing so will depend on the context in which this code takes place.
Conditional branches in an x86 depend upon the values in the flags register. For this to compile down to a single instruction, some other code will already need to set the correct flag, so all that's left is a single instruction like jnz wherever.
For example:
boolvar = x == y;
if (!boolvar) {
do_something();
}
...could end up rendered as something like:
mov eax, x
cmp eax, y ; `boolvar = x == y;`
jz @f
call do_something
@@:
Depending on your viewpoint, it could even compile down to only part of an instruction. For example, quite a few instructions can be "predicated", so they're executed only if some previously defined condition is true. In this case, you might have one instruction for setting "boolvar" to the correct value, followed by one to conditionally call a function, so there's no one (complete) instruction that corresponds to the if statement itself.
Although you're unlikely to see it in decently written C, a single assembly language instruction could include even more than that. For an obvious example, consider something like:
x = 10;
looptop:
-- x;
boolvar = x == 0;
if (!boolvar)
goto looptop;
This entire sequence could be compiled down to something like:
mov ecx, 10
looptop:
loop looptop

Am I wrong with my assumptions
You are wrong in several assumptions. First, you should know that one instruction is not necessarily faster than multiple ones. For example, on newer µarchs test can macro-fuse with jcc, so two instructions run as one. And a division is so slow that, in the same time, tens or hundreds of simpler instructions may already have finished. Compiling the if block to a single instruction isn't worth it if that single instruction is slower than the multiple ones
Besides, if ( !boolvar ) { ... doesn't need to first negate the variable and then branch on the result. Most jumps in x86 are based on flags, and conditional jumps come in complementary pairs (jz/jnz, je/jne, ...), so no negation is needed: the compiler simply picks the jump-on-zero or jump-on-non-zero form as appropriate.
Similarly, if ( boolvar == false ) { ... doesn't need to load the value of false into a register and then branch on that. false is a constant equal to 0, which can be embedded as an immediate in the instruction (like cmp reg, 0). But for a check against zero a simple test reg, reg is enough. Then jz or jnz jumps on zero or non-zero, and it is macro-fused with the preceding test instruction into one.
It's possible to make an if header or body that compiles to a single instruction, but it depends entirely on what you need to do and which condition is used. The flags for boolvar may already be available from the previous statement, so the if block on the next line can use them to jump directly, as you can see in Jerry Coffin's answer.
Moreover, x86 has conditional moves, so if the body of the if is a simple assignment then it may be done in one instruction. Below is an example and its output.
int f(bool condition, int x, int y)
{
int ret = x;
if (!condition)
ret = y;
return ret;
}
f(bool, int, int):
test dil, dil ; if(!condition)
mov eax, edx ; ret = y
cmovne eax, esi ; if(condition) ret = x
ret
In some other cases you don't even need a conditional move or jump. For example,
bool f(bool condition)
{
bool ret = false;
if (!condition)
ret = true;
return ret;
}
compiles to a single xor without any jump at all
f(bool):
mov eax, edi
xor eax, 1
ret
The ARM architecture (v7 and below, in ARM mode) can execute most instructions conditionally, so such code may translate to only one instruction.
For example, the following loop
while (i != j)
{
if (i > j)
{
i -= j;
}
else
{
j -= i;
}
}
can be translated to ARM assembly as
loop: CMP Ri, Rj ; set condition "NE" if (i != j),
; "GT" if (i > j),
; or "LT" if (i < j)
SUBGT Ri, Ri, Rj ; if "GT" (Greater Than), i = i-j;
SUBLT Rj, Rj, Ri ; if "LT" (Less Than), j = j-i;
BNE loop ; if "NE" (Not Equal), then loop

Related

Is there any way to generate inline assembly programmatically?

In my program I need to insert NOP as inline assembly into a loop, and the number of NOPs can be controlled by an argument. Something like this:
char nop[] = "nop\nnop";
for(offset = 0; offset < CACHE_SIZE; offset += BLOCK_SIZE) {
asm volatile (nop
:
: "c" (buffer + offset)
: "rax");
}
Is there any way to tell compiler to convert the above inline assembly into the following?
asm volatile ("nop\n"
"nop"
:
: "c" (buffer + offset)
: "rax");
Well, there is this trick you can do:
#define NOPS(n) asm volatile (".fill %c0, 1, 0x90" :: "i"(n))
This macro inserts the desired number of nop instructions into the instruction stream. Note that n must be a compile time constant. You can use a switch statement to select different lengths:
switch (len) {
case 1: NOPS(1); break;
case 2: NOPS(2); break;
...
}
You can also do this for more code size economy:
if (len & 040) NOPS(040);
if (len & 020) NOPS(020);
if (len & 010) NOPS(010);
if (len & 004) NOPS(004);
if (len & 002) NOPS(002);
if (len & 001) NOPS(001);
Note that you should really consider using pause instructions instead of nop instructions for this sort of thing as pause is a semantic hint that you are just trying to pass time. This changes the definition of the macro to:
#define NOPS(n) asm volatile (".fill %c0, 2, 0x90f3" :: "i"(n))
No, the inline asm template needs to be compile-time constant, so the assembler can assemble it to machine code.
If you want a flexible template that you modify at run-time, that's called JIT compiling or code generation. You normally generate machine-code directly, not assembler source text which you feed to an assembler.
For example, see this complete example which generates a function composed of a variable number of dec eax instructions and then executes it. Code golf: The repetitive byte counter
BTW, dec eax runs at 1 per clock on all modern x86 CPUs, unlike NOP which runs at 4 per clock, or maybe 5 on Ryzen. See http://agner.org/optimize/.
A better choice for a tiny delay might be a pause instruction, or a dependency chain of some variable number of imul instructions, or maybe sqrtps, ending with an lfence to block out-of-order execution (at least on Intel CPUs). I haven't checked AMD's manuals to see if lfence is documented as being an execution barrier there, but Agner Fog reports it can run at 4 per clock on Ryzen.
But really, you probably don't need to JIT any code at all. For a one-off experiment that only has to work on one or a few systems, hack up a delay loop with something like
for (int i=0 ; i<delay_count ; i++) {
asm volatile("" :: "r" (i)); // defeat optimization: force i to be materialized in a register
}
This forces the compiler to have the loop counter in a register on every iteration, so it can't optimize the loop away, or turn it into a multiply. You should get compiler-generated asm like delayloop: dec eax; jnz delayloop. You might want to put _mm_lfence() after the loop.
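Putting that together, a self-contained sketch of such a delay helper (delay_count is a hypothetical tuning parameter, and the per-iteration cost is CPU-dependent):
#include <immintrin.h> /* for _mm_lfence() */
static void spin_delay(int delay_count)
{
    for (int i = 0; i < delay_count; i++) {
        asm volatile("" :: "r" (i)); /* keep i in a register so the loop isn't optimized away */
    }
    _mm_lfence(); /* optionally block later instructions from starting before the loop finishes (Intel semantics) */
}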

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. At the assembly level, negation is done using two's complement. With the other version, you're storing the value 0 in the int variable, and I'm not entirely sure what happens under the hood there.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that using the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering if that's too slow; I could make a separate boolean 2D array to check if the value was checked or not, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be: doing the -1 operation itself a billion times will already total up to around 5 seconds, not counting the rest of my program. But making a separate 1-billion-square int array or creating a billion-square struct doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
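If you do go the bitmap route, a minimal sketch might look like this (N_ENTRIES is a hypothetical size; one bit per entry instead of one int per entry):
#include <stddef.h>
#include <stdint.h>
#define N_ENTRIES 1000000u
static uint64_t checked_bits[(N_ENTRIES + 63) / 64];
static inline void mark_checked(size_t i) { checked_bits[i / 64] |= (uint64_t)1 << (i % 64); }
static inline int  is_checked(size_t i)   { return (int)((checked_bits[i / 64] >> (i % 64)) & 1); }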
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, because -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
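If you really do want to time the negate-and-increment work itself, a minimal sketch is to build with optimization enabled and keep the value live in a register with an empty asm statement (bench_negate and iters are hypothetical names):
int bench_negate(int iters)
{
    int n = 5;
    for (int i = 0; i < iters; i++) {
        n = -n;
        n++;
        asm volatile("" : "+r" (n)); /* force n to stay live in a register each iteration */
    }
    return n;
}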
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
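For instance, the "additional data" option can be as simple as the following sketch (field names are illustrative):
struct cell {
    int value;
    unsigned char checked; /* 0 = unchecked, 1 = checked */
};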
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

Working inline assembly in C for bit parity?

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.
Currently I'm using the following function (by @Troyseph, found here):
uint parity64(uint64 n){
n ^= n >> 1;
n ^= n >> 2;
n = (n & 0x1111111111111111) * 0x1111111111111111;
return (n >> 60) & 1;
}
The same SO page has the following assembly routine (by @papadp):
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).
Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)
In case it's of any help, the parity64() function is called inside the following routine:
uint bindot(uint64* a, uint64* b, uint64 entries){
uint parity = 0;
for(uint i=0; i<entries; ++i)
parity ^= parity64(a[i] & b[i]); // Running sum!
return parity;
}
(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)
This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:
xor eax, eax
test ecx, ecx ; logically, should use RCX, but see below for behavior of PF
jnp jmp_over
mov eax, 1 ; or possibly even "inc eax"; would need to verify
jmp_over:
ret
Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:
xor eax, eax
test ecx, ecx
setp al
ret
Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):
xor eax, eax
test ecx, ecx
mov ecx, 1
cmovp eax, ecx
ret
And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.
There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what we humans are capable of producing, given a finite amount of time and resources, much less limited expertise.
In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universalizable. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not at all surprise me that an optimizer would emit code that used PF. You'd have to look at the disassembly to see for sure.
As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:
uint32 SwapBytes(uint32 x)
{
return ((x << 24) & 0xff000000 ) |
((x << 8) & 0x00ff0000 ) |
((x >> 8) & 0x0000ff00 ) |
((x >> 24) & 0x000000ff );
}
The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
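For what it's worth, GCC and Clang also expose the operation directly as a built-in, so an equivalent sketch (same uint32 typedef assumed; MSVC has _byteswap_ulong instead) is:
uint32 SwapBytesBuiltin(uint32 x)
{
    return __builtin_bswap32(x); /* GCC/Clang built-in; compiles to BSWAP on x86 */
}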
I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:
Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?
In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:
Have I ensured that my "optimized" code is returning the correct results?
This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
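As a sketch of that "less work" idea, if the parity of the low byte really is all that's required (matching what PF reports), the C version could shrink to something like this; note it is deliberately not equivalent to parity64 for values wider than 8 bits:
uint parity8(uint64 n)
{
    unsigned b = (unsigned)(n & 0xff); /* only the low byte matters here */
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1;
}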
For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.
…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.
Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):
Benchmark Time CPU Iterations
-------------------------------------------------------------------
Naive 36 ns 36 ns 19478261
OriginalCCode 4 ns 4 ns 194782609
Ajay_Brahmakshatriya_Tweaked 4 ns 4 ns 194782609
Shreyas_Shivalkar 37 ns 37 ns 17920000
TypeIA 5 ns 5 ns 154482759
TypeIA_Tweaked 4 ns 4 ns 160000000
has_even_parity 227 ns 229 ns 3200000
has_even_parity_Tweaked 36 ns 36 ns 19478261
GCC_builtin_parityll 4 ns 4 ns 186666667
PopCount 3 ns 3 ns 248888889
PopCount_Downlevel 5 ns 5 ns 100000000
Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)
Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)
The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it to validate even your original C code, and also to provide a baseline for the benchmarks. Since it loops through each bit, one-by-one, it is relatively slow, as expected:
unsigned int Naive(uint64 n)
{
bool parity = false;
while (n)
{
parity = !parity;
n &= (n - 1);
}
return parity;
}
OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:
Ajay_Brahmakshatriya_Tweaked PROC
mov rax, rcx ; Windows x64 calling convention passes the parameter in RCX (System V uses RDI)
shr rax, 32
xor rcx, rax
mov rax, rcx
shr rax, 16
xor rcx, rax
mov rax, rcx
shr rax, 8
xor eax, ecx ; Ped7g's TEST is redundant; XOR already sets PF
setnp al
movzx eax, al
ret
Ajay_Brahmakshatriya_Tweaked ENDP
The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:
Shreyas_Shivalkar PROC
; unsigned int parity = 0;
; while (x != 0)
; {
; parity ^= x;
; x >>= 1;
; }
; return (parity & 0x1);
xor eax, eax
test rcx, rcx
je SHORT Finished
Process:
xor eax, ecx
shr rcx, 1
jne SHORT Process
Finished:
and eax, 1
ret
Shreyas_Shivalkar ENDP
TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.
unsigned int TypeIA(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n ^= n >> 2;
n ^= n >> 1;
return !((~n) & 1);
}
unsigned int TypeIA_Tweaked(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n &= 0xf;
return ((0x6996 >> n) & 1);
}
has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.
unsigned int has_even_parity(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
if (n & (b << i)) { ++count; }
}
return (count % 2);
}
has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:
unsigned int has_even_parity_Tweaked(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
count += static_cast<int>(static_cast<bool>(n & (b << i)));
}
return (count % 2);
}
Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.
GCC_builtin_parityll PROC ; GCC's __builtin_parityll
mov edx, ecx
shr rcx, 32
xor edx, ecx
mov eax, edx
shr edx, 16
xor eax, edx
xor al, ah
setnp al
movzx eax, al
ret
GCC_builtin_parityll ENDP
PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The definition of "parity" is an even number of set bits. Therefore, it can be calculated simply by counting the number of set bits and testing to see if that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.
(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this is a very cheap operation, one that can generally be handled for "free" by register renaming.)
PopCount PROC
xor eax, eax ; break false dependency
popcnt rax, rcx
and eax, 1
ret
PopCount ENDP
In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown below). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
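Expressed in C, a minimal sketch of that same count-the-bits idea; with GCC/Clang and -mpopcnt (or -march=native on a POPCNT-capable CPU) it compiles to essentially the xor/popcnt/and sequence above:
unsigned int PopCountParity(uint64 n)
{
    return (unsigned int)(__builtin_popcountll(n) & 1); /* 1 if an odd number of bits are set */
}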
But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.
PopCount_Downlevel PROC
mov rax, rcx
shr rax, 1
mov rdx, 5555555555555555h
and rax, rdx
sub rcx, rax
mov rax, 3333333333333333h
mov rdx, rcx
and rcx, rax
shr rdx, 2
and rdx, rax
add rdx, rcx
mov rcx, 0FF0F0F0F0F0F0F0Fh
mov rax, rdx
shr rax, 4
add rax, rdx
mov rdx, 0FF01010101010101h
and rax, rcx
imul rax, rdx
shr rax, 56
and eax, 1
ret
PopCount_Downlevel ENDP
As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:
unsigned int PopCount_Downlevel(uint64 n)
{
uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
temp = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
temp = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
temp = (temp * 0x0101010101010101ULL) >> 56;
return (temp & 1);
}
But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.
Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.
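For reference, a minimal sketch of that, reusing the question's uint/uint64 typedefs:
uint parity64_builtin(uint64 n)
{
    return (uint)__builtin_parityll(n); /* GCC/Clang built-in; returns 1 for odd parity */
}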
Because C sucks at handling bit operations, I suggest using GCC built-in functions, in this case __builtin_parityl(). See:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
You will have to use extended inline assembly (which is a GCC extension) to get a similar effect.
Your parity64 function can be changed as follows -
uint parity64_unsafe_and_broken(uint64 n){
uint result = 0;
__asm__("addq $0, %0" : : "r"(n) :);
// editor's note: compiler-generated instructions here can destroy EFLAGS
// Don't depend on FLAGS / regs surviving between asm statements
// also, jumping out of an asm statement safely requires asm goto
__asm__("jnp 1f");
__asm__("movl $1, %0" : "=r"(result) : : );
__asm__("1:");
return result;
}
But as commented by @MichaelPetch, the parity flag is computed only on the lower 8 bits. So this will only work if your n fits in the low byte (n < 256). For bigger numbers you will have to use the code you mentioned in your question.
To get it working for 64 bits you can collapse the parity of the 64-bit integer into a single byte by doing
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
This code goes at the start of the function, before the assembly.
You will have to check how it affects the performance.
The most optimized I could get it is
uint parity64(uint64 n){
unsigned char result = 0;
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
__asm__("test %1, %1 \n\t"
"setp %0"
: "+r"(result)
: "r"(n)
:
);
return result;
}
How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.
You shouldn't include assembly into your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately and use the linker to link them.
In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
You can compile this to an object file like so: gcc -c parity64.c -o parity64.o
... and then link the resulting object code with the C code: gcc bindot.c parity64.o -o bindot
In parity64_x86.s you might have the following assembly code from your question:
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
You can compile this to an alternative parity64.o object file using gcc with this command: gcc -c parity64_x86.s -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdante's answer), you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:
bool CheckParity(size_t result) {
return __builtin_parityl(result);
}
To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c
Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern bool CheckParity(size_t Result);
uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
uint64_t parity = 0;
for(size_t i = 0; i < entries; ++i)
parity ^= a[i] & b[i]; // Running sum!
return CheckParity(parity);
}
You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...
I highly recommend reading the manual for your compiler, when you have the time...

Lookup table vs switch in C embedded software

In another thread, I was told that a switch may be better than a lookup table in terms of speed and compactness.
So I'd like to understand the differences between this:
Lookup table
static void func1(){}
static void func2(){}
typedef enum
{
FUNC1,
FUNC2,
FUNC_COUNT
} state_e;
typedef void (*func_t)(void);
const func_t lookUpTable[FUNC_COUNT] =
{
[FUNC1] = &func1,
[FUNC2] = &func2
};
void fsm(state_e state)
{
if (state < FUNC_COUNT)
lookUpTable[state]();
else
;// Error handling
}
and this:
Switch
static void func1(){}
static void func2(){}
void fsm(int state)
{
switch(state)
{
case FUNC1: func1(); break;
case FUNC2: func2(); break;
default: ;// Error handling
}
}
I thought that a lookup table was faster since compilers try to transform switch statements into jump tables when possible.
Since this may be wrong, I'd like to know why!
Thanks for your help!
As I was the original author of the comment, I have to add a very important issue you did not mention in your question. That is, the original was about an embedded system. Presuming this is a typical bare-metal system with integrated Flash, there are very important differences from a PC on which I will concentrate.
Such embedded systems typically have the following constraints.
no CPU cache.
Flash requires waitstates for higher (i.e. >ca. 32MHz) CPU clocks. The actual ratio depends on the die design, low power/high speed process, operating voltage, etc.
To hide waitstates, Flash has wider read-lines than the CPU-bus.
This only works well for linear code with instruction prefetch.
Data accesses disturb instruction prefetch or are stalled until it finished.
Flash might have an internal very small instruction cache.
If any at all, there is an even smaller data-cache.
The small caches result in more frequent thrashing (replacing a previous entry before it has been used another time).
For e.g. the STM32F4xx a read takes 6 clocks at 150MHz/3.3V for 128 bits (4 words). So if a data-access is required, chances are good it adds more than 12 clocks delay for all data to be fetched (there are additional cycles involved).
Presuming compact state-codes, for the actual problem, this has the following effects on this architecture (Cortex-M4):
Lookup-table: Reading the function address is a data-access. With all implications mentioned above.
A switch, on the other hand, uses a special "table-lookup" instruction which uses code-space data right behind the instruction. So the first entries are possibly already prefetched. Other entries don't break the prefetch. Also, the access is a code access, thus the data goes into the Flash's instruction cache.
Also note that the switch does not need functions, thus the compiler can fully optimise the code. This is not possible for a lookup table. At least code for function entry/exit is not required.
Due to the aforementioned and other factors, an exact estimate is hard to give. It heavily depends on your platform and the code structure. But assuming the system given above, the switch is very likely faster (and clearer, btw.).
First, on some processors, indirect calls (e.g. through a pointer) - like those in your Lookup Table example - are costly (pipeline breakage, TLB, cache effects). It might also be true for indirect jumps...
Then, a good optimizing compiler might inline the call to func1() in your Switch example; then you won't run any prologue or epilogue for an inlined function.
You need to benchmark to be sure, since a lot of other factors matter on the performance. See also this (and the reference there).
Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.
(update: gcc -fpie (on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too. GCC Jump Table initialization code generating movsxd and add?. This could be a missed-optimization, see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)
I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.
void fsm_switch(int state) {
switch(state) {
case FUNC0: func0(); break;
case FUNC1: func1(); break;
case FUNC2: func2(); break;
case FUNC3: func3(); break;
default: ;// Error handling
}
//prevent_tailcall();
}
void fsm_lut(state_e state) {
if (likely(state < FUNC_COUNT)) // without likely(), gcc puts the LUT on the taken side of this branch
lookUpTable[state]();
else
;// Error handling
//prevent_tailcall();
}
See also
How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
x86
On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.
GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86). It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.
They make essentially identical code for the LUT: bounds check, zero the upper 32-bit of the arg register with a mov, and then a memory-indirect jump with an indexed addressing mode.
ARM:
gcc 4.8.2 with -mcpu=cortex-m4 -O2 makes interesting code.
As Olaf said, it makes an inline table of 1-byte entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3). This is a normal unconditional jump, since it's a tail-call.
Each table destination entry needs significantly more code (Godbolt) if fsm_switch does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void); is declared but not defined), or if this is inlined into a larger function.
## With void prevent_tailcall(void){} defined so it can inline:
## Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
cmp r0, #3 # state,
bhi .L5 #
tbb [pc, r0] # state
## There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg. And apparently it's even loaded from I-cache, not D-cache
.byte (.L7-.L8)/2
.byte (.L9-.L8)/2
.byte (.L10-.L8)/2
.byte (.L11-.L8)/2
.L11:
b func3 # optimized tail-call
.L10:
b func2
.L9:
b func1
.L7:
b func0
.L5:
bx lr # This is ARM's equivalent of an x86 ret insn
IDK if there's much difference between how well branch prediction works for tbb vs. a full-on indirect jump or call (blx), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch.
I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.
Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.
The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:
fsm_lut:
cmp r0, #3 # state,
bhi .L13 #,
movw r3, #:lower16:.LANCHOR0 # tmp112,
movt r3, #:upper16:.LANCHOR0 # tmp112,
ldr r3, [r3, r0, lsl #2] # tmp113, lookUpTable
bx r3 # indirect register sibling call # tmp113
.L13:
bx lr #
# in the .rodata section
lookUpTable:
.word func0
.word func1
.word func2
.word func3
See Mike of SST's answer for a similar analysis on a Microchip dsPIC.
msc's answer and the comments give you good hints as to why performance may not be what you expect. Benchmarking is the rule, but results will vary from one architecture to another, and may change with other versions of the compiler and of course its configuration and options selected.
Note however that your 2 pieces of code do not perform the same validation on state:
The switch will gracefully do nothing if state is not one of the defined values,
The jump table version will invoke undefined behavior for all but the 2 values FUNC1 and FUNC2.
There is no generic way to initialize the jump table with dummy function pointers without making assumptions on FUNC_COUNT. To get the same behavior, the jump table version should look like this:
void fsm(int state) {
if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL)
lookUpTable[state]();
}
Try benchmarking this and inspect the assembly code. Here is a handy online compiler for this: http://gcc.godbolt.org/#
On the Microchip dsPIC family of devices a look-up table is stored as a set of instruction addresses in the Flash itself. Performing the look-up involves reading the address from the Flash then calling the routine. Making the call adds another handful of cycles to push the instruction pointer and other bits and bobs (e.g. setting the stack frame) of housekeeping.
For example, on the dsPIC33E512MU810, using XC16 (v1.24) the look-up code:
lookUpTable[state]();
Compiles to (from the disassembly window in MPLAB-X):
! lookUpTable[state]();
0x2D20: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D22: ADD W4, W4, W5 ; 1 cycle (addresses are 16 bit aligned)
0x2D24: MOV #0xA238, W4 ; 1 cycle (get base address of look-up table)
0x2D26: ADD W5, W4, W4 ; 1 cycle (get address of entry in table)
0x2D28: MOV [W4], W4 ; 1 cycle (get address of the function)
0x2D2A: CALL W4 ; 2 cycles (push PC+2 set PC=W4)
... and each (empty, do-nothing) function compiles to:
!static void func1()
!{}
0x2D0A: LNK #0x0 ; 1 cycle (set up stack frame)
! Function body goes here
0x2D0C: ULNK ; 1 cycle (un-link frame pointer)
0x2D0E: RETURN ; 3 cycles
This is a total of 11 instruction cycles of overhead for any of the cases, and they all take the same. (Note: If either the table or the functions it contains are not in the same 32K program word Flash page, there will be an even greater overhead due to having to get the Address Generation Unit to read from the correct page, or to set up the PC to make a long call.)
On the other hand, providing that the whole switch statement fits within a certain size, the compiler will generate code that does a test and relative branch as two instructions per case taking three (or possibly four) cycles per case up to the one that's true.
For example, the switch statement:
switch(state)
{
case FUNC1: state++; break;
case FUNC2: state--; break;
default: break;
}
Compiles to:
! switch(state)
0x2D2C: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D2E: SUB W4, #0x0, [W15] ; 1 cycle (compare with first case)
0x2D30: BRA Z, 0x2D38 ; 1 cycle (if branch not taken, or 2 if it is)
0x2D32: SUB W4, #0x1, [W15] ; 1 cycle (compare with second case)
0x2D34: BRA Z, 0x2D3C ; 1 cycle (if branch not taken, or 2 if it is)
! {
! case FUNC1: state++; break;
0x2D38: INC [W14], [W14] ; To stop the switch being optimised out
0x2D3A: BRA 0x2D40 ; 2 cycles (go to end of switch)
! case FUNC2: state--; break;
0x2D3C: DEC [W14], [W14] ; To stop the switch being optimised out
0x2D3E: NOP ; compiler did a fall-through (for some reason)
! default: break;
0x2D36: BRA 0x2D40 ; 2 cycles (go to end of switch)
! }
This is an overhead of 5 cycles if the first case is taken, 7 if the second case is taken, etc., meaning they break even on the fourth case.
This means that knowing your data at design time will have a significant influence on the long-term speed. If you have a significant number (more than about 4 cases) and they all occur with similar frequency then a look-up table will be quicker in the long run. If the frequency of the cases is significantly different (e.g. case 1 is more likely than case 2, which is more likely than case 3, etc.) then, if you order the switch with the most likely case first, then the switch will be faster in the long run. For the edge case when you only have a few cases the switch will (probably) be faster anyway for most executions and is more readable and less error prone.
If there are only a few cases in the switch, or some cases will occur more often than others, then doing the test and branch of the switch will probably take fewer cycles than using a look-up table. On the other hand, if you have more than a handful of cases of that occur with similar frequency then the look-up will probably end up being faster on average.
Tip: Go with the switch unless you know the look-up will definitely be faster and the time it takes to run is important.
Edit: My switch example is a little unfair, as I've ignored the original question and in-lined the 'body' of the cases to highlight the real advantage of using a switch over a look-up. If the switch has to do the call as well then it only has the advantage for the first case!
To have even more compiler outputs, here is what is produced by the TI C28x compiler using @PeterCordes's sample code:
_fsm_switch:
CMPB AL,#0 ; [CPU_] |62|
BF $C$L3,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#1 ; [CPU_] |62|
BF $C$L2,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#2 ; [CPU_] |62|
BF $C$L1,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#3 ; [CPU_] |62|
BF $C$L4,NEQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
LCR #_func3 ; [CPU_] |66|
; call occurs [#_func3] ; [] |66|
B $C$L4,UNC ; [CPU_] |66|
; branch occurs ; [] |66|
$C$L1:
LCR #_func2 ; [CPU_] |65|
; call occurs [#_func2] ; [] |65|
B $C$L4,UNC ; [CPU_] |65|
; branch occurs ; [] |65|
$C$L2:
LCR #_func1 ; [CPU_] |64|
; call occurs [#_func1] ; [] |64|
B $C$L4,UNC ; [CPU_] |64|
; branch occurs ; [] |64|
$C$L3:
LCR #_func0 ; [CPU_] |63|
; call occurs [#_func0] ; [] |63|
$C$L4:
LCR #_prevent_tailcall ; [CPU_] |69|
; call occurs [#_prevent_tailcall] ; [] |69|
LRETR ; [CPU_]
; return occurs ; []
_fsm_lut:
;* AL assigned to _state
CMPB AL,#4 ; [CPU_] |84|
BF $C$L5,HIS ; [CPU_] |84|
; branchcc occurs ; [] |84|
CLRC SXM ; [CPU_]
MOVL XAR4,#_lookUpTable ; [CPU_U] |85|
MOV ACC,AL << 1 ; [CPU_] |85|
ADDL XAR4,ACC ; [CPU_] |85|
MOVL XAR7,*+XAR4[0] ; [CPU_] |85|
LCR *XAR7 ; [CPU_] |85|
; call occurs [XAR7] ; [] |85|
$C$L5:
LCR #_prevent_tailcall ; [CPU_] |88|
; call occurs [#_prevent_tailcall] ; [] |88|
LRETR ; [CPU_]
; return occurs ; []
I also used -O2 optimizations.
We can see that the switch is not converted into a jump table, even though the compiler has the ability to do so.

Which loop has better performance? Increment or decrement? [duplicate]

Possible Duplicate:
Is it faster to count down than it is to count up?
Which loop has better performance? I have learnt from somewhere that the second is better, but I want to know why.
for(int i=0;i<=10;i++)
{
/*This is better ?*/
}
for(int i=10;i>=0;i--)
{
/*This is better ?*/
}
The second "may" be better, because it's easier to compare i with 0 than to compare i with 10, but I think you can use either one, because the compiler will optimize them.
I do not think there is much difference between the performance of both loops.
I suppose it becomes a different situation when the loops look like this:
for(int i = 0; i < getMaximum(); i++)
{
}
for(int i = getMaximum() - 1; i >= 0; i--)
{
}
Here the decrementing version calls getMaximum() only once, while the incrementing version calls it on every iteration (assuming it is not an inline function and the compiler cannot hoist it).
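In that case you could hoist the bound yourself, as in this sketch (assuming getMaximum() has no side effects that must be re-evaluated each iteration):
int limit = getMaximum(); /* evaluated once */
for (int i = 0; i < limit; i++)
{
}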
Decrement loops down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrement loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.
Incrementing and decrementing (INC and DEC, when translated into assembler commands) have the same speed of 1 CPU cycle.
However, the second can theoretically be faster on some (e.g. SPARC) architectures, because no 10 has to be fetched from memory (or cache): most architectures have instructions that compare against the special value 0 in an optimized fashion (usually having a special hardwired zero register to use as an operand, so no register has to be "wasted" to hold the 10 for each iteration's comparison).
A smart compiler (especially if target instruction set is RISC) will itself detect this and (if your counter variable is not used in the loop) apply the second "decrement downto 0" form.
Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
The compiler should optimize both code to the same assembly, so it doesn't make a difference. Both take the same time.
A more valid discussion would be whether
for(int i=0;i<10;++i) //preincrement
{
}
would be faster than
for(int i=0;i<10;i++) //postincrement
{
}
Because, theoretically, post-increment does an extra operation (it has to return the old value). However, even this should be optimized to the same assembly.
Without optimizations, the code would look like this:
for ( int i = 0; i < 10 ; i++ )
0041165E mov dword ptr [i],0
00411665 jmp wmain+30h (411670h)
00411667 mov eax,dword ptr [i]
0041166A add eax,1
0041166D mov dword ptr [i],eax
00411670 cmp dword ptr [i],0Ah
00411674 jge wmain+68h (4116A8h)
for ( int i = 0; i < 10 ; ++i )
004116A8 mov dword ptr [i],0
004116AF jmp wmain+7Ah (4116BAh)
004116B1 mov eax,dword ptr [i]
004116B4 add eax,1
004116B7 mov dword ptr [i],eax
004116BA cmp dword ptr [i],0Ah
004116BE jge wmain+0B2h (4116F2h)
for ( int i = 9; i >= 0 ; i-- )
004116F2 mov dword ptr [i],9
004116F9 jmp wmain+0C4h (411704h)
004116FB mov eax,dword ptr [i]
004116FE sub eax,1
00411701 mov dword ptr [i],eax
00411704 cmp dword ptr [i],0
00411708 jl wmain+0FCh (41173Ch)
so even in this case, the speed is the same.
Again, the answer to all micro-performance questions is measure, measure in context of use and don't extrapolate to other contexts.
Counting instruction execution time hasn't been possible without extraordinary sophistication for quite a long time.
The mismatch between processor and memory speed, and the introduction of caches to hide part of the latency (but not the bandwidth), makes the execution of a group of instructions very sensitive to memory access patterns. That is something you can still optimize for with quite high-level thinking. But it also means that something apparently worse, if one doesn't take the memory access pattern into account, can be better once that is done.
Then superscalar execution (the fact that the processor can do several things at once) and out-of-order execution (the fact that the processor can execute an instruction before a previous one in the flow) make basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want to get a good a priori estimate.
