Performance of array of functions over if and switch statements - c

I am writing a very performance-critical part of the code and I had this crazy idea about substituting case statements (or if statements) with an array of function pointers.
Let me demonstrate; here goes the normal version:
while (statement)
{
    /* 'option' changes on every iteration */
    switch (option)
    {
        case 0: /* simple task */ break;
        case 1: /* simple task */ break;
        case 2: /* simple task */ break;
        case 3: /* simple task */ break;
    }
}
And here is the "callback function" version:
void task0(void) {
    /* simple task */
}

void task1(void) {
    /* simple task */
}

void task2(void) {
    /* simple task */
}

void task3(void) {
    /* simple task */
}

void (*task[4])(void);

task[0] = task0;
task[1] = task1;
task[2] = task2;
task[3] = task3;

while (statement)
{
    /* 'option' changes on every iteration */
    /* and now we call the function with the 'case' number */
    (*task[option])();
}
So which version will be faster? Is the overhead of the function call eliminating speed benefit over normal switch (or if) statement?
Of course the latter version is not as readable, but I am looking for all the speed I can get.
I am about to benchmark this when I get things set up, but if someone has an answer already, I won't bother.

I think at the end of the day your switch statement will be the fastest, because function pointers have the overhead of the table lookup plus the function call itself, while a switch compiles straight to a jump table. It of course depends on things that only testing can answer. That's my two cents' worth.

The switch statement should be compiled into a branch table, which is essentially the same thing as your array of functions, if your compiler has at least basic optimization capability.

Which version will be faster depends. The naive implementation of switch is a huge if ... else if ... else if ... construction meaning it takes on average O(n) time to execute where n is the number of cases. Your jump table is O(1) so the more different cases there are and the more the later cases are used, the more likely the jump table is to be better. For a small number of cases or for switches where the first case is chosen more frequently than others, the naive implementation is better. The matter is complicated by the fact that the compiler may choose to use a jump table even when you have written a switch if it thinks that will be faster.
The only way to know which you should choose is to performance test your code.

First, I would randomly-pause it a few times, to make certain enough time is spent in this dispatching to even bother optimizing it.
Second, if it is, since each branch spends very few cycles, you want a jump table to get to the desired branch. The reason switch statements exist is to suggest to the compiler that it can generate one if the switch values are compact.
How long is the list of switch values? If it's short, the if-ladder could still be faster, especially if you put the most frequently used codes at the top. An alternative to an if-ladder (that I've never actually seen anyone use) is an if-tree, the code equivalent of a binary tree.
You probably don't want an array of function pointers. Yes, it's an array reference to get the function pointer, but there's several instructions' overhead in calling a function, and it sounds like that could overwhelm the small amount being done inside each function.
In any case, looking at the assembly language, or single-stepping at the instruction level, will give you a good idea how efficient it's being.

A good compiler will compile a switch with cases in a small numerical range as a single conditional to see if the value is in that range (which can sometimes be optimized out) followed by a jumptable jump. This will almost surely be faster than a function call (direct or indirect) because:
A jump is a lot less expensive than a call (which must save call-clobbered registers, adjust the stack, etc.).
The code in the switch statement cases can make use of expression values already cached in registers in the caller.
It's possible that an extremely advanced compiler could determine that the call-via-function pointer only refers to one of a small set of static-linkage functions, and thereby optimize things heavily, maybe even eliminating the calls and replacing them by jumps. But I wouldn't count on it.

I arrived at this post recently since I was wondering the same. I ended up taking the time to try it. It certainly depends greatly on what you're doing, but for my VM it was a decent speed up (15-25%), and allowed me to simplify some code (which is probably where a lot of the speedup came from). As an example (code simplified for clarity), a "for" loop was able to be easily implemented using a for loop:
void OpFor( Frame* frame, Instruction* &code )
{
    i32 start = GET_OP_A(code);
    i32 stop_value = GET_OP_B(code);
    i32 step = GET_OP_C(code);
    // instruction count (ie. block size)
    u32 i_count = GET_OP_D(code);
    // pointer to end of block (NOP if it branches)
    Instruction* end = code + i_count;
    if( step > 0 )
    {
        for( u32 i = start; i < stop_value; i += step )
        {
            // rewind instruction pointer
            Instruction* cur = code;
            // execute code inside for loop
            while( cur != end )
            {
                cur->func( frame, cur );
                ++cur;
            }
        }
    }
    else
    {
        // same with <=
    }
}

Related

C: How to extract a huge predefined switch from a huge loop without losing performance?

I have a bottleneck, which looks like this:
void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        switch (type) {
            case 0:
                // do some stuff 0
                break;
            [...]
            case n:
                // do some stuff n
                break;
        }
        // do some stuff B
    }
}
n and m are large:
m is in the millions, sometimes hundreds of millions.
n is 2^7 - 2^10 (128 - 1024).
The chunks of code A and B are fairly large.
I rewrote the code (via macros) as follows:
void function(int type) {
    switch (type) {
        case 0:
            for (int i = 0; i < m; i++) {
                // do some stuff A
                // do some stuff 0
                // do some stuff B
            }
            break;
        [...]
        case n:
            for (int i = 0; i < m; i++) {
                // do some stuff A
                // do some stuff n
                // do some stuff B
            }
            break;
    }
}
As a result, it looks like this in IDA for this function:
Is there a way to remove the switch from the loop:
without creating a bunch of copies of the loop
not create huge function with macros
without losing performance?
A possible solution seems to me to be a goto with a variable target (GCC's labels-as-values extension). Something like this:
void function(int type) {
    void* typeLabel;
    switch (type) {
        case 0:
            typeLabel = &&label_1;
            break;
        [...]
        case n:
            typeLabel = &&label_n;
            break;
    }
    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *typeLabel;
back:
        // do some stuff B
    }
    goto end;
label_1:
    // do some stuff 0
    goto back;
[...]
label_n:
    // do some stuff n
    goto back;
end: ;
}
The matter is also complicated by the fact that all of this will be carried out on different Android devices with different speeds.
Architecture as ARM, and x86.
Perhaps this can be done assembler inserts rather than pure C?
EDIT:
I ran some tests. n = 45,734,912
loop-within-switch: 891,713 μs
switch-within-loop: 976,085 μs
loop-within-switch is 9.5% faster than switch-within-loop.
For comparison: a simple realisation without the switch takes 1,746,947 μs.
At the moment, the best solution I can see is:
Generate with macros n functions, which will look like this:
void func_n() {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff n
        // do some stuff B
    }
}
Then make an array of pointers to them and call through it from the dispatching function:
typedef void (*func)(void);

void function(int type) {
    func table[n];
    // fill table array with pointers to func_0 .. func_n
    table[type](); // call the appropriate func
}
This allows the compiler's optimizer to optimize each function func_0 .. func_n separately. Moreover, they will not be so big.
Realistically, a static array of labels is likely the fastest sane option (array of pointers being the sanest fast option). But, let's get creative.
(Note that this should have been a comment, but I need the space).
Option 1: Exploit the branch predictor
Let's build on the fact that if a certain outcome of a branch happens, the predictor will likely predict the same outcome in the future. Especially if it happens more than once. The code would look something like:
for (int i = 0; i < m; i++)
{
    // do some stuff A
    if (type < n/2)
    {
        if (type < n/4)
        {
            if (type < n/8)
            {
                if (type == 0) { /* do some stuff 0 */ }
                else           { /* do some stuff 1 */ }
            }
            else
            {
                ...
            }
        }
        else
        {
            ...
        }
    }
    else
    {
        ...
        // do some stuff n
    }
    // do some stuff B
}
Basically, you binary search what to do, in log(n) steps. That is a log(n) possible jumps, but after only one or two iterations, the branch predictor will predict them all correctly, and will speculatively execute the proper instructions without problem. Depending on the CPU, this could be faster than a goto *labelType; back: as some are unable to prefetch instructions when the jump address is calculated dynamically.
Option 2: JIT load the proper 'stuff'
So, ideally, your code would look like:
void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff [type]
        // do some stuff B
    }
}
With all the other 0..n "stuffs" being junk in the current function invocation. Well, let's make it like that:
void function(int type) {
    prepare(type);
    for (int i = 0; i < m; i++) {
        // do some stuff A
reserved:
        doNothing(); doNothing(); doNothing(); doNothing(); doNothing();
        // do some stuff B
    }
}
The doNothing() calls are there just to reserve the space in the function. Best implementation would be goto B. The prepare(type) function will look in the lookup table for all the 0..n implementations, take the type one, and copy it over all those goto Bs. Then, when you are actually executing your loop, you have the optimal code where there are no needless jumps.
Just be sure to have some final goto B instruction in the stuff implementation - copying a smaller one over a larger one could cause problems otherwise. Alternatively, before exiting function you can restore all the placeholder goto B; instructions. It's a small cost, since you're only doing it once per invocation, not per iteration.
prepare() would be much easier to implement in assembly than in C, but it is doable. You just need the start/end addresses of all stuff_i implementations (in your post, these are label_[i] and label_[i+1]), and memcpy that into reserved.
Maybe the compiler will even let you do:
memcpy((uint8_t*)reserved, (uint8_t*)label_1, (uint8_t*)label_2 - (uint8_t*)label_1);
Likely not, though. You can, however, get the proper locations using setjmp or something like __builtin_return_address / _ReturnAddress within a function call.
Note that this will require write access to the instruction memory. Getting that is OS specific, and likely requires su/admin privileges.
The compiler is generally good at choosing an optimal form for the switch. For an ARM device there are a few possible forms for dense case values: either a branch table (like a bunch of function pointers) or, if the code in each case is near identical, an arithmetic jump computed from the case index. Semantically, something like this,
dest = &first_switch_pc;
dest += n*switch_code_size;
current_pc = dest;
An ARM CPU may do this in a single instruction. This is probably not profitable in your case, as type is constant across the loop iterations.
However, I would definitely explore restructuring your code like this,
void function(int type) {
    int i = 0;
    if (m == 0) return;
    // initialize type_label;
    goto entry;
    while (1) {
        // do some stuff B
        i++;
        if (i >= m) break;
entry:
        // do some stuff A
        goto *type_label;
label_1:
        // do some stuff 0
        continue;
[...]
label_n:
        // do some stuff n
        continue;
    }
}
This will merge the 'A' and 'B' code so that it will fit well in the code cache. The 'control flow' from the 'goto label' will then be to the top of the loop. You may be able to simplify the control flow logic depending on how i is used in the unknown snippets. A compiler may do this for you automatically depending on optimization levels, etc. No one can really give an answer without more information and profiling. The cost of 'stuff A', 'stuff B' and the size of the switch snippets are all important. Examining the assembler output is always helpful.
This pdf of slides from a presentation about getting gcc to thread jumps is interesting. This is the exact optimization gcc needs to do to compile the switch-inside-loop version similarly to the loop-inside-switch version.
BTW, the loop-inside-switch version should be equivalent in performance to the loop-inside-separate-functions version. Cache operates in terms of cache lines, not whole functions. If most of the code in a function never runs, it doesn't matter that it's there. Only the code that does run takes space in the cache.
If all ARM cores in Android devices have branch-target prediction for indirect jumps, your second implementation of doing the compiler's job for it, and doing an indirect goto inside the loop, is probably the best tradeoff between code size and performance. A correctly-predicted unconditional indirect branch costs about the same as a couple add instructions on x86. If ARM is similar, the savings in code size should pay for it. Those slides talk about some ARM cores having indirect-branch prediction, but doesn't say that all do.
This Anandtech article about A53 cores (the little cores in big.LITTLE) says that A53 vastly increased the indirect-branch prediction resources compared to A7. A7 cores have an 8-entry indirect branch target buffer. That should be enough to make the goto *label in your loop efficient, even on very weak LITTLE cores, unless the rest of your loop has some indirect branches inside it. One mispredict on the occasional iteration should only cost maybe 8 cycles. (A7 has a short 8-stage pipeline, and is "partial dual issue, in-order", so branch mispredicts are cheaper than on more powerful CPUs.)
Smaller code size means less code to be loaded from flash, and also less I-cache pressure if the function is called with different arguments for type while the do stuff for A and do stuff for B code is still present in I$, and has its branch-prediction history still available.
If the do stuff for [type] code changes how branches in the stuff for A and B code behaves, it may be best to have the entire loop body duplicated, so different copies of the branch have their own prediction entries.
If you want to sort out what's actually slow, you're going to have to profile your code. If ARM is like x86 in having hardware performance counters, you should be able to see which instructions are taking a lot of cycles. Also actually count branch mispredicts, I$ misses, and lots of other stuff.
To make any further suggestions, we'd need to see how big your pieces of code are, and what sort of thing they're doing. Clearly you think loop and switch overhead are making this hot function more of a bottleneck than it needs to be, but you haven't actually said that loop-inside-switch gave better performance.
Unless all the do stuff A, do stuff B, and many of the do stuff [type] blocks are very small, the switch is probably not the problem. If they are small, then yes, it is probably worth duplicating the loop N times.
Another solution is use labels as values:
void function(int type) {
    void *type_switch = &&type_break;
    switch (type) {
        case 0:
            type_switch = &&type_0;
            break;
        [...]
        case n:
            type_switch = &&type_n;
            break;
    }
    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *type_switch;
type_0: {
        // do some stuff 0
        goto type_break;
    }
[...]
type_n: {
        // do some stuff n
        goto type_break;
    }
type_break: ;
        // do some stuff B
    }
}
This solution is worse than the version with lots of functions:
If optimization is not enabled, the variables will be reloaded from the stack each time in the code for parts 0 .. n.
The goto address may also be reloaded from the stack each time.
There are two extra gotos per iteration.

Using hardware timer in C

Okay, so I've got some C code to perform a mathematical operation which could, pretty much, take any length of time (depending on the operands supplied to it, of course). I was wondering if there is a way to register some kind of method which will be called every n seconds which can analyse the state of the operation, i.e. what iteration it is currently at, possibly using a hardware timer interrupt or something?
The reason I ask this is because I know the common way to implement this is to be keeping track of the current iteration in a variable; say, an integer called progress and have an IF statement like this in the code:
if ((progress % 10000) == 0)
    printf("Currently at iteration %d\n", progress);
but I believe that a mod operation takes a relatively long time to execute, so the idea of having it inside a loop which will be run many, many times scares me, from an optimisation point of view.
So I get the feeling that having an external way of signalling a progress print is nice and efficient. Are there any great ways to perform this, or is the simple 'mod check' the best (in terms of optimising)?
I'd go with the mod check, but maybe with subtractions instead :-)
icount = 0;
progress = 10000;
/* ... */
if (--progress == 0) {
    progress = 10000;
    printf("Currently at iteration %d0000\n", ++icount);
}
/* ... */
While mod operations are usually slow, the compiler should be able to optimize and predict this really well and only mispredict once every 10,000 ifs, burning one mod operation and ~20 cycles (for the misprediction) on it, which is fine. So you are trying to optimize away one mod operation every 10,000 iterations. Of course this assumes you are running it on a modern and typical CPU, and not some embedded system with unknown specs. This should even be faster than having a counter variable.
Suggestion: Test it with and without the timing code, and figure out a complex solution if there is really a problem.
Premature optimisation is the root of all evil. -Knuth
mod is about the same speed as division, on most CPU's these days that means about 5-10 cycles... in other words hardly anything, slower than multiply/add/subtract, but not enough to really worry about.
However, you are right to want to avoid spinning in a loop if you're doing work in another thread or something like that. If you're on a unixish system there's timer_create(), or on Linux the much easier to use timerfd_create().
But for single threaded, just putting that if in is enough.
Use setitimer (a generalization of alarm) to raise SIGALRM signals at regular intervals.
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

struct itimerval interval;

void handler( int x ) {
    write( STDOUT_FILENO, ".", 1 ); /* Defined in POSIX, not in C */
}

int main() {
    signal( SIGALRM, &handler );
    interval.it_value.tv_sec = 5;    /* display after 5 seconds */
    interval.it_interval.tv_sec = 5; /* then display every 5 seconds */
    setitimer( ITIMER_REAL, &interval, NULL );

    /* do computations */

    interval.it_value.tv_sec = 0;    /* don't display progress any more */
    interval.it_interval.tv_sec = 0;
    setitimer( ITIMER_REAL, &interval, NULL );
    printf( "\n" ); /* done with the dots! */
}
Note, only a smattering of functions are OK to call inside handler. They are listed partway down this page. If you want to communicate anything for a fancier printout, do it through a sig_atomic_t variable.
You could have a global variable for the iteration count, which you could monitor from an external thread:
while (1) {
    print(iteration);
    sleep(1000);
}
You may need to watch out for data races though.

Is this a good implementation of an FPS-independent game loop?

I currently have something close to the following implementation of an FPS-independent game loop for physics-based games. It works very well on just about every computer I have tested it on, keeping the game speed consistent when frame rate drops. However, I am going to be porting to embedded devices which will likely struggle harder with video, and I am wondering if it will still cut the mustard.
edits:
For this question assume that msec() returns the time passed in milliseconds for which the program has run. The implementation of msec is different on different platforms. This loop is also run in different ways on different platforms.
#define MSECS_PER_STEP 20

int stepCount, stepSize; // these are not globals in the real source

void loop() {
    int i, j;
    int iterations = 0;
    static int accumulator; // the accumulator holds extra msecs
    static int lastMsec;

    int deltatime = msec() - lastMsec;
    lastMsec = msec();
    // deltatime should be the time since the last call to loop

    if (deltatime != 0) {
        // iterations determines the number of steps which are needed
        iterations = deltatime / MSECS_PER_STEP;
        // save any left over millisecs in the accumulator
        accumulator += deltatime % MSECS_PER_STEP;
    }
    // when the accumulator has gained enough msecs for a step...
    while (accumulator >= MSECS_PER_STEP) {
        iterations++;
        accumulator -= MSECS_PER_STEP;
    }

    handleInput(); // gathers user input from an event queue

    for (j = 0; j < iterations; j++) {
        // here step count is a way of taking a more granular step
        // without affecting the overall speed of the simulation (step size)
        for (i = 0; i < stepCount; i++) {
            doStep(stepSize / (float)stepCount); // forwards the sim
        }
    }
}
I just have a few comments. The first is that you don't have enough comments. There are places where it's not clear what you are trying to do so it is difficult to say if there is a better way to do it, but I'll point those out as I come to them. First, though:
#define MSECS_PER_STEP 20

int stepCount, stepSize; // these are not globals in the real source

void loop() {
    int i, j;
    int iterations = 0;
    static int accumulator; // the accumulator holds extra msecs
    static int lastMsec;
These are not initialized to anything. They probably turn up as 0, but you should have initialized them. Also, rather than declaring them as static you might want to consider putting them in a structure that you pass into loop by reference.
int deltatime = msec() - lastMsec;
Since lastMsec wasn't (initialized and is probably 0) this probably starts out as a big delta.
lastMsec = msec();
This line, just like the last line, calls msec. This is probably meant as "the current time", and these calls are close enough that the returned value is probably the same for both calls, which is probably also what you expected, but still, you call the function twice. You should change these lines to int now = msec(); int deltatime = now - lastMsec; lastMsec = now; to avoid calling this function twice. Current time getting functions often have much higher overhead than you think.
if (deltatime != 0) {
    iterations = deltatime / MSECS_PER_STEP;
    accumulator += deltatime % MSECS_PER_STEP;
}
You should have a comment here that says what this does, as well as a comment above that says what the variables were meant to mean.
while (accumulator >= MSECS_PER_STEP) {
    iterations++;
    accumulator -= MSECS_PER_STEP;
}
This loop needs a comment. It also needs to not be there. It appears that it could have been replaced with iterations += accumulator/MSECS_PER_STEP; accumulator %= MSECS_PER_STEP;. The division and modulus should run in shorter and more consistent time than the loop on any machine that has hardware division (which many do).
handleInput(); // gathers user input from an event queue

for (j = 0; j < iterations; j++) {
    for (i = 0; i < stepCount; i++) {
        doStep(stepSize / (float)stepCount); // forwards the sim
    }
}
Doing steps in a loop independent of input will have the effect of making the game unresponsive if it does execute slowly and gets behind. It appears, at least, that if the game gets behind, all of the input will start to stack up and get executed together, and all of the in-game time will pass in one chunk. This is a less than graceful way to fail.
Additionally, I can guess what the j loop (outer loop) means, but the inner loop I am less clear on. Also, the value passed to the doStep function -- what does it mean?
}
This is the last curly brace. I think that it looks lonely.
I don't know what calls your loop function, which may be out of your control, and that may dictate what this function does and how it looks; but if not, I hope that you will reconsider the structure. I believe that a better way to do it would be to have a function that is called repeatedly but with only one event at a time (issued regularly at a relatively short period). These events can be either user input events or timer events. User input events just set things up to react upon the next timer event. (When you don't have any events to process, you sleep.)
You should always assume that each timer event is processed at the same period, even though there may be some drift here if the processing gets behind. The main oddity that you may notice here is that if the game gets behind on processing timer events and then catches up again the time within the game may appear to slow down (below real time), then speed up (to real time), and then slow back down (to real time).
Ways to deal with this include only allowing one timer event to be in the event queue at one time, which would result in time appearing to slow down (below real time) and then speed back up (to real time) with no super speed interval.
Another way to do this, which is functionally similar to what you have, would be to have the last step of processing each timer event be to queue up the next timer event (note that no one else should send timer events {except for the first one} if this is the way you choose to implement the game). This would mean doing away with the regular time intervals between timer events and also restrict the ability for the program to sleep, since at the very least every time the event queue were inspected there would be a timer event to process.

Using the C preprocessor to effectively rename variables

I'm writing a few very tight loops and the outermost loop will run for over a month. It's my understanding that the fewer local variables a function has, the better the compiler can optimize it. In one of the loops, I need a few flags, only one of which is used at a time. If you were the proverbial homicidal maniac who knows where I live, would you rather have the flag named flag and used as such throughout, or would you prefer something like
unsigned int flag;

while (condition) {
#define found_flag flag
    found_flag = 0;
    for (i = 0; i < n; i++) {
        if (found_condition) {
            found_flag = 1;
            break;
        }
    }
    if (!found_flag) {
        /* not found action */
    }

    /* other code leading up to the next loop with flag */
#define next_flag flag
    next_flag = 0;
    /* ... */
}
This provides the benefit of allowing descriptive names for each flag without adding a new variable but seems a little unorthodox. I'm a new C programmer so I'm not sure what to do here.
Don't bother doing this, just use a new variable for each flag. The compiler will be able to determine where each one is first and last used and optimise the actual amount of space used accordingly. If none of the usage of the flag variables overlap, then the compiler may end up using the same space for all flag variables anyway.
Code for readability first and foremost.
I completely agree with dreamlax: the compiler will be smart enough for you to ignore this issue entirely, but I'd like to mention that you neglected a third option, which is rather more readable:
while (something) {
    /* setup per-loop preconditions */
    {
        int flag1;
        while (anotherthing) {
            /* ... */
        }
        /* deal with flag found or not-found here */
    }
    /* possibly some other preconditions */
    {
        int flag2;
        while (stillanotherthing) {
            /* ... */
        }
    }
}
which would tell a dumb compiler explicitly when you are done with each flag. Note that you will need to take care about where you declare variables that need to live beyond the flag-scope blocks.
Your trick would only be useful on very old, very simple, or buggy compilers that aren't capable of correct register (re)allocation and scheduling (sometimes, that's what one is stuck with for various or ancient embedded processors). gcc, and most modern compilers, when optimizations are turned on, would reallocate any register or local memory resources used for local variables until they are almost hard to find when debugging at the machine code level. So you might as well make your code readable and not spend brain power on this type of premature optimization.

In C, which is faster: if with returns, or else if with returns?

Is it better to have if / else if, if every block in the if statement returns, or is it better to have a chain of ifs? To be specific, which is fastest:
A:
if (condition1) {
    code1;
    return a;
}
if (condition2) {
    code2;
    return b;
}
// etc...
B:
if (condition1) {
    code1;
    return a;
}
else if (condition2) {
    code2;
    return b;
}
// etc...
It makes no difference, and this is a needless attempt at micro-optimization.
The C standard does not dictate what machine language gets created based on the C code. You can sometimes make assumptions if you understand the underlying architecture but even that is unwise.
The days are long past where CPUs are simple beasts now that they have pipelining, multiple levels of caches and all sorts of other wondrous things to push their speed to the limit.
You should not be worrying about this level of optimization until you have a specific problem (some would say "at all").
Write your code to be readable.
That should be rules number 1, 2 and 3. Which do you think is the greater problem in software development: code running at 99.5% of its maximum speed, or developers spending days trying to figure out and/or fix what a colleague (or even they themselves) did six months ago?
My advice is to worry about performance only when you find it's a problem, then benchmark on the target platforms to see where the greatest improvement can be gained. A 1% improvement in a if statement is likely to be dwarfed by choosing a better algorithm elsewhere in your code (other things, such as number of times the code is called, being equal of course). Optimization should always be targeted to get the best bang-per-buck.
With those returns, the else is superfluous. The compiler is likely smart enough to figure this out.
I suspect the compiler will generate the same code for both. Disassemble it and see.
In any case, examining the output of the compiler and empirical performance testing is the only way to be sure.
They should be equivalent on most architectures. The instructions generated are probably still the same bne, cmps and rets.
What might help is if you use a switch/case instead of an if statement.
I don't really think there is a big difference, if any:
For the A case:
if (condition) {
    // conditionOp
    // cmp ..., ...
    // jxx :toEndIf
    code;
    return bar;
    // mov eax, bar
    // jmp :toEnd
}
if (condition2) {
    // conditionOp
    // cmp ..., ...
    // jxx :toEndIf
    code;
    return bar;
    // mov eax, bar
    // jmp :toEnd
}
For the B case:
if (condition) {
    // conditionOp
    // cmp ..., ...
    // jxx :toElse + 1
    code;
    return bar;
    // mov eax, bar
    // jmp :toEnd
} else
    // jmp :endElse
if (condition2) {
    // conditionOp
    // cmp ..., ...
    // jxx :endElse
    code;
    return bar;
    // mov eax, bar
    // jmp :toEnd
}
Thus, using the B case, one extra instruction is added. Though, optimizing for size may get rid of that.
Write a simple test program to measure this and find out - but yes this is needless optimization.
This should perform the same in the optimized builds. If not, then something else is likely preventing the compiler from doing the "right thing".
Robbotic is incorrect. In both instances, if the first clause is true, then the subsequent statements will not be executed (evaluated).
Note, be sure to measure - you may be optimizing the wrong thing.
