To learn more about the CPU and code optimization I have started to study Assembly programming. I have also read about clever optimizations like "branch prediction" that the CPU does to speed itself up.
My question might seem foolish since I do not know the subject very well yet.
I have a very vague memory of reading somewhere (on the internet) that goto statements decrease the performance of a program because they do not work well with the branch prediction in the CPU. This might, however, just be something that I made up and did not actually read.
I think that it could be true.
I hope this example (in pseudo-C) will clarify why I think that is so:
int function(...) {
    VARIABLES DECLARED HERE
    if (HERE IS A TEST) {
        CODE HERE ...
    } else if (ANOTHER TEST) {
        CODE HERE ...
    } else {
        /*
           Let us assume that the CPU was smart and predicted this path.
           What about the jump to `label`?
           Is it possible for the CPU to "pre-fetch" the instructions over there?
        */
        goto label;
    }
    CODE HERE...
label:
    CODE HERE...
}
To me it seems like a very complex task, because the CPU would need to look up the place the goto jumps to in order to be able to pre-fetch the instructions over there.
Do you know anything about this?
Unconditional branches are not a problem for the branch predictor, because the branch predictor doesn't have to predict them.
They add a bit of complexity to the speculative instruction fetch unit, because the existence of branches (and other instructions which change the instruction pointer) means that instructions are not always fetched in linear order. Of course, this applies to conditional branches too.
Remember, branch prediction and speculative execution are different things. You don't need branch prediction for speculative execution: you can just speculatively execute code assuming that branches are never taken, and if you ever do take a branch, cancel out all the operations from beyond that branch. That would be a particularly stupid thing to do in the case of unconditional branches, but it would keep the logic nice and simple. (IIRC, this was how the first pipelined processors worked.)
(I guess you could have branch prediction without speculative execution, but there wouldn't really be a point to it, since the branch predictor wouldn't have anybody to tell its predictions to.)
So yes, branches -- both conditional and unconditional -- increase the complexity of instruction fetch units. That's okay. CPU architects are some pretty smart people.
EDIT: Back in the bad old days, it was observed that the use of goto statements could adversely affect the ability of the compilers of the day to optimize code. This might be what you were thinking of. Modern compilers are much smarter, and in general are not taken too much aback by goto.
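To make the point about unconditional jumps a bit more concrete, here is a minimal sketch. The function names and the early-exit scenario are made up for illustration. Both functions compute the same result; in the first, the goto is just a jump whose target is fixed at compile time, so once that path is chosen there is no direction left to predict.

```c
#include <stddef.h>

/* Hypothetical example: sum an array, skipping the loop for empty input.
 * The goto becomes an unconditional jump with a fixed, known target. */
int sum_with_goto(const int *values, size_t count)
{
    int total = 0;
    size_t i;

    if (values == NULL || count == 0)
        goto done;              /* unconditional once this path is taken */

    for (i = 0; i < count; i++)
        total += values[i];

done:
    return total;
}

/* Structured equivalent: the same conditional test, no goto. */
int sum_structured(const int *values, size_t count)
{
    int total = 0;

    if (values != NULL) {
        for (size_t i = 0; i < count; i++)
            total += values[i];
    }
    return total;
}
```

A compiler will often produce very similar machine code for both versions; looking at the assembly output (for example with the -S option of GCC or Clang) is a reasonable way to check on your own toolchain.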
due to 'pipelining' and similar activities,
the branch instruction could actually be placed several instructions
before the location where the actual branch is to occur.
(this is part of the branch prediction logic found in the compiler).
a goto statement is just a jump instruction.
As a side note:
Given structured programming concepts,
code clarity, readability, maintainability considerations, etc;
the 'goto' statement should never be used.
on most CPUs,
any jump/call/return type of instruction will flush the prefetch cache
then reload that cache from the new location, IF the new location
is not already in the cache.
Note: for small loops,
which will always contain 'at least' one jump instruction,
many CPUs have an internal buffer that the programmer can exploit
to make small loops only perform one prefetch sequence
and therefore execute noticeably faster.
From reading this I came across the next two quotes:
First quote:
A typical case of unpredictable branch behavior is when the comparison result is dependent on data.
Second quote:
No Branches Means No Mispredicts
For my project, I work with data-dependent conditions and use many if and switch statements. My project is related to Big Data, so it has to be as efficient as possible. So I wanted to test it on the data provided by the user, to see whether branch prediction actually slows down my program or helps. From reading here:
misprediction delay is between 10 and 20 clock cycles.
What shocked me most was:
Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code.
Why use branch prediction then?
Is there a way to force the compiler to generate assembly code without branches? Or to disable branch prediction in the CPU, so I can compare both results?
to see if branch prediction actually slows down my program or helps
Branch prediction doesn't slow down programs. When people talk about the cost of missed predictions, they're talking about how much more expensive a mispredicted branch is compared to a correctly predicted branch.
If branch prediction didn't exist, all branches would be as expensive as a mispredicted one.
So what "misprediction delay is between 10 and 20 clock cycles" really means is that successful branch prediction saves you 10 to 20 cycles.
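As a rough worked example of that arithmetic (a sketch only: the 20-cycle penalty is the upper figure from the quote, and the 5% miss rate used below is an assumed number, not a measurement), the average extra cost per branch is just the miss rate times the penalty:

```c
/* Average extra cycles per branch caused by mispredictions.
 * miss_rate is the fraction of branches mispredicted (0.0 to 1.0);
 * penalty_cycles is the cost of one misprediction. Both are values you
 * would have to measure or look up for a real CPU. */
double average_mispredict_cost(double miss_rate, double penalty_cycles)
{
    return miss_rate * penalty_cycles;
}
```

With an assumed 5% miss rate, average_mispredict_cost(0.05, 20.0) comes to about 1 cycle per branch. Treating every branch as mispredicted, which is what having no predictor amounts to, gives average_mispredict_cost(1.0, 20.0), i.e. the full 20 cycles per branch.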
Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code.
Why use branch prediction then?
Why use branch prediction over removing branches? You shouldn't. If a compiler can remove branches, it will (assuming optimizations are enabled), and if programmers can remove branches (assuming it doesn't harm readability or it's a performance-critical piece of code), they should.
That hardly makes branch prediction useless though. Even if you remove as many branches as possible from a program, it will still contain many, many branches. Because of this, and because of how expensive mispredicted branches are, branch prediction is essential for good performance.
Is there a way to force the compiler to generate assembly code without branches?
An optimizing compiler will already remove branches from a program when it can (without changing the semantics of the program), but, unless we're talking about a very simple int main() {return 0;}-type program, it's impossible to remove all branches. Loops require branches (unless they're unrolled, but that only works if you know the number of iterations ahead of time) and so do most if- and switch-statements. If you can minimize the number of ifs, switches and loops in your program, great, but you won't be able to remove all of them.
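For illustration, here is a small sketch of removing a data-dependent branch by hand. The function names and the threshold of 128 are assumptions chosen for the example. Note that only the if inside the loop body goes away; the loop itself still needs its branch, as described above.

```c
#include <stddef.h>

/* Branchy version: the if depends on the data. */
long sum_large_branchy(const unsigned char *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)
            sum += data[i];
    }
    return sum;
}

/* Branch-free body: the comparison yields 0 or 1, which is turned into
 * an all-zeros or all-ones mask. The loop still has its own branch. */
long sum_large_branchless(const unsigned char *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long mask = -(long)(data[i] >= 128);   /* 0 or -1 (all bits set) */
        sum += data[i] & mask;
    }
    return sum;
}
```

Whether the second version is actually faster depends on the data and on what the compiler already does with the first one, so it is worth measuring both.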
Or to disable branch prediction in the CPU, so I can compare both results?
To the best of my knowledge it is impossible to disable branch prediction on x86 or x86-64 CPUs. And as I said, this would never improve performance (though it might make it predictable, but that's not usually a requirement in the contexts where these CPUs are used).
Modern processors have pipelines which allow the CPU to work a lot faster than it would be able to otherwise. This is a form of parallelism where it starts processing an instruction a few clock cycles before the instruction is actually needed. See here for more details.
This works great until we hit a branch. Since we are jumping, the work that is in the pipeline is no longer relevant. The CPU then needs to flush the pipeline and restart again. This causes a delay of a few clock cycles until the pipeline is full again. This is known as a pipeline stall.
Modern CPUs are clever enough when it comes to unconditional jumps to follow the jump when filling the pipeline, thus preventing the stall. This does not work for conditional branches, since the CPU does not yet know which way the branch is going to go.
Branch Prediction tries to solve this problem by making a guess as to which branch the CPU will follow before fully evaluating the jump. This (when it works) prevents the stall.
Since almost all programming involves making decisions, branching is unavoidable. But one certainly can write code with fewer branches and thus lessen the delays caused by misprediction. Once we are branching, branch prediction at least allows us a chance of getting things right and not having a CPU pipeline stall.
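To give a feel for what "making a guess" can look like, here is a sketch of one classic textbook scheme, a 2-bit saturating counter. This is an assumption for illustration only: real predictors are far more elaborate, and nothing above says which scheme any particular CPU uses. The function simulates one counter for one branch and counts how often its guess is wrong.

```c
#include <stddef.h>

/* Count mispredictions of a single 2-bit saturating counter.
 * outcomes[i] is nonzero if the branch was taken on its i-th execution.
 * Counter states 0 and 1 predict "not taken"; 2 and 3 predict "taken". */
size_t count_mispredictions(const int *outcomes, size_t n)
{
    unsigned counter = 0;      /* start at "strongly not taken" */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        int predicted_taken = counter >= 2;
        int taken = outcomes[i] != 0;

        if (predicted_taken != taken)
            misses++;

        /* Move the counter toward the actual outcome, saturating at 0 and 3. */
        if (taken && counter < 3)
            counter++;
        else if (!taken && counter > 0)
            counter--;
    }
    return misses;
}
```

A branch that is always taken is guessed wrong only while the counter warms up, whereas a branch that strictly alternates defeats this simple scheme about half the time.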
I was recently thinking about branch prediction in modern CPUs.
As far as I understand, branch prediction is necessary, because when executing instructions in a pipeline, we don't know the result of the conditional operation right before taking the branch.
Since I know that modern out-of-order CPUs can execute instructions in any order, as long as the data dependencies between them are met, my question is, can CPUs reorder instructions in such a way that the branch target is already known by the time the CPU needs to take the branch, thus can "anticipate" the branch direction, so it doesn't need to guess at all?
So can the CPU turn this:
do_some_work();
if(condition()) //evaluating here requires the cpu to guess the direction or stall
do_this();
else
do_that();
To this:
bool result = condition();
do_some_work(); //bunch of instructions that take longer than the pipeline length
if(result) //value of result is known, thus decision is always 100% correct
do_this();
else
do_that();
A particular and very common use case would be iterating over collections, where the exit condition is often loop-invariant(since we usually don't modify the collection while iterating over it).
My question is: can modern CPUs generally do this, and if so, which particular CPU cores are known to have this feature?
Keep in mind that branch prediction is done so early in the pipeline that the instruction has not been decoded yet, so you can't resolve the data dependency because you don't know which register is used. You may be able to remember that somewhere, but not with 100% reliability (since your storage capacity and time are limited), so that's pretty much what your normal branch predictor already does: speculate on the target based on the instruction pointer alone.
However, pulling the condition evaluation earlier is useful, it's been done in the past, and is mostly a compiler technique, but may be enhanced with some HW support (e.g. - hoisting branch condition). The main performance impact of the branch misprediction is the delay in evaluation though, since the branch recovery itself these days is pretty short.
This means that you can mitigate most of the penalty with a compiler hoisting the condition only and calculating this earlier, and without any HW modification - you're still paying the penalty of the flush in case you mispredicted the branch (and the odds are usually low with contemporary predictors), but you'll know that immediately upon decoding the branch itself (since the data will be ready in advance), so the damage will be limited to only a very few instructions that made it down the pipe past that branch.
Being able to hoist the evaluation isn't simple though. The compiler may be able to detect if there are any direct data dependencies in most cases (with do_some_work() in your example), but in most cases there will be. Loop invariants are one of the first things the compiler already moves today. In addition, some of the most hard-to-predict branches depend on some memory fetch, and you usually can't assume memory will stay the same (you can, with some special checks afterward, but most common compilers don't do that). Either way, it's still a compiler technique, and not a fundamental change in branch prediction.
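As a sketch of what hoisting a loop-invariant condition looks like at the source level (the function names and the "factor > 1" test are invented for the example), the second function evaluates the condition once, before the loops that depend on it. Optimizing compilers frequently perform this kind of transformation themselves, so writing it by hand is not necessarily a win.

```c
#include <stddef.h>

/* Condition evaluated inside the loop on every iteration. */
long scale_in_loop(const int *values, size_t n, int factor)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (factor > 1)                 /* loop-invariant test */
            total += (long)values[i] * factor;
        else
            total += values[i];
    }
    return total;
}

/* Same computation with the invariant condition hoisted: it is evaluated
 * once, before either loop starts. */
long scale_hoisted(const int *values, size_t n, int factor)
{
    const int use_factor = factor > 1;  /* computed early */
    long total = 0;

    if (use_factor) {
        for (size_t i = 0; i < n; i++)
            total += (long)values[i] * factor;
    } else {
        for (size_t i = 0; i < n; i++)
            total += values[i];
    }
    return total;
}
```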
Branch prediction is done because the CPU's instruction fetcher needs to know which instructions to fetch after a branch instruction, and this is not known until after the branch executes.
If a processor has a 5 stage pipeline (most processors have more) like this:
Instruction fetch
Instruction decode
Register read
ALU execution
Register write back
the fetcher will stall for 3 cycles because the branch result won't be known until after the ALU execution cycle.
Hoisting the branch test condition does not address the latency from fetching a branch instruction to its execution.
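The 3-cycle figure follows from simple counting, sketched here under the simplifying assumptions of the 5-stage example (one stage per cycle, stages numbered from 1, no prediction at all). The function name is made up for illustration.

```c
/* Stages are numbered from 1 (instruction fetch). If a branch is resolved
 * at the end of stage resolve_stage, the fetcher does not know the next
 * address for the resolve_stage - 1 cycles after it fetched the branch. */
unsigned fetch_stall_cycles(unsigned resolve_stage)
{
    return resolve_stage > 0 ? resolve_stage - 1 : 0;
}
```

For the pipeline above, ALU execution is stage 4, so fetch_stall_cycles(4) gives 3. A deeper pipeline that resolves branches at, say, stage 15 would lose 14 fetch cycles per branch without prediction.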
Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime that a cache flush might occur, but I don't have any data to back that.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.
This is very much dependent on the circumstances.
1 How often is the data in cache (L1, L2, L3), and how often must it be fetched all the way from RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2 What processor is it?
Older Intel Pentium 4 processors were famous for their long pipelines, and would take 25-30 clock cycles (~15ns at 2GHz) to "recover" from a mispredicted branch.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would think that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.
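To compare the cycle counts and the nanosecond figures above on the same scale, a small conversion helper is enough (the function name is just for illustration):

```c
/* Convert a cycle count to nanoseconds at a given clock rate in GHz.
 * At 1 GHz one cycle lasts 1 ns, so time = cycles / GHz. */
double cycles_to_ns(double cycles, double clock_ghz)
{
    return cycles / clock_ghz;
}
```

For example, cycles_to_ns(30, 2.0) is 15 ns, which is where the "~15ns at 2GHz" for a 30-cycle mispredict comes from, and it falls inside the 10-40ns range quoted for a fetch from RAM.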
It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
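If you want something concrete to benchmark, the two techniques might look like the following sketch. All names here are hypothetical. The first function needs the branch direction predicted; the second needs the target of the indirect call predicted.

```c
int add_one(int x)   { return x + 1; }
int double_it(int x) { return x * 2; }

/* Conditional branch: the direction must be predicted. */
int apply_conditional(int use_double, int x)
{
    if (use_double)
        return double_it(x);
    return add_one(x);
}

/* Indirect call through a function pointer: the target must be predicted. */
int apply_indirect(int (*op)(int), int x)
{
    return op(x);
}
```

To get meaningful numbers you would call each in a loop with both a predictable and an unpredictable sequence of choices, and make sure the compiler has not inlined the difference away.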
It varies from processor to processor, but depending on the set of data you're working with, a pipeline flush caused by a mispredicted branch (or badly ordered instructions in some cases) can be more damaging to the speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while an L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.
Since the processor would have to predict the conditional answer in order to plan which instruction has more chances of having to be executed, I would say that the actual cost of the instructions is not important.
Conditional instructions are bad efficiency wise because they make the process flow unpredictable.
I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer to be absolutely fascinating and it has completely changed my outlook on programming when dealing with branches that are based on Data.
I currently have a fairly basic, but fully functioning, interpreted Intel 8080 emulator written in C. The heart of the operation is a 256-entry switch-case table for handling each opcode. My initial thought was that this would obviously be the fastest method, as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency and one-off cases. A switch-case table full of pre-processor macros is very neat and easy to maintain.
Unfortunately, after reading the aforementioned post it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus every time the switch-case is navigated the pipeline would have to be completely wiped, resulting in a several cycle delay in what should otherwise be an incredibly quick program (There's not even so much as multiplication in my code).
I'm sure most of you are thinking "Oh, the solution here is simple, move to dynamic recompilation". Yes, this does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately my primary interest is emulating older 8-bit and 16-bit era consoles (the intel 8080 here is only an example as it's my simplest piece of emulated code) where cycle and timing keeping to the exact instruction is important as the Video and Sound must be processed based on these exact timings.
When dealing with this level of accuracy performance becomes an issue, even for older consoles (Look at bSnes for example). Is there any recourse or is this simply a matter-of-fact when dealing with processors with long pipelines?
On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking) and a single jump. The ifs shouldn't cause a problem with branch prediction, because it is unlikely you will have a bad opcode. The jump is not so friendly to the pipeline, but in the end it's only one jump for the whole switch statement.
I don't believe you can convert a long switch statement of op-codes into any other form that would result in better performance. This is of course, if your compiler is smart enough to convert it to a jump table. If not, you can do so manually.
If in doubt, implement other methods and measure performance.
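If you do end up building the jump table manually, a minimal sketch could look like this. The CPU state, the handlers and the opcode numbering are all made up for illustration and are not the real 8080 encoding.

```c
#include <stdint.h>

/* Hypothetical CPU state for a toy emulator; not the real 8080 register set. */
typedef struct {
    uint8_t a;      /* accumulator */
    uint8_t b;      /* second register */
} toy_cpu;

typedef void (*op_handler)(toy_cpu *cpu);

static void op_nop(toy_cpu *cpu)   { (void)cpu; }
static void op_inc_a(toy_cpu *cpu) { cpu->a++; }
static void op_add_b(toy_cpu *cpu) { cpu->a = (uint8_t)(cpu->a + cpu->b); }

/* Made-up opcode numbering: 0 = NOP, 1 = INC A, 2 = ADD B.
 * Entries not listed are zero-initialised (NULL) and ignored below. */
static const op_handler dispatch_table[256] = {
    [0] = op_nop,
    [1] = op_inc_a,
    [2] = op_add_b,
};

/* One table lookup and one indirect call per emulated instruction.
 * No range check is needed because a uint8_t can only index 0..255. */
void toy_step(toy_cpu *cpu, uint8_t opcode)
{
    op_handler handler = dispatch_table[opcode];
    if (handler)
        handler(cpu);
}
```

In a full emulator every one of the 256 entries would normally point at a handler, so the NULL check could be dropped.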
Edit
First of all, make sure you don't confuse branch prediction and branch target prediction.
Branch prediction works solely on conditional branch statements. It decides whether a branch condition will fail or succeed. It has nothing to do with where the jump goes.
Branch target prediction on the other hand tries to guess where the jump will end up in.
So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".
In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISC, the cost of that computation is much higher than the penalty of a single jump.
As the branches on your 256-way switch statement are densely packed the compiler will implement this as a jump table, so you're correct in that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 in "Optimizing software in C++" is a good place to start.
http://www.agner.org/optimize/?e=0,34
The only way you could avoid this penalty is by ensuring that the same instructions are executed regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency, so they are slower than a predictable branch) or by otherwise looking for symmetry in your code paths. Considering what you're trying to do, this is probably not going to be possible, and if it were, it would almost certainly add an overhead greater than the 15-25 clock cycles for the mispredict.
In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.
The indirect jump is probably the best thing to do for instruction decoding.
On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.
On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.
But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:
Instead of generating something that looks like
loop:
    load reg := next_instruction_bits // or byte or word
    load reg2 := instruction_table[reg]
    jmp [reg2]
label_instruction_00h_ADD: ...
    jmp loop
label_instruction_01h_SUB: ...
    jmp loop
...
generate the code as
loop:
    load reg := next_instruction_bits // or byte or word
    load reg2 := instruction_table[reg]
    jmp [reg2]
label_instruction_00h_ADD: ...
    load reg := next_instruction_bits // or byte or word
    load reg2 := instruction_table[reg]
    jmp [reg2]
label_instruction_01h_SUB: ...
    load reg := next_instruction_bits // or byte or word
    load reg2 := instruction_table[reg]
    jmp [reg2]
...
i.e. replace the jump to the top of the instruction fetch/decode/execute loop
by the code at the top of the loop at each place.
It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single target, PC indexed BTB will be quite a lot better in this latter, threaded, code, than on the original with only a single copy of the indirect jump.
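In C, one way to get this replicated-dispatch layout is the "labels as values" extension (&&label and goto *ptr) supported by GCC and Clang. It is not standard C, so treat the following as a sketch that assumes one of those compilers. The three-instruction bytecode is invented for the example.

```c
/* Made-up three-instruction bytecode for illustration. */
enum { OP_INC = 0, OP_DEC = 1, OP_HALT = 2 };

/* Runs a program and returns the final accumulator value.
 * The program must end with OP_HALT and contain only valid opcodes.
 * Uses the GNU "labels as values" extension, which is not standard C. */
int run_threaded(const unsigned char *program)
{
    static void *const targets[] = { &&do_inc, &&do_dec, &&do_halt };
    const unsigned char *pc = program;
    int acc = 0;

    goto *targets[*pc++];          /* initial dispatch */

do_inc:
    acc++;
    goto *targets[*pc++];          /* dispatch copy for INC */

do_dec:
    acc--;
    goto *targets[*pc++];          /* dispatch copy for DEC */

do_halt:
    return acc;
}
```

Each handler ends with its own indirect jump, so the hardware can learn a separate "what usually follows this instruction" target per handler instead of sharing one prediction slot for the whole interpreter.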
Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.
Good luck and have fun!
(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing to other nodes, or to a function to evaluate.
Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.
A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).
Data driven, i.e. a tree of nodes pointing to other nodes, because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions. Instead of having a branch misprediction when jumping to the case per instruction, then executing common code, then switching again and getting a second mispredict, you do the common code with slightly different parameters (like how many bits of the instruction stream you consume, and where the next set of bits to branch on is).
I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.
Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz
You may be able to see the tables used, if you care.
(Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)
Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)
I thought I'd add something since no one mentioned it.
Granted, the indirect jump is likely to be the best option.
However, should you go with the N-compare way, there are two things that come to my mind:
First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or test the number bit by bit if the value space is nearly full). This is a bit like a hash table: you implement a static tree to find the final element.
Second, you could run an analysis on the binary code you want to execute.
You could even do that per binary, before execution, and runtime-patch your emulator.
This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.
But I can't see this being faster than an average 15-cycle penalty, unless 99% of your instructions are MOV and you put an equality test for the MOV opcode before the other tests.
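A tiny sketch of the dichotomy idea, with a made-up split of the 8-bit opcode space into four ranges (a real decoder would continue the tree down to individual opcodes):

```c
/* Hypothetical grouping of 8-bit opcodes into four classes by range. */
enum op_class { CLASS_A, CLASS_B, CLASS_C, CLASS_D };

/* Two inequality compares select one of four ranges (log2(4) = 2),
 * instead of up to three range tests performed one after another. */
enum op_class classify_opcode(unsigned char opcode)
{
    if (opcode < 0x80) {
        if (opcode < 0x40)
            return CLASS_A;     /* 0x00-0x3F */
        return CLASS_B;         /* 0x40-0x7F */
    }
    if (opcode < 0xC0)
        return CLASS_C;         /* 0x80-0xBF */
    return CLASS_D;             /* 0xC0-0xFF */
}
```

With the histogram approach you would reorder or unbalance this tree so that the most frequent opcodes sit behind the fewest, most predictable compares.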
I am trying to find articles, books or anything else about programming without jumps (x86 architecture). I know that in general it is impossible, but I try to avoid jumps, and gcc uses jumps many times even with inline functions. Coding only in assembly is some sort of solution, but writing the equivalent of 1000 lines of C in assembly looks like hell to me.
Unless your jumps are really random, branch prediction should eliminate most of overhead involved.
I would dedicate more effort to optimizing memory access patterns in order to improve locality and reduce cache misses. These days, memory latency is the major bottleneck to performance.
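As an illustration of what an access pattern means here, consider this sketch (the function names and the flat row-by-row array layout are assumptions for the example). Both functions add up the same elements, but the first walks memory sequentially while the second strides through it.

```c
#include <stddef.h>

/* Row-major traversal of a rows x cols grid stored row by row:
 * consecutive accesses are adjacent in memory. */
long sum_row_major(const int *grid, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += grid[r * cols + c];
    return sum;
}

/* Column-major traversal of the same storage: each access is cols ints
 * away from the previous one, which tends to use the cache less
 * effectively once the grid is larger than the cache. */
long sum_column_major(const int *grid, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += grid[r * cols + c];
    return sum;
}
```

The results are identical; any difference shows up only in timing, and only for grids large enough that they no longer fit in cache.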
Another good direction is improving parallelism (using both vectorized SIMD instructions and, if possible, more than one core).
Optimize only performance-critical code, and only once you really know it is performance critical. Do not try to optimize jumps only because you read they cause a performance hit. Everything causes a performance hit, and the fastest possible code is the code which does nothing. There are other things much worse than jumps.
If you show a particular example of a jump in the generated code, chances are there will be some way to avoid it, but it is more likely that the code you show will contain more serious issues.
One particular way to avoid branches is to use "conditional move" instructions. They can be used e.g. to compute max or min. If you allow the compiler to use the SSE architecture, it assumes the CPU also supports the CMOV/FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions and will use them (beware: sometimes it may be tricky to make the compiler do what you want, see e.g. this gamedev.net discussion).
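At the C level, simple selects such as the following sketch are the kind of code that can end up as conditional moves. Whether a given compiler actually emits CMOV here is not guaranteed and depends on the target and optimization settings. The function name is made up.

```c
/* Clamp value into [low, high] using two simple selects (a max followed
 * by a min). Selects like these are candidates for conditional-move code
 * generation, but the compiler is free to use branches instead. */
int clamp_int(int value, int low, int high)
{
    int lower_bounded = value < low ? low : value;        /* max(value, low) */
    return lower_bounded > high ? high : lower_bounded;   /* min(..., high)  */
}
```

Checking the generated assembly is the only reliable way to see what you got.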
I think you may mean branching. In C there are bit-twiddling tricks you can use to speed up certain operations.
See bit hacks:
http://www-graphics.stanford.edu/~seander/bithacks.html
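One of the tricks of this kind is a minimum/maximum that uses no branch in the source code. The sketch below assumes two's complement integers, which is what practically every current machine uses. The function names are mine.

```c
/* Branch-free minimum and maximum of two ints. (x < y) is 0 or 1;
 * negating it gives 0 or -1, i.e. an all-zeros or all-ones mask that
 * decides whether the XOR difference is applied. */
int min_branchless(int x, int y)
{
    return y ^ ((x ^ y) & -(x < y));
}

int max_branchless(int x, int y)
{
    return x ^ ((x ^ y) & -(x < y));
}
```

On CPUs with good branch prediction or conditional moves, the obvious `x < y ? x : y` may well be just as fast or faster, so this is mostly useful as an illustration of the technique.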
It is not impossible to code without jumps but it seems pointless to try.
In the end if you need to do something more than once then your choices are:
Loop unrolling (i.e. repeating the code instead of looping).
Somehow get the instruction pointer to visit the same code more than once.
The first approach requires knowing the number of iterations in advance and doesn't scale, and the second involves some sort of jump.
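A partial form of unrolling is a common middle ground: it does not remove the loop branch, but it executes it less often. A minimal sketch, with invented function names and an unroll factor of four picked arbitrarily:

```c
#include <stddef.h>

/* Plain loop: one loop-control branch per element. */
long sum_simple(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Unrolled by four: one loop-control branch per four elements,
 * plus a short clean-up loop for the remainder. */
long sum_unrolled4(const int *v, size_t n)
{
    long sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        sum += (long)v[i] + v[i + 1] + v[i + 2] + v[i + 3];

    for (; i < n; i++)      /* remaining 0-3 elements */
        sum += v[i];

    return sum;
}
```

Optimizing compilers often unroll loops like this on their own, so it is worth checking the generated code before doing it by hand.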
Not knowing what your code looks like, it's hard to give any advice. But I will give it a try.
Before you start optimizing, run a profiling tool to locate the problem areas. After optimizing, run the profiling tool again to see if you actually made it faster.
It's hard to actually remove branches, but you can minimize them by doing loop unrolling.
Someone mentioned conditional move instructions. There are plenty of conditional instructions on the ARM architecture, but if they're not executed they will translate to a NOP and take one cycle each. I'm not sure how they work on x86. It might actually be slower than using a simple branch, depending on how long the pipeline is.
There's a lot of other optimizing tricks you could try before removing branches.