How to cancel branch prediction? [closed] - c

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
From reading this I came across the next two quotes:
First quote:
A typical case of unpredictable branch behavior is when the comparison result is dependent on data.
Second quote:
No Branches Means No Mispredicts
For my project, I work on a dependent data and I perform many if and switch statements. My project is related to Big Data so it has to be as efficient as possible. So I wanted to test it on the data provided by user, to see if branch prediction actually slows down my program or helps. As from reading here:
misprediction delay is between 10 and 20 clock cycles.
What shocked me most was:
Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code.
Why use branch prediction then ?
Is there a way to force the compiler to generate assembly code without branches ? or to disable branch prediction so that CPU? so I can compare both results ?

to see if branch prediction actually slows down my program or helps
Branch prediction doesn't slow down programs. When people talk about the cost of missed predictions, they're talking about how much more expensive a mispredicted branch is compared to a correctly predicted branch.
If branch prediction didn't exist, all branches would be as expensive as a mispredicted one.
So what "misprediction delay is between 10 and 20 clock cycles" really means is that successful branch prediction saves you 10 to 20 cycles.
Removing the branches not only improves runtime performance of the code, it also helps the compiler to optimize the code.
Why use branch prediction then ?
Why use branch prediction over removing branches? You shouldn't. If a compiler can remove branches, it will (assuming optimizations are enabled), and if programmers can remove branches (assuming it doesn't harm readability or it's a performance-critical piece of code), they should.
That hardly makes branch prediction useless though. Even if you remove as much branches as possible from a program, it will still contain many, many branches. So because of this and because of how expensive unpredicted branches are, branch prediction is essential for good performance.
Is there a way to force the compiler to generate assembly code without branches ?
An optimizing compiler will already remove branches from a program when it can (without changing the semantics of the program), but, unless we're talking about a very simple int main() {return 0;}-type program, it's impossible to remove all branches. Loops require branches (unless they're unrolled, but that only works if you know the number of iterations ahead of time) and so do most if- and switch-statements. If you can minimize the number of ifs, switches and loops in your program, great, but you won't be able to remove all of them.
or to disable branch prediction so that CPU? so I can compare both results ?
To the best of my knowledge it is impossible to disable branch prediction on x86 or x86-64 CPUs. And as I said, this would never improve performance (though it might make it predictable, but that's not usually a requirement in the contexts where these CPUs are used).

Modern processors have pipelines which allow the CPU to work a lot faster than it would be able to otherwise. This is a form of parallelism where it starts processing an instruction a few clock cycles before the instruction is actually needed. See here here for more details.
This works great until we hit a branch. Since we are jumping, the work that is in the pipeline is no longer relevant. The CPU then needs to flush the pipeline and restart again. This causes a delay of a few clock cycles until the pipeline is full again. This is known as a pipeline stall.
Modern CPUs are clever enough when it comes to unconditional jumps to follow the jump when filling the pipeline thus preventing the stall. This does not work when it comes to branching since the CPU does not know exactly where the jump is going to go.
Branch Prediction tries to solve this problem by making a guess as to which branch the CPU will follow before fully evaluating the jump. This (when it works) prevents the stall.
Since almost all programming involves making decisions, branching is unavoidable. But one certainly can write code with fewer branches and thus lessen the delays caused by misprediction. Once we are branching, branch prediction at least allows us a chance of getting things right and not having a CPU pipeline stall.

Related

Tell branch predictor which way to go every n iterations [duplicate]

Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box is fine.
Basically, I have an if statement that will 99% of the time evaluate to true, and am trying to eke out every last clock of performance, can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?
Yes, but it will have no effect. Exceptions are older (obsolete) architectures pre Netburst, and even then it doesn't do anything measurable.
There is an "branch hint" opcode Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted non taken) on some older architectures. GCC implements this with the __builtin_expect (x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architecures (>= Core 2). The small corner case where this actually does something is the case of a cold jump on the old Netburst architecture. Intel recommends now not to use the static branch hints, probably because they consider the increase of the code size more harmful than the possible marginal speed up.
Besides the useless branch hint for the predictor, __builtin_expect has its use, the compiler may reorder the code to improve cache usage or save memory.
There are multiple reasons it doesn't work as expected.
The processor can predict small loops (n<64) perfectly.
The processor can predict small repeating patterns (n~7) perfectly.
The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
The predictability (= probability a branch will get predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of branch is notoriously hard.
Read more about the inner works of the branch prediction at Agner Fogs manuals.
See also the gcc mailing list.
Yes. http://kerneltrap.org/node/4705
The __builtin_expect is a method that
gcc (versions >= 2.96) offer for
programmers to indicate branch
prediction information to the
compiler. The return value of
__builtin_expect is the first argument (which could only be an integer)
passed to it.
if (__builtin_expect (x, 0))
foo ();
[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.
Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html. And
Section 3.5 of Agner Fog's excellent asm opt guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.
Earlier and later x86 CPUs silently ignore those prefix bytes. Are there any performance test results for usage of likely/unlikely hints? mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to hardware to figure it out.
Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).
What is documented is for example that Core2 has a branch history buffer that can avoid mispredicting the loop-exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64bytes (or 19uops on Penryn) won't have instruction fetch bottlenecks because it replays from a buffer... go read Agner Fog's pdfs, they're excellent.
See also Why did Intel change the static branch prediction mechanism over these years? : Intel since Sandybridge doesn't use static prediction at all, as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is forward branches are not-taken and backward branches are taken (because backwards branches are often loop branches).)
The effect of likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) does not directly insert BP hints into the asm. (It might possibly do so with gcc -march=pentium4, but not when compiling for anything else).
The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)
See What is the advantage of GCC's __builtin_expect in if else statements? for a specific example of code-gen.
Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).
Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went for code-layout decisions, and to identify hot vs. cold blocks / functions. (e.g. it will unroll loops in hot functions but not cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual. How to use profile guided optimizations in g++?
Otherwise GCC has to guess using various heuristics, if you didn't use likely/unlikely macros and didn't use PGO. -fguess-branch-probability is enabled by default at -O1 and higher.
https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)
I suggest rather than worry about branch prediction, profile the code and optimize the code to reduce the number of branches. One example is loop unrolling and another using boolean programming techniques rather than using if statements.
Most processors love to prefetch statements. Generally, a branch statement will generate a fault within the processor causing it to flush the prefetch queue. This is where the biggest penalty is. To reduce this penalty time, rewrite (and design) the code so that fewer branches are available. Also, some processors can conditionally execute instructions without having to branch.
I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.
SUN C Studio has some pragmas defined for this case.
#pragma rarely_called ()
This works if one part of a conditional expression is a function call or starts with a function call.
But there is no way to tag a generic if/while statement
No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.
Also, obligatory comment about premature optimization and how it's evil.
EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.
This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.
Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.
EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.

Branch "anticipation" in modern CPUs

I was recently thinking about branch prediction in modern CPUs.
As far as I understand, branch prediction is necessary, because when executing instructions in a pipeline, we don't know the result of the conditional operation right before taking the branch.
Since I know that modern out-of-order CPUs can execute instructions in any order, as long as the data dependencies between them are met, my question is, can CPUs reorder instructions in such a way that the branch target is already known by the time the CPU needs to take the branch, thus can "anticipate" the branch direction, so it doesn't need to guess at all?
So can the CPU turn this:
do_some_work();
if(condition()) //evaluating here requires the cpu to guess the direction or stall
do_this();
else
do_that();
To this:
bool result = condition();
do_some_work(); //bunch of instructions that take longer than the pipeline length
if(result) //value of result is known, thus decision is always 100% correct
do_this();
else
do_that();
A particular and very common use case would be iterating over collections, where the exit condition is often loop-invariant(since we usually don't modify the collection while iterating over it).
My question is can modern generally CPUs do this, and if so, which particular CPU cores are known to have this feature?
Keep in mind that branch prediction is done so early along the pipe, that you still don't have the instruction decoded, and you can't resolve the data dependency because you don't know which register is used. You may be able to remember that somewhere, but that's not 100% (since your storage capacity/time will be limited), so that's pretty much what your normal branch predictor alreay does - speculate the target based on the instruction pointer alone.
However, pulling the condition evaluation earlier is useful, it's been done in the past, and is mostly a compiler technique, but may be enhanced with some HW support (e.g. - hoisting branch condition). The main performance impact of the branch misprediction is the delay in evaluation though, since the branch recovery itself these days is pretty short.
This means that you can mitigate most of the penalty with a compiler hoisting the condition only and calculating this earlier, and without any HW modification - you're still paying the penalty of the flush in case you mispredicted the branch (and the odds are usually low with contemporary predictors), but you'll know that immediately upon decoding the branch itself (since the data will be ready in advance), so the damage will be limited to only a very few instructions that made it down the pipe past that branch.
Being able to hoist the evaluation isn't simple though. The compiler may be able to detect if there are any direct data dependencies in most cases (with do_some_work() in your example), but in most cases there will be. Loop invariants are one of the first things the compiler already moves today. In addition, some of the most hard-to-predict branches depend on some memory fetch, and you usually can't assume memory will stay the same (you can, with some special checks afterward, but most common compilers don't do that). Either way, it's still a compiler technique, and not a fundamental change in branch prediction.
Branch prediction is done because the CPU's instruction fetcher needs to know which instructions to fetch after a branch instruction, and this is not known until after the branch executes.
If a processor has a 5 stage pipeline (most processors have more) like this:
Instruction fetch
Instruction decode
Register read
ALU execution
Register write back
the fetcher will stall for 3 cycles because the branch result won't be known until after the ALU execution cycle.
Hoisting the branch test condition does not address the latency from fetching a branch instruction to its execution.

How does "goto" statements affect the "branch prediction" of the CPU?

To learn more about the CPU and code optimization I have started to study Assembly programming. I have also read about clever optimizations like "branch prediction" that the CPU does to speed itself up.
My question might seem foolish since I do not know the subject very well yet.
I have a very vague memory that I have read somewhere (on the internet) that goto statements will decrease the performance of a program because it does not work well with the branch prediction in the CPU. This might however just be something that I made up and did not actually read.
I think that it could be true.
I hope this example (in pseudo-C) will clarify why I think that is so:
int function(...) {
VARIABLES DECLARED HERE
if (HERE IS A TEST) {
CODE HERE ...
} else if (ANOTHER TEST) {
CODE HERE ...
} else {
/*
Let us assume that the CPU was smart and predicted this path.
What about the jump to `label`?
Is it possible for the CPU to "pre-fetch" the instructions over there?
*/
goto label;
}
CODE HERE...
label:
CODE HERE...
}
To me it seems like a very complex task. That is because then the CPU will need to look up the place where the goto jumps to inorder to be able to pre-fetch the instructions over there.
Do you know anything about this?
Unconditional branches are not a problem for the branch predictor, because the branch predictor doesn't have to predict them.
They add a bit of complexity to the speculative instruction fetch unit, because the existence of branches (and other instructions which change the instruction pointer) means that instructions are not always fetched in linear order. Of course, this applies to conditional branches too.
Remember, branch prediction and speculative execution are different things. You don't need branch prediction for speculative execution: you can just speculatively execute code assuming that branches are never taken, and if you ever do take a branch, cancel out all the operations from beyond that branch. That would be a particularly stupid thing to do in the case of unconditional branches, but it would keep the logic nice and simple. (IIRC, this was how the first pipelined processors worked.)
(I guess you could have branch prediction without speculative execution, but there wouldn't really be a point to it, since the branch predictor wouldn't have anybody to tell its predictions to.)
So yes, branches -- both conditional and unconditional -- increase the complexity of instruction fetch units. That's okay. CPU architects are some pretty smart people.
EDIT: Back in the bad old days, it was observed that the use of goto statements could adversely affect the ability of the compilers of the day to optimize code. This might be what you were thinking of. Modern compilers are much smarter, and in general are not taken too much aback by goto.
due to 'pipelining' and similar activities,
the branch instruction could actually be placed several instructions
before the location where the actual branch is to occur.
(this is part of the branch prediction logic found in the compiler).
a goto statement is just a jump instruction.
As a side note:
Given structured programming concepts,
code clarity, readability, maintainability considerations, etc;
the 'goto' statement should never be used.
on most CPUs,
any jump/call/return type of instruction will flush the prefetch cache
then reload that cache from the new location, IF the new location
is not already in the cache.
Note: for small loops,
which will always will contain 'at least' one jump instruction,
many CPUs have an internal buffer that the programmer can exploit
to make small loops only perform one prefetch sequence
and therefore execute many orders of magnitude faster.

Why instrumented C program runs faster?

I am working on a (quite large) existing monothreaded C application. In this context I modified the application to perform some very few additional work consisting in incrementing a counter each time we call a special function (this function is called ~ 80.000 times). The application is compiled on an Ubuntu 12.04 running a 64 bits Linux kernel 3.2.0-31-generic with -O3 option.
Surprisingly the instrumented version of the code is running faster and I am investigating why.I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and to get representative results, I am reporting an average execution time value over 100 runs. Moreover, to avoid interference from outside world, I tried as much as possible to launch the application in a system without any other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall clock time, other applications "should" in theory only affect cache and not directly the process execution time)
I was suspecting "instruction cache effects", maybe the instrumented code that is a little bit larger (few bytes) fits differently and better in the cache, is this hypothesis conceivable ? I tried to do some cache investigations with valegrind --tool=cachegrind but unfortunately, the instrumented version has (as it seems logical) more cache misses than the initial version.
Any hints on this subject and ideas that may help to find why instrumented code is running faster are welcomes (some GCC optimizations available in one case and not in the other, why ?, ...)
Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.
Very few additional work (such as incrementing a counter) might alter compiler's decision on whether to apply some optimizations or not. Compiler has not always enough information to make perfect choice. It may try to optimize for speed where bottleneck is code size. It may try to auto-vectorize computations when there is not too much data to process. Compiler may not know what kind of data is to be processed or what is the exact model of CPU, that will execute the code.
Incrementing a counter may increase size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for instruction or microcode caches or for loop buffer and allows CPU to fetch/decode instructions quickly).
Incrementing a counter may increase size of some function and prevent inlining. This also may decrease code size.
Incrementing a counter may prevent auto-vectorization, which again may decrease code size.
Even if this change does not affect compiler optimization, it may alter the way how the code is executed by CPU.
If you insert counter-incrementing code in place, full of branch targets, this may make branch targets less dense and improve branch prediction.
If you insert counter-incrementing code in front of some particular branch target, this may make branch target's address better aligned and make code fetch faster.
If you place counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load operation may be completed earlier.
Insertion of counter-incrementing code may prevent two conflicting load attempts to the same bank in L1 data cache.
Insertion of counter-incrementing code may alter some CPU scheduler decision and make some execution port available just in time for some performance-critical instruction.
To investigate effects of compiler optimization, you can compare generated assembler code before and after addition of counter-incrementing code.
To investigate CPU effects, use a profiler allowing to inspect processor performance counters.
Just guessing from my experience with embedded compilers, Optimization tools in compilers look for recursive tasks. Perhaps the additional code forced the compiler to see something more recursive and it structured the machine code differently. Compilers do some weird things for optimization. In some languages (Perl I think?) a "not not" conditional is faster to execute than a "true" conditional. Does your debugging tool allow you to single step through a code/assembly comparison? This could add some insight as to what the compiler decided to do with the extra tasks.

Is it possible to tell the branch predictor how likely it is to follow the branch?

Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box is fine.
Basically, I have an if statement that will 99% of the time evaluate to true, and am trying to eke out every last clock of performance, can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?
Yes, but it will have no effect. Exceptions are older (obsolete) architectures pre Netburst, and even then it doesn't do anything measurable.
There is an "branch hint" opcode Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted non taken) on some older architectures. GCC implements this with the __builtin_expect (x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architecures (>= Core 2). The small corner case where this actually does something is the case of a cold jump on the old Netburst architecture. Intel recommends now not to use the static branch hints, probably because they consider the increase of the code size more harmful than the possible marginal speed up.
Besides the useless branch hint for the predictor, __builtin_expect has its use, the compiler may reorder the code to improve cache usage or save memory.
There are multiple reasons it doesn't work as expected.
The processor can predict small loops (n<64) perfectly.
The processor can predict small repeating patterns (n~7) perfectly.
The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
The predictability (= probability a branch will get predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of branch is notoriously hard.
Read more about the inner works of the branch prediction at Agner Fogs manuals.
See also the gcc mailing list.
Yes. http://kerneltrap.org/node/4705
The __builtin_expect is a method that
gcc (versions >= 2.96) offer for
programmers to indicate branch
prediction information to the
compiler. The return value of
__builtin_expect is the first argument (which could only be an integer)
passed to it.
if (__builtin_expect (x, 0))
foo ();
[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.
Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html. And
Section 3.5 of Agner Fog's excellent asm opt guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.
Earlier and later x86 CPUs silently ignore those prefix bytes. Are there any performance test results for usage of likely/unlikely hints? mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to hardware to figure it out.
Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).
What is documented is for example that Core2 has a branch history buffer that can avoid mispredicting the loop-exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64bytes (or 19uops on Penryn) won't have instruction fetch bottlenecks because it replays from a buffer... go read Agner Fog's pdfs, they're excellent.
See also Why did Intel change the static branch prediction mechanism over these years? : Intel since Sandybridge doesn't use static prediction at all, as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is forward branches are not-taken and backward branches are taken (because backwards branches are often loop branches).)
The effect of likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) does not directly insert BP hints into the asm. (It might possibly do so with gcc -march=pentium4, but not when compiling for anything else).
The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)
See What is the advantage of GCC's __builtin_expect in if else statements? for a specific example of code-gen.
Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).
Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went for code-layout decisions, and to identify hot vs. cold blocks / functions. (e.g. it will unroll loops in hot functions but not cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual. How to use profile guided optimizations in g++?
Otherwise GCC has to guess using various heuristics, if you didn't use likely/unlikely macros and didn't use PGO. -fguess-branch-probability is enabled by default at -O1 and higher.
https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)
I suggest rather than worry about branch prediction, profile the code and optimize the code to reduce the number of branches. One example is loop unrolling and another using boolean programming techniques rather than using if statements.
Most processors love to prefetch statements. Generally, a branch statement will generate a fault within the processor causing it to flush the prefetch queue. This is where the biggest penalty is. To reduce this penalty time, rewrite (and design) the code so that fewer branches are available. Also, some processors can conditionally execute instructions without having to branch.
I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.
SUN C Studio has some pragmas defined for this case.
#pragma rarely_called ()
This works if one part of a conditional expression is a function call or starts with a function call.
But there is no way to tag a generic if/while statement
No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.
Also, obligatory comment about premature optimization and how it's evil.
EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.
This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.
Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.
EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.

Resources