Tell branch predictor which way to go every n iterations [duplicate]

Tell branch predictor which way to go every n iterations [duplicate] - c

Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box is fine.
Basically, I have an if statement that will 99% of the time evaluate to true, and am trying to eke out every last clock of performance, can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?

Yes, but it will have no effect. Exceptions are older (obsolete) architectures pre Netburst, and even then it doesn't do anything measurable.
There is an "branch hint" opcode Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted non taken) on some older architectures. GCC implements this with the __builtin_expect (x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architecures (>= Core 2). The small corner case where this actually does something is the case of a cold jump on the old Netburst architecture. Intel recommends now not to use the static branch hints, probably because they consider the increase of the code size more harmful than the possible marginal speed up.
Besides the useless branch hint for the predictor, __builtin_expect has its use, the compiler may reorder the code to improve cache usage or save memory.
There are multiple reasons it doesn't work as expected.
The processor can predict small loops (n<64) perfectly.
The processor can predict small repeating patterns (n~7) perfectly.
The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
The predictability (= probability a branch will get predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of branch is notoriously hard.
Read more about the inner works of the branch prediction at Agner Fogs manuals.
See also the gcc mailing list.

Yes. http://kerneltrap.org/node/4705
The __builtin_expect is a method that
gcc (versions >= 2.96) offer for
programmers to indicate branch
prediction information to the
compiler. The return value of
__builtin_expect is the first argument (which could only be an integer)
passed to it.
if (__builtin_expect (x, 0))
foo ();
[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.

Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html. And
Section 3.5 of Agner Fog's excellent asm opt guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.
Earlier and later x86 CPUs silently ignore those prefix bytes. Are there any performance test results for usage of likely/unlikely hints? mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to hardware to figure it out.
Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).
What is documented is for example that Core2 has a branch history buffer that can avoid mispredicting the loop-exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64bytes (or 19uops on Penryn) won't have instruction fetch bottlenecks because it replays from a buffer... go read Agner Fog's pdfs, they're excellent.
See also Why did Intel change the static branch prediction mechanism over these years? : Intel since Sandybridge doesn't use static prediction at all, as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is forward branches are not-taken and backward branches are taken (because backwards branches are often loop branches).)
The effect of likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) does not directly insert BP hints into the asm. (It might possibly do so with gcc -march=pentium4, but not when compiling for anything else).
The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)
See What is the advantage of GCC's __builtin_expect in if else statements? for a specific example of code-gen.
Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).
Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went for code-layout decisions, and to identify hot vs. cold blocks / functions. (e.g. it will unroll loops in hot functions but not cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual. How to use profile guided optimizations in g++?
Otherwise GCC has to guess using various heuristics, if you didn't use likely/unlikely macros and didn't use PGO. -fguess-branch-probability is enabled by default at -O1 and higher.
https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)

I suggest rather than worry about branch prediction, profile the code and optimize the code to reduce the number of branches. One example is loop unrolling and another using boolean programming techniques rather than using if statements.
Most processors love to prefetch statements. Generally, a branch statement will generate a fault within the processor causing it to flush the prefetch queue. This is where the biggest penalty is. To reduce this penalty time, rewrite (and design) the code so that fewer branches are available. Also, some processors can conditionally execute instructions without having to branch.
I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.

SUN C Studio has some pragmas defined for this case.
#pragma rarely_called ()
This works if one part of a conditional expression is a function call or starts with a function call.
But there is no way to tag a generic if/while statement

No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.
Also, obligatory comment about premature optimization and how it's evil.
EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.

This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.
Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.
EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.

Related

Is Eigen NEON backend optimized to take advantage of the 2x128b NEON execution units which exist starting from ARM A76?

Going over Eigen documentation, its not clear whether it was updated since the release of A76 CPU core to take advantage of the wider SIMD it contains (2x128b vs. previous 128b)
I am hoping someone from the development team (or an expert user) can help clarifying that.

I'm not familiar with Eigen in particular, but in general, one doesn't need to do much to SIMD code to take advantage of different amounts of hardware execution units - especially when the CPUs support out of order execution, they will pick up more instructions that can be executed in parallel when there's more execution units.
If compiling e.g. SIMD intrinsics with a compiler, the compiler may be able to tune the exact scheduling of code if told to optimize specifically for that core (and if the compiler knows the scheduling characteristics for the core). Same thing for handwritten assembly code - it can be tuned and tweaked a bit for different cores' characteristics, but in most cases, it doesn't change very dramatically; more capable cores will execute it faster.
(The factor that primarily affects the bigger picture of how the code is written, which would require a proper rewrite to take advantage of, is usually the number of registers available in the instruction set - but that doesn't change with a hardware implementation with more execution units.)

large performance drop with gcc, maybe related to inline

I'm currently experiencing some weird effect with gcc (tested version: 4.8.4).
I've got a performance oriented code, which runs pretty fast. Its speed depends for a large part on inlining many small functions.
Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) into a common C file, into which I'm developing a codec, and its associated decoder. It's "relatively" large by my standard (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that, if that is possible.
Encoder and Decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common, except a few typedef and very low-level functions (such as reading from unaligned memory position).
The strange effect is this one:
I recently added a new function fnew to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c file.
The simple fact that it exists makes the performance of the decoder function fdec drops substantially, by more than 20%, which is way too much to be ignored.
Now, keep in mind than encoding and decoding operations are completely separated, and share almost nothing, save some minor typedef (u32, u16 and such) and associated operations (read/write).
When defining the new encoding function fnew as static, performance of the decoder fdec increases back to normal. Since fnew isn't called from the .c, I guess it's the same as if it was not there (dead code elimination).
If static fnew is now called from the encoder side, performance of fdec remains strong.
But as soon as fnew is modified, fdec performance just drops substantially.
Presuming fnew modifications crossed a threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40.) And it worked : performance of fdec is now back to normal.
And I guess this game will continue forever with each little modification of fnew or anything else similar, requiring further tweak.
This is just plain weird. There is no logical reason for some little modification in function fnew to have knock-on effect on completely unrelated function fdec, which only relation is to be in the same file.
The only tentative explanation I could invent so far is that maybe the simple presence of fnew is enough to cross some kind of global file threshold which would impact fdec. fnew can be made "not present" when it's: 1. not there, 2. static but not called from anywhere 3. static and small enough to be inlined. But it's just hiding the problem. Does that mean I can't add any new function?
Really, I couldn't find any satisfying explanation anywhere on the net.
I was curious to know if someone already experienced some equivalent side-effect, and found a solution to it.
[Edit]
Let's go for some more crazy test.
Now I'm adding another completely useless function, just to play with. Its content is strictly exactly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf.
When wtf exists, it doesn't matter if fnew is static or not, nor what is the value of max-inline-insns-auto: performance of fdec is back to normal.
Even though wtf is not used nor called from anywhere... :'(
[Edit 2]
there is no inline instruction. All functions are either normal or static. Inlining decision is solely within compiler's realm, which has worked fine so far.
[Edit 3]
As suggested by Peter Cordes, the issue is not related to inline, but to instruction alignment. On newer Intel cpus (Sandy Bridge and later), hot loop benefit from being aligned on 32-bytes boundaries.
Problem is, by default, gcc align them on 16-bytes boundaries. Which gives a 50% chance to be on proper alignment depending on length of previous code. Hence a difficult to understand issue, which "looks random".
Not all loop are sensitive. It only matters for critical loops, and only if their length make them cross one more 32-bytes instruction segment when being less ideally aligned.

Turning my comments into an answer, because it was turning into a long discussion. Discussion showed that the performance problem is sensitive to alignment.
There are links to some perf-tuning info at https://stackoverflow.com/tags/x86/info, include Intel's optimization guide, and Agner Fog's very excellent stuff. Some of Agner Fog's assembly optimization advice doesn't fully apply to Sandybridge and later CPUs. If you want the low-level details on a specific CPU, though, the microarch guide is very good.
Without at least an external link to code that I can try myself, I can't do more than handwave. If you don't post the code anywher, you're going to need to use profiling / CPU performance counter tools like Linux perf or Intel VTune to track this down in a reasonable amount of time.
In chat, the OP found someone else having this issue, but with code posted. This is probably the same issue the OP is seeing, and is one of the major ways code alignment matters for Sandybridge-style uop caches.
There's a 32B boundary in the middle of the loop in the slow version. The instructions that start before the boundary decode to 5 uops. So in the first cycle, the uop cache serves up mov/add/movzbl/mov. In the 2nd cycle, there's only a single mov uop left in the current cache line. Then the 3rd cycle cycle issues the last 2 uops of the loop: add and cmp+ja.
The problematic mov starts at 0x..ff. I guess instructions that span a 32B boundary go into (one of) the uop cacheline(s) for their starting address.
In the fast version, an iteration only takes 2 cycles to issue: The same first cycle, then mov / add / cmp+ja in the 2nd.
If one of the first 4 instructions had been one byte longer (e.g. padded with a useless prefix, or a REX prefix), there would be no problem. There wouldn't be an odd-man-out at the end of the first cacheline, because the mov would start after the 32B boundary and be part of the next uop cache line.
AFAIK, assemble & check disassembly output is the only way to use longer versions of the same instructions (see Agner Fog's Optimizing Assembly) to get 32B boundaries at multiples of 4 uops. I'm not aware of a GUI that shows alignment of assembled code as you're editing. (And obviously, doing this only works for hand-written asm, and is brittle. Changing the code at all will break the hand-alignment.)
This is why Intel's optimization guide recommends aligning critical loops to 32B.
It would be really cool if an assembler had a way to request that preceding instructions be assembled using longer encodings to pad out to a certain length. Maybe a .startencodealign / .endencodealign 32 pair of directives, to apply padding to code between the directives to make it end on a 32B boundary. This could make terrible code if used badly, though.
Changes to the inlining parameter will change the size of functions, and bump other code over by multiples 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions.
I was expecting the compiler to always make sure a function starts at
ideal aligned position, using noop to fill gaps.
There's a tradeoff. It would hurt performance to align every function to 64B (the start of a cache line). Code density would go down, with more cache lines needed to hold the instructions. 16B is good, because it's the instruction fetch/decode chunk size on most recent CPUs.
Agner Fog has the low-level details for each microarch. He hasn't updated it for Broadwell, though, but the uop cache probably hasn't changed since Sandybridge. I assume there's one fairly small loop that dominates the runtime. I'm not sure exactly what to look for first. Maybe the "slow" version has some branch targets near the end of a 32B block of code (and hence near the end of a uop cacheline), leading to significantly less than 4 uops per clock coming out of the frontend.
Look at performance counters for the "slow" and "fast" versions (e.g. with perf stat ./cmd), and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. (e.g. with perf record ./cmd && perf report on Linux).
How many uops/clock is the "fast" version getting? If it's above 3, frontend bottlenecks (maybe in the uop cache) that are sensitive to alignment could be the issue. Either that or L1 / uop-cache misses if different alignment means your code needs more cache lines than are available.
Anyway, this bears repeating: use a profiler / performance counters to find the new bottleneck that the "slow" version has, but the "fast" version doesn't. Then you can spend time looking at the disassembly of that block of code. (Don't look at gcc's asm output. You need to see the alignment in the disassembly of the final binary.) Look at the 16B and 32B boundaries, since presumably they'll be in different places between the two versions, and we think that's the cause of the problem.
Alignment can also make macro-fusion fail, if a compare/jcc splits a 16B boundary exactly. Although that is unlikely in your case, since your functions are always aligned to some multiple of 16B.
re: automated tools for alignment: no, I'm not aware of anything that can look at a binary and tell you anything useful about alignment. I wish there was an editor to show groups of 4 uops and 32B boundaries alongside your code, and update as you edit.
Intel's IACA can sometimes be useful for analyzing a loop, but IIRC it doesn't know about taken branches, and I think doesn't have a sophisticated model of the frontend, which is obviously the issue if misalignment breaks performance for you.

In my experience, the performance drops may be caused by disabling inlining optimization.
The 'inline' modifier doesn't indicate to force a function to be inlined. It gives compilers a hint to inline a function. So when the compiler's criteria of inlining optimizaion will not be satisfied by trivial modifications of code, a function which is modified with inline is normally compiled to a static function.
And there is a thing make the problem more complex, nested inline optimizations. If you have a inline function, fA, that calls a inline function, fB, like this:
inline void fB(int x, int y) {
return x * y;
}
inline void fA() {
for(int i = 0; i < 0x10000000; ++i) {
fB(i, i+1);
}
}
void main() {
fA();
}
In this case, we expect that both fA and fB are inlined. But if the inlining criteia is not met, the performance can't be predictable. That is, large performance drops occur when inlining is disable about fB, but very slight drops for fA. And you know, compiler's internal decisions are much complex.
The reasons cause disabling inlining, for example, size of inlining function, size of .c file, number of local variables, and so on.
Actually, in C#, I am experienced this performance drops. In my case, 60% performance drop occurs when one local variable is added to a simple inlining function.
EDIT:
You can investigate what happens by reading compiled assembly code. I guess there are unexpected real callings to functions modified with 'inline'.

Why instrumented C program runs faster?

I am working on a (quite large) existing monothreaded C application. In this context I modified the application to perform some very few additional work consisting in incrementing a counter each time we call a special function (this function is called ~ 80.000 times). The application is compiled on an Ubuntu 12.04 running a 64 bits Linux kernel 3.2.0-31-generic with -O3 option.
Surprisingly the instrumented version of the code is running faster and I am investigating why.I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and to get representative results, I am reporting an average execution time value over 100 runs. Moreover, to avoid interference from outside world, I tried as much as possible to launch the application in a system without any other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall clock time, other applications "should" in theory only affect cache and not directly the process execution time)
I was suspecting "instruction cache effects", maybe the instrumented code that is a little bit larger (few bytes) fits differently and better in the cache, is this hypothesis conceivable ? I tried to do some cache investigations with valegrind --tool=cachegrind but unfortunately, the instrumented version has (as it seems logical) more cache misses than the initial version.
Any hints on this subject and ideas that may help to find why instrumented code is running faster are welcomes (some GCC optimizations available in one case and not in the other, why ?, ...)

Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.
Very few additional work (such as incrementing a counter) might alter compiler's decision on whether to apply some optimizations or not. Compiler has not always enough information to make perfect choice. It may try to optimize for speed where bottleneck is code size. It may try to auto-vectorize computations when there is not too much data to process. Compiler may not know what kind of data is to be processed or what is the exact model of CPU, that will execute the code.
Incrementing a counter may increase size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for instruction or microcode caches or for loop buffer and allows CPU to fetch/decode instructions quickly).
Incrementing a counter may increase size of some function and prevent inlining. This also may decrease code size.
Incrementing a counter may prevent auto-vectorization, which again may decrease code size.
Even if this change does not affect compiler optimization, it may alter the way how the code is executed by CPU.
If you insert counter-incrementing code in place, full of branch targets, this may make branch targets less dense and improve branch prediction.
If you insert counter-incrementing code in front of some particular branch target, this may make branch target's address better aligned and make code fetch faster.
If you place counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load operation may be completed earlier.
Insertion of counter-incrementing code may prevent two conflicting load attempts to the same bank in L1 data cache.
Insertion of counter-incrementing code may alter some CPU scheduler decision and make some execution port available just in time for some performance-critical instruction.
To investigate effects of compiler optimization, you can compare generated assembler code before and after addition of counter-incrementing code.
To investigate CPU effects, use a profiler allowing to inspect processor performance counters.

Just guessing from my experience with embedded compilers, Optimization tools in compilers look for recursive tasks. Perhaps the additional code forced the compiler to see something more recursive and it structured the machine code differently. Compilers do some weird things for optimization. In some languages (Perl I think?) a "not not" conditional is faster to execute than a "true" conditional. Does your debugging tool allow you to single step through a code/assembly comparison? This could add some insight as to what the compiler decided to do with the extra tasks.

How prevalent is branch prediction on current CPUs?

Due to the huge impact on performance, I never wonder if my current day desktop CPU has branch prediction. Of course it does. But how about the various ARM offerings? Does iPhone or android phones have branch prediction? The older Nintendo DS? How about PowerPC based Wii? PS 3?
Whether they have a complex prediction unit is not so important, but if they have at least some dynamic prediction, and whether they do some execution of instructions following an expected branch.
What is the cutoff for CPUs with branch prediction? A hand held calculator from decades ago obviously doesn't have one, while my desktop does. But can anyone more clearly outline where one can expect dynamic branch prediction?
If it is unclear, I am talking about the kind of prediction where the condition is changing, varying the expected path during runtime.

Any CPU with a pipeline beyond a few stages requires at least some primitive branch prediction, otherwise it can stall waiting on computation results in order to decide which way to go. The Intel Atom is an in-order core, but with a fairly deep pipeline, and it therefore requires a pretty decent branch predictor.
Old ARM 7 designs were only three stages. Combine that with things like branch delay slots (required on MIPS, optional on SPARC), and branch prediction isn't so useful.
Incidentally, when MIPS decided to get more performance by going beyond 4 pipeline stages, the branch delay slot became an annoyance. In the original design, it was necessary, because there was no branch predictor. Therefore, you had to sequence your branch instruction prior to the last instruction to be executed before the branch. With the longer pipeline, they needed a branch predictor, obviating the need for a branch delay slot, but they had to emulate it anyway in order to run older code.
The problem with a branch delay slot is that it can only be filled with a useful instruction about 50% of the time. The rest of the time, you either fill it with an instruction whose result is likely to be thrown away, or you use a NO-OP.

Modern high end superscalar CPUs with long pipelines (which means almost all CPUs commonly found in desktops and servers) have quite sophisticated branch prediction these days.
Most ARM CPUs do not have branch prediction, which saves silicon and power consumption, but ARM CPUs generally have relatively short pipelines. Also the support for conditional execution of most instructions in the ARM ISA helps to reduce the number of branches required (and hence mitigates the cost of branch misprediction stalls).

Branch prediction is getting more important and emphasized while ARM is getting more complicated.
For example new 64-bit ARM architecture called ARMv8 drops most use of conditional execution (mainly due to instruction encoding space restrictions with increased number of registers) and relies on branch prediction to keep performance at acceptable levels.
Even for newer ARMv7-a devices you can check terrible cases like unsorted data question on SO, which branch prediction improvement is around 3x.

Not so much for the ARM Cortex-A8 (though it does have some branch prediction), but I believe the Cortex-A9 is out-of-order super-scalar, with complex branch prediction.

You can expect Dynamic Branch predictor in any out of order processor, those processors not only rely on pipelining but also fetch multiple instructions at the time, and they have multiple execution units(Floating point units, ALU), more registers; to increase the instruction execution, you have multiple instructions on the fly on any given moment, of course branches are a problem if you want to keep all that machinery utilization high so this kind of processors, rely on dynamic branch prediction in order to keep throughput and utilization very high.
You can expect any server to have dynamic branch prediction, also desktops, in the past embedded systems like the ARM chips in current smartphones did not have branch predictions since they had smaller pipelines, and they did not have out of order execution, but as Moore's law give us more transistor per area, you will start seeing more and more processors increasing their architecture. So to answer your question, besides the obvious looking for the CPU specs, you can expect to have branch prediction on chips of 32 Bits, bigger pipelines, out of order exection. The most recent chips from ARM are moving in some level to this directions.

Is it possible to tell the branch predictor how likely it is to follow the branch?

Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box is fine.
Basically, I have an if statement that will 99% of the time evaluate to true, and am trying to eke out every last clock of performance, can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?

Yes, but it will have no effect. Exceptions are older (obsolete) architectures pre Netburst, and even then it doesn't do anything measurable.
There is an "branch hint" opcode Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted non taken) on some older architectures. GCC implements this with the __builtin_expect (x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architecures (>= Core 2). The small corner case where this actually does something is the case of a cold jump on the old Netburst architecture. Intel recommends now not to use the static branch hints, probably because they consider the increase of the code size more harmful than the possible marginal speed up.
Besides the useless branch hint for the predictor, __builtin_expect has its use, the compiler may reorder the code to improve cache usage or save memory.
There are multiple reasons it doesn't work as expected.
The processor can predict small loops (n<64) perfectly.
The processor can predict small repeating patterns (n~7) perfectly.
The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
The predictability (= probability a branch will get predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of branch is notoriously hard.
Read more about the inner works of the branch prediction at Agner Fogs manuals.
See also the gcc mailing list.

Yes. http://kerneltrap.org/node/4705
The __builtin_expect is a method that
gcc (versions >= 2.96) offer for
programmers to indicate branch
prediction information to the
compiler. The return value of
__builtin_expect is the first argument (which could only be an integer)
passed to it.
if (__builtin_expect (x, 0))
foo ();
[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.

Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html. And
Section 3.5 of Agner Fog's excellent asm opt guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.
Earlier and later x86 CPUs silently ignore those prefix bytes. Are there any performance test results for usage of likely/unlikely hints? mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to hardware to figure it out.
Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).
What is documented is for example that Core2 has a branch history buffer that can avoid mispredicting the loop-exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64bytes (or 19uops on Penryn) won't have instruction fetch bottlenecks because it replays from a buffer... go read Agner Fog's pdfs, they're excellent.
See also Why did Intel change the static branch prediction mechanism over these years? : Intel since Sandybridge doesn't use static prediction at all, as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is forward branches are not-taken and backward branches are taken (because backwards branches are often loop branches).)
The effect of likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) does not directly insert BP hints into the asm. (It might possibly do so with gcc -march=pentium4, but not when compiling for anything else).
The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)
See What is the advantage of GCC's __builtin_expect in if else statements? for a specific example of code-gen.
Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).
Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went for code-layout decisions, and to identify hot vs. cold blocks / functions. (e.g. it will unroll loops in hot functions but not cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual. How to use profile guided optimizations in g++?
Otherwise GCC has to guess using various heuristics, if you didn't use likely/unlikely macros and didn't use PGO. -fguess-branch-probability is enabled by default at -O1 and higher.
https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)

I suggest rather than worry about branch prediction, profile the code and optimize the code to reduce the number of branches. One example is loop unrolling and another using boolean programming techniques rather than using if statements.
Most processors love to prefetch statements. Generally, a branch statement will generate a fault within the processor causing it to flush the prefetch queue. This is where the biggest penalty is. To reduce this penalty time, rewrite (and design) the code so that fewer branches are available. Also, some processors can conditionally execute instructions without having to branch.
I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.

SUN C Studio has some pragmas defined for this case.
#pragma rarely_called ()
This works if one part of a conditional expression is a function call or starts with a function call.
But there is no way to tag a generic if/while statement

No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.
Also, obligatory comment about premature optimization and how it's evil.
EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.

This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.
Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.
EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.