What's the difference between -mtune=i486 and -march=i486?

I have some old code that needs to be compiled with the -m486 flag in GCC, but that flag no longer exists. Then I found -mtune=i486 and -march=i486.
I have read this page, but I still don't know which one is the right replacement for -m486.

The -march option defines the list of instructions that may be used, the -mtune option modifies the optimization process afterwards.
You would normally use -march to specify the minimum requirements, and -mtune to optimize for what the majority of users have.
For example, the IA-32 architecture defines various instructions for string handling and instruction repetition (the rep prefixes). On the 386 and 486, these are faster and smaller than equivalent explicit assembly because the instruction fetch and decode stages can be skipped, while on newer models these instructions clog up the instruction pipeline: each processing step depends directly on the previous one, so the CPU's parallel execution capability goes to waste.
Linux distributions typically use -march=i486 -mtune=i686 to ensure that you can still install and run on a 486, but as the majority of users have modern CPUs, the focus is on making it run optimally for these.
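For example, to build in the same spirit (the -m32 flag assumes an x86-64 host with 32-bit multilib support, and the file names are placeholders):
gcc -m32 -march=i486 -mtune=i686 -O2 -o prog prog.c
This produces code that still runs on a 486 but is scheduled with a 686-class CPU in mind.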

Related

Is Eigen NEON backend optimized to take advantage of the 2x128b NEON execution units which exist starting from ARM A76?

Going over the Eigen documentation, it's not clear whether it has been updated since the release of the A76 CPU core to take advantage of the wider SIMD it contains (2x128b vs. the previous 128b).
I am hoping someone from the development team (or an expert user) can help clarify that.
I'm not familiar with Eigen in particular, but in general one doesn't need to do much to SIMD code to take advantage of a different number of hardware execution units; especially when the CPU supports out-of-order execution, it will pick up more instructions that can be executed in parallel whenever more execution units are available.
When compiling e.g. SIMD intrinsics, the compiler may be able to tune the exact scheduling of the code if told to optimize specifically for that core (and if it knows the core's scheduling characteristics). The same goes for handwritten assembly: it can be tuned and tweaked a bit for different cores' characteristics, but in most cases it doesn't change very dramatically; more capable cores will simply execute it faster.
(The factor that primarily affects the bigger picture of how the code is written, which would require a proper rewrite to take advantage of, is usually the number of registers available in the instruction set - but that doesn't change with a hardware implementation with more execution units.)
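As a rough illustration (this is generic NEON intrinsics code, not Eigen internals): the source below is oblivious to how many 128-bit NEON pipes the core has. A Cortex-A76 simply executes more of the independent iterations per cycle, and at most a flag like -mcpu=cortex-a76 lets the compiler fine-tune instruction scheduling for that core.

#include <arm_neon.h>

/* Add two float arrays, 4 lanes at a time (n assumed to be a multiple of 4
   for brevity). Nothing here depends on the width of the hardware execution
   units; an out-of-order core with more NEON pipes simply runs more of these
   independent iterations in parallel. */
void vec_add(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
}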

Tell branch predictor which way to go every n iterations [duplicate]

Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box is fine.
Basically, I have an if statement that will evaluate to true 99% of the time, and I am trying to eke out every last clock of performance. Can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?
Yes, but it will have no effect. The exception is the older (obsolete) Netburst architecture, and even there it doesn't do anything measurable.
There is a "branch hint" opcode that Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted not taken) on some older architectures. GCC exposes this with __builtin_expect(x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architectures (>= Core 2). The small corner case where this actually does something is a cold jump on the old Netburst architecture. Intel now recommends not using the static branch hints, probably because they consider the increase in code size more harmful than the possible marginal speedup.
Besides the useless branch hint for the predictor, __builtin_expect has its uses: the compiler may reorder the code to improve cache usage or save memory.
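For completeness, __builtin_expect is usually wrapped in macros like these (the likely/unlikely names are just a widespread convention, e.g. in the Linux kernel, not something GCC itself provides):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The hint mainly influences code layout: the expected path becomes the
   straight-line fall-through, the unexpected path is moved out of the way. */
int process(int x)
{
    if (unlikely(x < 0))   /* error path, expected to be taken almost never */
        return -1;
    return x * 2;          /* fast path */
}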
There are multiple reasons it doesn't work as expected.
The processor can predict small loops (n<64) perfectly.
The processor can predict small repeating patterns (n~7) perfectly.
The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
The predictability (= the probability that a branch will be predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of a branch is notoriously hard.
Read more about the inner workings of branch prediction in Agner Fog's manuals.
See also the gcc mailing list.
Yes. http://kerneltrap.org/node/4705
The __builtin_expect is a method that gcc (versions >= 2.96) offer for programmers to indicate branch prediction information to the compiler. The return value of __builtin_expect is the first argument (which could only be an integer) passed to it.
if (__builtin_expect (x, 0))
    foo ();
[This] would indicate that we do not expect to call `foo', since we expect `x' to be zero.
Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html, and Section 3.5 of Agner Fog's excellent asm optimization guide from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.
Earlier and later x86 CPUs silently ignore those prefix bytes. "Are there any performance test results for usage of likely/unlikely hints?" mentions that PowerPC has some jump instructions which carry a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to the hardware to figure it out.
Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).
What is documented is, for example, that Core2 has a branch history buffer that can avoid mispredicting the loop exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64 bytes (or 19 uops on Penryn) won't have instruction-fetch bottlenecks, since it replays from a buffer... go read Agner Fog's PDFs, they're excellent.
See also "Why did Intel change the static branch prediction mechanism over these years?": Intel since Sandybridge doesn't use static prediction at all, as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is that forward branches are not taken and backward branches are taken, because backward branches are often loop branches.)
The effect of likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) does not directly insert BP hints into the asm. (It might possibly do so with gcc -march=pentium4, but not when compiling for anything else).
The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)
See "What is the advantage of GCC's __builtin_expect in if else statements?" for a specific example of code-gen.
Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).
Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went, and uses it for code-layout decisions and to identify hot vs. cold blocks/functions (e.g. it will unroll loops in hot functions but not in cold ones). See -fprofile-generate and -fprofile-use in the GCC manual, and "How to use profile guided optimizations in g++?".
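A minimal sketch of that two-step workflow (the program and input names are placeholders):

gcc -O2 -fprofile-generate -o myprog myprog.c   # build an instrumented binary
./myprog representative_input                   # run on typical data to collect .gcda profiles
gcc -O2 -fprofile-use -o myprog myprog.c        # rebuild, letting GCC use the profile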
Otherwise, if you didn't use likely/unlikely macros and didn't use PGO, GCC has to guess using various heuristics. -fguess-branch-probability is enabled by default at -O1 and higher.
https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)
I suggest that rather than worrying about branch prediction, you profile the code and optimize it to reduce the number of branches. One example is loop unrolling; another is using boolean (branchless) programming techniques rather than if statements, as in the sketch below.
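For instance, a bit of mask arithmetic (or a conditional move) can replace a hard-to-predict branch; a hypothetical branch-free max() might look like this:

/* Branch-free maximum of two ints: -(a > b) is all ones when a > b and
   zero otherwise, so the mask selects either a or b without a jump. */
int max_branchless(int a, int b)
{
    int m = -(a > b);
    return (a & m) | (b & ~m);
}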
Most processors love to prefetch instructions. Generally, a mispredicted branch forces the processor to flush its prefetch queue, and this is where the biggest penalty is. To reduce this penalty, rewrite (and design) the code so that there are fewer branches. Also, some processors can conditionally execute instructions without having to branch.
I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.
Sun Studio C has some pragmas defined for this case.
#pragma rarely_called ()
This works if one part of a conditional expression is a function call or starts with a function call.
But there is no way to tag a generic if/while statement.
No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.
Also, obligatory comment about premature optimization and how it's evil.
EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.
This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.
Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.
EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.

Is there something like x86 cpuid() available for PowerPC?

I'd like to write some C code to be able to query processor attributes on PowerPC, much like one can do with cpuid on x86. I'm after things like brand, model, stepping, SIMD width, and available operations, so that the code can confirm at run time that it is running on a compatible platform before something blows up.
Is there a general mechanism for doing this on PowerPC? If so, where can one read about it?
Note that PowerPC does not have dozens of extensions/features like x86. You have to read specific privileged registers, which may differ between cores.
I checked on Linux: you can access the PVR; there is a trap in the kernel to manage that.
Reading /proc/cpuinfo can tell you whether AltiVec is supported, the memory and L2 cache size, and so on, but that is not really convenient.
A better solution is described here:
http://www.freehackers.org/thomas/2011/05/13/how-to-detect-altivec-availability-on-linuxppc-at-runtime/
That approach uses the content of /proc/self/auxv, which provides "the ELF interpreter information passed to the process at exec time".
The example is about AltiVec, but you can get other features (listed in asm/cputable.h): 32- or 64-bit CPU, AltiVec, SPE, FPU, MMU, 4xx MAC, ...
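For reference, on reasonably recent glibc (>= 2.16) the same information can be queried without parsing /proc/self/auxv by hand. A minimal PowerPC-Linux-specific sketch:

#include <stdio.h>
#include <sys/auxv.h>       /* getauxval() */
#include <asm/cputable.h>   /* PPC_FEATURE_* bits (PowerPC Linux only) */

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);   /* same data as AT_HWCAP in /proc/self/auxv */

    printf("AltiVec: %s\n", (hwcap & PPC_FEATURE_HAS_ALTIVEC) ? "yes" : "no");
    printf("64-bit : %s\n", (hwcap & PPC_FEATURE_64)          ? "yes" : "no");
    return 0;
}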
Finally, you will find information on the caches (size, line size, associativity, ...) by looking at the files in:
/sys/devices/system/cpu/cpu0/cache
PowerPC doesn't have an analogue to the CPUID instruction. The closest you can get is to read the PVR (processor version register). This is a supervisor-privileged SPR, though. However, some operating systems, FreeBSD for example, will trap and execute that for user space processes.
The PVR is read-only, and should be unique for any given processor model and revision. Given this, you can ascertain what features are provided by a given CPU.
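As a sketch (assuming the OS traps and emulates the privileged mfspr for user space, as described above), reading the PVR with GCC inline assembly might look like this:

#include <stdio.h>

int main(void)
{
    unsigned long pvr;
    /* mfspr from SPR 287 (the PVR). This is a privileged instruction, but
       some OSes (e.g. FreeBSD, Linux) trap and emulate it for user space. */
    __asm__("mfspr %0, 287" : "=r"(pvr));
    printf("PVR = 0x%08lx\n", pvr);
    return 0;
}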

Will it be fine if we use the -msse and -msse2 options of GCC on RTOS devices?

As far as I know, the -msse and -msse2 options of GCC improve performance by performing arithmetic operations faster. I have also read somewhere that they use more resources such as registers and cache memory.
What about the performance if we run an executable generated with these options on RTOS devices (like a VxWorks board)?
The OS must support SSE(2) instructions for your application to work correctly. It would seem, from googling, that VxWorks supports this (and it's not really that hard: all it takes is that the OS has a 512-byte save area per task that uses SSE/SSE2; given the right circumstances it can be allocated on demand, but it's often easier to just allocate it for all tasks). Saving/restoring the SSE registers is done "on demand", that is, only when a task different from the previous one to use SSE executes SSE instructions is it necessary to save the registers. The OS uses a special interrupt (trap) to notice that "a new task is trying to use SSE instructions".
So, as long as the processor supports it, you should be fine.
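If you want to be defensive, recent GCC (4.8 and later, x86 only) also lets you verify at run time that the processor actually has SSE2 before taking an SSE2 code path; a small sketch:

#include <stdio.h>

int main(void)
{
    /* GCC builtin: queries CPUID at run time (x86 targets, GCC >= 4.8). */
    if (__builtin_cpu_supports("sse2"))
        puts("SSE2 is available");
    else
        puts("SSE2 is NOT available; avoid SSE2 code paths");
    return 0;
}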
I may not be able to directly answer your question, but here are a couple things I do know that may be of use:
SSE, SSE2, etc. must be supported/implemented by the processor for them to have any effect in the first place.
There are specific functions you can call that use these extended instructions for mathematical operations. These functions operate on wider data types or perform an operation on a set of data efficiently.
Enabling the options in GCC may make it use the previously mentioned APIs/builtins automatically. This is the part I am unsure about.


Resources