How to specify compiler optimizations in phoronix test suite? - benchmarking

I am trying to benchmark the effect of different compiler optimization levels, say gcc -O1, -O2 and -O3. The benchmark tool I am using is the Phoronix Test Suite. Does anybody know how to specify the compiler optimization level? Preferably it would also generate result graphs.

Have you tried setting the flags using the environment variables CFLAGS and CXXFLAGS?
e.g.
export CFLAGS="-O2"

Related

Why would gcc change the order of functions in a binary?

Many questions about forcing the order of functions in a binary to match the order of the source file
For example, this post, that post and others
I can't understand why gcc would want to change their order in the first place.
What could be gained from that?
Moreover, why is the default value of toplevel-reorder true?
GCC can change the order of functions because the C standard (e.g. n1570 or newer) allows it to do so.
There is no obligation for GCC to compile a C function into a single function in the sense of the ELF format. See elf(5) on Linux
In practice (with optimizations enabled: try compiling foo.c with gcc -Wall -fverbose-asm -O3 -S foo.c, then look into the emitted foo.s assembler file), the GCC compiler builds intermediate representations like GIMPLE. Many optimizations transform GIMPLE into better GIMPLE.
Once the GIMPLE representation is "good enough", the compiler transforms it to RTL.
On Linux systems, you could use dladdr(3) to find the nearest ELF function to a given address. You can also use backtrace(3) to inspect your call stack at runtime.
GCC can even remove functions entirely, in particular static functions whose calls would be inline expanded (even without any inline keyword).
I tend to believe that if you compile and link your entire program with gcc -O3 -flto -fwhole-program some non static but unused functions can be removed too....
And you can always write your own GCC plugin to change the order of functions.
If you want to guess how GCC works: download and study its source code (since it is free software) and compile it on your machine, invoke it with GCC developer options, ask questions on GCC mailing lists...
See also the bismon static source code analyzer (some work in progress which could interest you), and the DECODER project. You can contact me by email about both. You could also contribute to RefPerSys and use it to generate GCC plugins (in C++ form).
What could be gained from that?
Optimization. If the compiler thinks some code is likely to be used a lot, it may put that code in a different region than code which is not expected to execute often (or is an error path, where performance is not as important). And code which is likely to execute after, or temporally near, some other code should be placed nearby, so it is more likely to be in cache when needed.
__attribute__((hot)) and __attribute__((cold)) exist for some of the same reasons.
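As a rough illustration of those attributes, here is a minimal sketch in C. The function names sum_positive and report_failure are made up for this example; the attributes are hints, and where GCC actually places the code depends on the version and options used.

```c
#include <stdio.h>

/* Hypothetical error path: marked cold so the compiler may optimize it for
   size and place it away from frequently executed code. */
__attribute__((cold, noinline))
static int report_failure(const char *what)
{
    fprintf(stderr, "error: %s\n", what);
    return -1;
}

/* Hypothetical inner-loop routine: marked hot so the compiler may optimize it
   more aggressively and group it with other hot functions. */
__attribute__((hot))
int sum_positive(const int *values, int count, long *out)
{
    long total = 0;
    for (int i = 0; i < count; ++i) {
        if (values[i] < 0)
            return report_failure("negative input"); /* rare path */
        total += values[i];
    }
    *out = total;
    return 0;
}
```

Comparing the symbol order reported by nm or objdump with and without the attributes is one way to see whether they had any effect on layout.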
why is the default value of toplevel-reorder true?
Because 99% of developers are not bothered by this default, and it makes programs faster. The 1% of developers who need to care about ordering use the attributes, profile-guided optimization or other features which are likely to conflict with no-toplevel-reorder anyway.

Intel c++ - optimizer messages

I wonder if it's possible to make the Intel C++ compiler (or other compilers such as gcc or clang) display some messages from the optimizer. I would like to know what exactly the optimizer did with my code. By default the compiler prints only very basic things, like unused variables. A very simple example: I want to know that the expression
float x = 1.0f/2;
will be evaluated into:
float x = 0.5f;
and there will be no division in code (I know that in this case it's always true, but this is just an example). More advanced example could be loop unroll or operations reorder.
Thanks in advance.
For icc and icpc, you can use the -opt-report -opt-report-level max set of flags.
You can also specify an opt-report file. See here for more details
An optimizing compiler (like GCC, when asked to optimize with -O1 or -O2 etc...) is essentially transforming internal representations of your source code.
If you want to see some of the internal GCC representations, you could pass -fdump-tree-all to GCC. Beware, you'll get hundreds of dump files.
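To keep the number of dump files manageable, one option is to dump only the final GIMPLE with -fdump-tree-optimized. The sketch below assumes a file named fold.c (a made-up name) compiled with gcc -O2 -fdump-tree-optimized -c fold.c; the dump should then appear next to the source in a file whose name ends in .optimized. Newer GCC versions also accept -fopt-info to print short notes about optimizations such as vectorization.

```c
/* fold.c: hypothetical file name used for illustration. */

/* Both operands of the division are constants, so the compiler is expected
   to fold the expression to 0.5f at compile time. */
float half(void)
{
    float x = 1.0f / 2;
    return x;
}

/* A simple loop that the optimizer may unroll, vectorize, or replace
   entirely at higher optimization levels. */
int sum_to(int n)
{
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```

In the dump, half() would be expected to return a constant with no division left, while the shape of sum_to() depends on the optimization level.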
You could also use the MELT probe: MELT is a domain specific language (and plugin implementation) to extend GCC, and it has a probe mode to interactively show some of the internal (notably Gimple) representations.
The optimization you describe at the top of the post is (somewhat strangely) part of icc -fno-prec-div (which is a default which you might be overriding).

What's the proper way to use different versions of SSE intrinsics in GCC?

I will ask my question by giving an example. Now I have a function called do_something().
It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it will detect the CPU feature (see if it supports SSE3 or SSE4) and call one of the three versions accordingly.
The problem is: When I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included).
However, if I set -msse4, then gcc is allowed to use SSE4 instructions, and some intrinsics in do_something_sse3() are also translated to SSE4 instructions. So if my program runs on a CPU that has only SSE3 (but no SSE4) support, it causes an "illegal instruction" error when it calls do_something_sse3().
Maybe I have some bad practice. Could you give some suggestions? Thanks.
I think that Mystical's tip is fine, but if you really want to do it in one file, you can use the proper pragmas, for instance:
#pragma GCC target("sse4.1")
GCC 4.4 is needed, AFAIR.
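A per-function alternative to the pragma is the target attribute. The following is a minimal sketch, assuming GCC 4.9 or newer (or a recent Clang) on x86; the names max_i32, max_i32_sse41 and max_i32_scalar are invented for this example, and __builtin_cpu_supports stands in for whatever CPU detection the program already has.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

/* Only this function is compiled with SSE4.1 enabled; the rest of the file
   keeps the baseline instruction set. Requires n >= 1. */
__attribute__((target("sse4.1")))
static int max_i32_sse41(const int *v, size_t n)
{
    size_t i = 0;
    int best = v[0];
    if (n >= 4) {
        __m128i acc = _mm_loadu_si128((const __m128i *)v);
        for (i = 4; i + 4 <= n; i += 4)
            acc = _mm_max_epi32(acc, _mm_loadu_si128((const __m128i *)(v + i)));
        int lanes[4];
        _mm_storeu_si128((__m128i *)lanes, acc);
        best = lanes[0];
        for (int k = 1; k < 4; ++k)
            if (lanes[k] > best)
                best = lanes[k];
    }
    for (; i < n; ++i) /* leftover elements */
        if (v[i] > best)
            best = v[i];
    return best;
}
#endif

/* Portable fallback. Requires n >= 1. */
static int max_i32_scalar(const int *v, size_t n)
{
    int best = v[0];
    for (size_t i = 1; i < n; ++i)
        if (v[i] > best)
            best = v[i];
    return best;
}

/* Dispatcher: picks an implementation at run time. */
int max_i32(const int *v, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.1"))
        return max_i32_sse41(v, n);
#endif
    return max_i32_scalar(v, n);
}
```

Because the SSE4.1 code is confined to one function, the compiler has no licence to emit SSE4.1 instructions in the fallback path, which is the problem described in the question.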
I think you want to build what's called a "CPU dispatcher". I got one working (as far as I know) for GCC but have not got it to work with Visual Studio.
cpu dispatcher for visual studio for AVX and SSE
I would check out Agner Fog's vectorclass and the file dispatch_example.cpp
http://www.agner.org/optimize/#vectorclass
g++ -O3 -msse2 -c dispatch_example.cpp -o d2.o
g++ -O3 -msse4.1 -c dispatch_example.cpp -o d5.o
g++ -O3 -mavx -c dispatch_example.cpp -o d8.o
g++ -O3 -msse2 instrset_detect.cpp d2.o d5.o d8.o
Here is an example of compiling a separate object file for each optimization setting:
http://notabs.org/lfsr/software/index.htm
But even this method fails when gcc link time optimization (-flto) is used. So how can a single executable be built with full optimization for different processors? The only solution I can find is to use include directives to make the C files behave as a single compilation unit so that -flto is not needed. Here is an example using that method:
http://notabs.org/blcutil/index.htm
If you are using GCC 4.9 or above on an i686 or x86_64 machine, then you are supposed to be able to use intrinsics regardless of your -march=XXX and -mXXX options. You could write your do_something() accordingly:
void do_something()
{
    byte temp[18];
    // Test for the most capable instruction set first: every CPU with SSSE3
    // also has SSE2, so testing SSE2 first would make the SSSE3 branch unreachable.
    if (HasSSSE3())
    {
        const __m128i MASK = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(temp),
            _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)(ptr)), MASK));
    }
    else if (HasSSE2())
    {
        const __m128i i = _mm_loadu_si128((const __m128i*)(ptr));
        ...
    }
    else
    {
        // Do the byte swap/endian reversal manually
        ...
    }
}
You have to supply HasSSE2(), HasSSSE3() and friends. Also see Intrinsics for CPUID like informations?.
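One possible way to supply such helpers with GCC or Clang on x86 is the __builtin_cpu_supports builtin. This is only a sketch under that assumption, not the implementation the answer had in mind; on other compilers or architectures the helpers below simply report that the feature is missing.

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define X86_GNU_BUILTINS 1
#else
#define X86_GNU_BUILTINS 0
#endif

/* Returns true when the running CPU reports SSE2 support. */
bool HasSSE2(void)
{
#if X86_GNU_BUILTINS
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false; /* assumption: treat unknown platforms as lacking the feature */
#endif
}

/* Returns true when the running CPU reports SSSE3 support. */
bool HasSSSE3(void)
{
#if X86_GNU_BUILTINS
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;
#endif
}
```

Since every CPU with SSSE3 also has SSE2, a dispatcher should query the more capable feature first.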
Also see GCC Issue 57202 - Please make the intrinsics headers like immintrin.h be usable without compiler flags. But I don't believe the feature works. I regularly encounter compile failures because GCC does not make intrinsics available.

Function specific optimization in GCC 4.4.3

In reference to my earlier question here, I found a possible bug in GCC 4.4.3: it does not support the following pragmas in the source code for optimization (although the documentation says they are supported from 4.4.x onwards!)
#pragma GCC optimize ("O3")
__attribute__((optimize("O3")))
I tried both options above, but both produced an internal compiler error at compile time (see the error message snapshot posted in the link mentioned above).
Now are there any further options for me to enable different optimization levels for different functions in my C code?
From the online docs:
Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix.
So, if you want the equivalent of the command line -O3, you should probably use just the number 3 instead of "O3".
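A minimal sketch of the numeric form follows. The macro OPT_LEVEL and the functions dot_fast and dot_plain are made up for this example; the attribute is GCC-specific, so the macro expands to nothing elsewhere. Note that the GCC documentation describes the optimize attribute as intended for debugging rather than production code.

```c
/* The optimize attribute is a GCC extension; other compilers get an empty
   macro and use the command-line optimization level instead. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_LEVEL(n) __attribute__((optimize(n)))
#else
#define OPT_LEVEL(n)
#endif

/* Requested at the equivalent of -O3, using a number rather than "O3". */
OPT_LEVEL(3)
long dot_fast(const int *a, const int *b, int n)
{
    long acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (long)a[i] * b[i];
    return acc;
}

/* Same computation requested at the equivalent of -O0, e.g. for debugging. */
OPT_LEVEL(0)
long dot_plain(const int *a, const int *b, int n)
{
    long acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (long)a[i] * b[i];
    return acc;
}
```

Both functions must return the same result; only the generated code should differ, which can be checked by compiling with -S and comparing the two bodies.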
I agree that this is a bug and should not generate an ICE, consider reporting it along with a small test case to the GCC guys.
Now are there any further options for me to enable different optimization levels for different functions in my C code?
Your remaining option is to place the functions in their own .c file and compile that .c file with the optimization flag you want.

safe, fast CFLAGS for mex functions in matlab

I am converting a number of low-level operations from native matlab code into C/mex code, with great speedups. (These low-level operations can be done vectorized in .m code, but I think I get memory hits b/c of large data. whatever.) I have noticed that compiling the mex code with different CFLAGS can cause mild improvements. For example CFLAGS = -O3 -ffast-math does indeed give some speedups, at the cost of mild numerical inaccuracy.
My question: what are the "best" CFLAGS to use, without incurring too many other side effects? It seems that, at the very least, the flags in
CFLAGS = -O3 -fno-math-errno -fno-unsafe-math-optimizations -fno-trapping-math -fno-signaling-nans are all OK. I'm not sure about -funroll-loops.
Also, how would you optimize the set of CFLAGS used, semi-automatically, without going nuts?
If you know the target CPU...or are at least willing to guarantee a "minimum" CPU...you should definitely look into -mcpu and -march.
The performance gain can be significant.
Whatever ATLAS uses on your machine (http://math-atlas.sourceforge.net/) is probably a good starting point. I don't know that ATLAS automatically optimizes specific compiler flags, but the developers have probably spent a fair amount of time doing so by hand.
