I am using the Intel C Compiler for the IA-32 architecture.
When I compile my C program with the following options:
icl mytest.c /openmp /QxHost /fp:fast /fast
The test run takes 3.3s. Now I tried to use PGO, so I compiled with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-gen
I then run the executable with my sample input 2-3 times and compile again with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-use
I hoped it would take the collected information into account. It does in fact tell me it is using the .dyn files, but the resulting executable is slower (3.85s) than the one built without /Qprof-use, and this is on exactly the same data the profiling runs were performed on (which should be ideal for PGO).
I tried setting the OpenMP thread count to one, thinking the threading might mess with the .dyn output, but the result is the same: it's slower than the plain compilation.
My question is: is this even theoretically possible, or am I messing up the PGO process somehow with the compiler options?
A 3.3-second floating-point application isn't going to see much benefit from profile-guided optimization. At a guess, you're doing some sort of raw data crunching, which, if you need raw FLOPs, is better suited to hand-coded assembly than to PGO.
PGO will not tell the compiler how to optimize your inner loop to remove branch delays and keep the pipeline full. What it can tell the compiler is, for example, that a loop is likely to run only 5,000 times, or that a test on your floats usually goes one way.
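To give a feel for the kind of information PGO does collect (mostly branch probabilities and loop trip counts), here is a small illustrative C sketch; the function and the 95% figure are made up. __builtin_expect is the GCC-style way of hand-writing the same hint that a profile would otherwise supply:

/* Hypothetical example: a profiled run would record that the branch below
 * is almost always taken, and the compiler would lay out the code accordingly.
 * __builtin_expect expresses the same hint by hand. */
#include <stddef.h>

double sum_positive(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* taken ~95% of the time in the training runs (made-up number) */
        if (__builtin_expect(v[i] > 0.0, 1))
            s += v[i];
    }
    return s;
}

PGO gathers that kind of hint automatically from the training runs.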
PGO is meant to be fed data that is statistically representative of the data you actually want the program to run on. In other words, you profile with typical inputs so that the program runs other, similar data at a good clip. It doesn't necessarily optimize for the run at hand and, as you saw, may even slow it down a bit in exchange for a possible net gain.
It really depends on your program but an OpenMP FP app is not what PGO is for. Like everything else it isn't a "magic bullet."
Related
I have an assignment where I need to write a benchmark program to test the performance of any processor with two sorting algorithms (an iterative one and a recursive one). My teacher told me I have to create three different programs (that is, 3 .c files): two with each sorting algorithm (both of them have to read integers separated by \n from a text file and write the same numbers, sorted, to another text file), plus a benchmarking program. In the benchmark program I need to calculate the MIPS (millions of instructions per second) with the formula MIPS = NI / (T * 10^6), where NI is the number of instructions and T is the time required to execute those instructions. I have to be able to estimate the time each algorithm will take on any processor by calculating its MIPS and then solving that equation for T: EstimatedTime = NI / (MIPS * 10^6).
My question is... how exactly do I measure the performance of one program from another program? I have never done anything like that. I mean, I guess I can use the time functions in C and measure the time it takes to execute a given chunk of code, but I can only do that if all three functions (the two sorting algorithms and the benchmark function) are in the same program. I don't even know how to start.
Oh, and by the way, I have to calculate the number of instructions by cross-compiling the sorting algorithms from C to MIPS assembly and counting how many instructions are used.
Any guidelines would be appreciated... I currently have these functions:
readfile (to read text files with ints on them)
writefile
sorting algorithms
On a Linux system, you can use hardware performance counters: perf stat ./a.out and get an accurate count of cycles, instructions, cache misses, and branch mispredicts. (other counters available, too, but those are the default ones).
This gives you the dynamic instruction count, counting instructions inside loops the number of times they actually ran.
Cross-compiling for MIPS and counting instructions would easily give you a static instruction count, but turning that into a dynamic count would require actually following how the asm works to figure out how many times each loop runs.
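To connect the two halves of the assignment, here is a rough C sketch of how the benchmark program could measure T and compute MIPS from an instruction count; sort_iterative and NI_ITERATIVE are placeholder names, and the count would come from your MIPS assembly listing or from something like perf stat -e instructions ./sort_iterative:

/* benchmark.c - rough sketch only, not the assignment's required structure */
#include <stdio.h>
#include <time.h>

void sort_iterative(int *a, int n);          /* provided by one of your .c files */
#define NI_ITERATIVE 1.25e9                  /* made-up instruction count (NI) */

static double time_sort(int *a, int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sort_iterative(a, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    static int data[100000];                 /* fill this from your readfile() */
    double T = time_sort(data, 100000);
    double mips = NI_ITERATIVE / (T * 1e6);  /* MIPS = NI / (T * 10^6) */
    printf("T = %.3f s, MIPS = %.1f\n", T, mips);
    printf("Estimated time on a 1000-MIPS machine: %.3f s\n",
           NI_ITERATIVE / (1000.0 * 1e6));   /* EstimatedTime = NI / (MIPS * 10^6) */
    return 0;
}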
How you compile the several files and link them together depends on the compiler. With GCC for example it could be something as simple as
gcc -O3 -g3 -W -Wall -Wextra main.c sortalgo_1.c sortalgo_2.c [...] sortalgo_n.c -o sortingbenchmark
It's not the most common way to do it, but good enough for this assignment.
If you want to count the opcodes, it is probably better to compile the individual C files to assembly. Do the following for every C file whose assembler output you want to analyze:
gcc -c -S sortalgo_n.c
Don't forget to put your function declarations into a common header file and include it everywhere you use them!
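If you go that route, a rough way to get a static instruction count out of the generated .s file is to count the indented lines that start with a lowercase mnemonic, skipping labels and assembler directives; it is only an approximation, but good enough to get started:

grep -cE '^[[:space:]]+[a-z]' sortalgo_n.s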
For benchmarking: you know how many assembly operations each C operation translates to, and you can (although it isn't easy) map that count onto every line of the C code. If you have that, all you have to do is increment a counter. E.g. if a line of C code translates to 123 assembly opcodes, you increment the counter by 123.
You can use one global variable for this. If you use more than one thread per sorting algorithm, you need to make sure the additions are atomic (use _Atomic, mutexes, or whatever your OS/compiler/libraries offer).
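A minimal sketch of that counting scheme, using a made-up bubble sort and made-up per-line opcode counts (in reality you would take the counts from your MIPS assembly listing):

#include <stdatomic.h>

/* one global dynamic-instruction counter shared by all threads */
static atomic_ullong instr_count;

void bubble_sort(int *a, int n)
{
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - 1 - i; j++) {
            atomic_fetch_add(&instr_count, 7);     /* made-up: compare + branch = 7 opcodes */
            if (a[j] > a[j + 1]) {
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                atomic_fetch_add(&instr_count, 6); /* made-up: swap = 6 opcodes */
            }
        }
    }
}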
BTW: this looks like a very exact way to measure the runtime, but in the real world not every assembly opcode takes the same number of CPU cycles. No need to worry about that today, but keep it in mind for tomorrow.
Background
I have an EP (embarrassingly parallel) C application running four threads on my laptop, which contains an Intel i5 M 480 running at 2.67 GHz. This CPU has two hyperthreaded cores.
The four threads execute the same code on different subsets of data. The code and data have no problems fitting in a few cache lines (fit entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers and does a few memory accesses (outside L1) to write results on completion of the sequence.
The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads completing faster than two. -O2 and higher run slower (two threads complete faster than four, but slower than at -O1), as does -Os. Every thread does on average 3.37 million sequences per second, which comes out to about 780 clock cycles each. On average every sequence performs 25.5 sub-operations, or one per 30.6 cycles.
So what two hyperthreads do in parallel in 30.6 cycles, one thread will do sequentially in 35-40 cycles, or 17.5-20 cycles each.
Where I am
I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.
These switches work fairly well (when compiling module by module)
-O1 -m64 -mthreads -g -Wall -c -fschedule-insns
as do these when compiling one module which #includes all the others
-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program
There is no discernible performance difference between the two.
Question
Has anyone experimented with this and achieved good results?
You say "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.". That's rather misguided.
Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.
What you want is to maximise the percentage of resources that are used. Efficient code will use these resources more efficiently in the first place, and adding hyper threading can only help. You won't get that much of a speedup through hyper threading, but that is because you got the speedup already in single threaded code because it was more efficient. If you want bragging rights that hyper threading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.
Now if your code was limited by latencies, it means it could perform quite a few useless instructions without penalty. With hyper threading, these useless instructions actually cost. So for hyper threading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single threaded code.
You could try locking each thread to a core using processor affinity. I've heard this can give a 15%-50% efficiency improvement with some code. The saving comes from the fact that when a context switch happens, less changes in the caches and so on.
This will work better on a machine that is just running your app.
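Since the question mentions mingw64, a Windows-flavoured sketch of pinning the calling thread to one logical core could look like this (on Linux you would use pthread_setaffinity_np instead; which logical cores share a physical core is machine-specific):

#include <windows.h>

/* Pin the calling thread to a single logical core (0-based index).
 * Returns nonzero on success. */
static int pin_to_core(int core)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}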
It's possible for hyperthreading to be counterproductive, and in fact it often is with computationally intensive loads.
I would give a try to:
disable it at the BIOS level and run two threads
try to optimize and use the SSE/AVX vector extensions, possibly even by hand (see the sketch at the end of this answer)
Explanation: HT is useful because hardware threads are scheduled more efficiently than software threads. However, there is overhead in both. Scheduling 2 threads is more lightweight than scheduling 4, and if your code is already "dense", I'd try to go for even "denser" execution, optimizing the execution on the 2 pipelines as much as possible.
It's clear that if you optimize less, it scales better, but it will hardly be faster. So if you are looking for more scalability, this answer is not for you; if you are looking for more speed, give it a try.
As others have already stated, there is no general solution when optimizing, otherwise that solution would already be built into the compilers.
You could also download an OpenCL or CUDA toolkit and implement a version for your graphics card; you may be able to speed it up 100-fold with little effort.
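For the hand-vectorization suggestion above, a minimal SSE intrinsics sketch (purely illustrative; your real inner loop will of course look different, and n is assumed to be a multiple of 4):

#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats per iteration */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}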
Does the -march flag in compilers (for example gcc) really matter?
Would it be faster if I compiled all my programs and the kernel using -march=my_architecture instead of -march=i686?
Yes it does, though the differences are only sometimes relevant. They can be quite big however if your code can be vectorized to use SSE or other extended instruction sets which are available on one architecture but not on the other. And of course the difference between 32 and 64 bit can (but need not always) be noticeable (that's -m64 if you consider it a type of -march parameter).
As anecdotal evidence, a few years back I ran into a funny bug in gcc where a particular piece of code run on a Pentium 4 was about 2 times slower when compiled with -march=pentium4 than when compiled with -march=pentium2.
So: often there is no difference, sometimes there is, and sometimes it's the other way around from what you expect. As always: measure before you decide to use any optimization that goes beyond the "safe" range (e.g. using your actual exact CPU model instead of a more generic one).
There is no guarantee that code you compile with -march will be faster or slower than the other version. It really depends on the kind of code, and the actual result can only be obtained by measurement. E.g. if your code has a lot of potential for vectorization, the results might differ with and without -march. On the other hand, the compiler sometimes does a poor job of vectorization, and that can result in slower code when compiling for a specific architecture.
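One way to see for yourself whether -march changes anything for a given file is to compare the assembly gcc emits with and without it (the file name is hypothetical; if your gcc targets 64-bit, you may need -m32 for the i686 build):

gcc -O3 -march=i686 -S hotloop.c -o hotloop_i686.s
gcc -O3 -march=native -S hotloop.c -o hotloop_native.s
diff hotloop_i686.s hotloop_native.s

If the diff shows vector instructions only in the native version, the flag probably matters for that code; as both answers say, though, the only real test is measuring.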
Hi everyone, I am running gprof to check the percentage of execution time at two different optimization levels (-g -pg vs -O3 -pg).
So I got the result that one function takes 68% of the execution time in the -O3 version, but only 9% in the -g version.
I am not sure how to find out the reason behind this. I was thinking of comparing the compiled output of the two versions, but I am not sure which command to use for that.
Is there any other method to find out the reason for this difference in execution time?
You have to be careful interpreting gprof/profiling results when you're using optimization flags. Compiling with -O3 can really change the structure of your code, so that it's impossible for gprof to tell how much time is spent where.
In particular, the function inlining enabled at the higher optimization levels means that some of your functions are completely replaced by inline code, so they don't appear to take any time at all. The time that would have been spent in those child functions is then attributed to the parent functions that call them, so it can look like the time spent in a given parent function actually increased.
I couldn't find a really good reference for this. Here's one old example:
http://gcc.gnu.org/ml/gcc/1998-04/msg00591.html
That being said, I would expect this kind of strange behavior when running gprof with -O3. I always do profiling with just -O1 optimization to minimize these kinds of effects.
I think there's a fundamental flaw in your reasoning: namely, that the function taking 68% of execution time in the optimized version versus just 9% in the unoptimized version means the unoptimized version performs better.
I'm quite sure, instead, that the -O3 version performs better in absolute terms, but the optimizer did a far better job on the other functions, so, in proportion to the rest of the optimized code, the given subroutine appears slower, while it is actually faster than, or at least as fast as, the unoptimized version.
Still, to check the differences in the emitted code directly you can use the -S switch. Also, to see whether my idea is correct, you can roughly compare the CPU time taken by the function at -O0 vs -O3 by multiplying that percentage by the user time reported by a command like time (also, I'm quite sure you can obtain a measure of the absolute time spent in a subroutine from gprof; IIRC it is even in the default output).
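To make that concrete with made-up numbers: if the unoptimized build runs in 100 s of user time and gprof attributes 9% to the function, that is about 9 s spent in it; if the -O3 build runs in 10 s and attributes 68% to the same function, that is about 6.8 s. The percentage went up even though the absolute time went down, simply because everything else got faster still.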
I have been profiling a C program, and to do so I compiled it with the -p and -g flags. I was wondering what these flags actually do and what overhead they add to the binary.
Thanks
Assuming you are using GCC, you can get this kind of information from the GCC manual
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options
-p
Generate extra code to write profile information suitable for the analysis program prof. You must use this option when compiling the source files you want data about, and you must also use it when linking.
-g
Produce debugging information in the operating system's native format (stabs, COFF, XCOFF, or DWARF 2). GDB can work with this debugging information.
On most systems that use stabs format, -g enables use of extra debugging information that only GDB can use; this extra information makes debugging work better in GDB but will probably make other debuggers crash or refuse to read the program. If you want to control for certain whether to generate the extra information, use -gstabs+, -gstabs, -gxcoff+, -gxcoff, or -gvms (see below).
GCC allows you to use -g with -O. The shortcuts taken by optimized code may occasionally produce surprising results: some variables you declared may not exist at all; flow of control may briefly move where you did not expect it; some statements may not be executed because they compute constant results or their values were already at hand; some statements may execute in different places because they were moved out of loops.
Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have bugs.
-p provides information for prof, and -pg provides information for gprof.
Let's look at the latter.
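For reference, a typical gprof session with GCC looks like this (the program name is just an example):

gcc -pg -O1 -o myprog myprog.c
./myprog
gprof myprog gmon.out > report.txt

Running the instrumented binary writes gmon.out in the current directory, and gprof combines it with the executable to produce the report.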
There is a fuller explanation of how gprof works elsewhere, but let me condense it here.
When a routine B is compiled with -pg, some code is inserted at the routine's entry point that looks up which routine is calling it, say A.
Then it increments a counter saying that A called B.
Then when the code is executed, two things are happening.
The first is that those counters are being incremented.
The second is that timer interrupts are occurring, and there is a counter for each routine, saying how many of those interrupts happened when the PC was in the routine.
The timer interrupts happen at a certain rate, like 100 times per second.
Then if, for example, 676 interrupts occurred in a routine, you can tell that its "self time" was about 6.76 seconds, spread over all the calls to it.
What the call counts allow you to do is add them up to tell how many times a routine was called, so you can divide that into its total self time to estimate how much self time per call.
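For example, with a made-up call count: if those 676 samples correspond to about 6.76 seconds of self time and the counters say the routine was entered 2,000 times, that is roughly 3.4 ms of self time per call.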
Then from that you can start to estimate "cumulative time".
That's the time spent in a routine, plus time spent in the routines that it calls, and so on down to the bottom of the call tree.
This is all interesting technology, from 1982, but if your goal is to find ways to speed up your program, it has a lot of issues.