I have been profiling a C program, and to do so I compiled it with the -p and -g flags. I was wondering: what do these flags actually do, and what overhead do they add to the binary?
Assuming you are using GCC, you can get this kind of information from the GCC manual
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options
-p
Generate extra code to write profile information suitable for the
analysis program prof. You must use this option when compiling the
source files you want data about, and you must also use it when
linking.
-g
Produce debugging information in the operating system's native format
(stabs, COFF, XCOFF, or DWARF 2). GDB can work with this debugging
information.
On most systems that use stabs format, -g enables use of extra debugging information that only GDB can use; this extra information
makes debugging work better in GDB but will probably make other
debuggers crash or refuse to read the program. If you want to control
for certain whether to generate the extra information, use -gstabs+,
-gstabs, -gxcoff+, -gxcoff, or -gvms (see below).
GCC allows you to use -g with -O. The shortcuts taken by optimized code may occasionally produce surprising results: some variables you
declared may not exist at all; flow of control may briefly move where
you did not expect it; some statements may not be executed because
they compute constant results or their values were already at hand;
some statements may execute in different places because they were
moved out of loops.
Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have
bugs.
-p provides information for prof, and -pg provides information for gprof.
Let's look at the latter.
Here's an explanation of how gprof works,
but let me condense it here.
When a routine B is compiled with -pg, some code is inserted at the routine's entry point that looks up which routine is calling it, say A.
Then it increments a counter saying that A called B.
Then, when the code is executed, two things happen.
The first is that those counters are incremented.
The second is that timer interrupts are occurring, and there is a counter for each routine, saying how many of those interrupts happened when the PC was in the routine.
The timer interrupts happen at a certain rate, like 100 times per second.
Then if, for example, 676 interrupts occurred in a routine, you can tell that its "self time" was about 6.76 seconds, spread over all the calls to it.
What the call counts allow you to do is add them up to tell how many times a routine was called, so you can divide its total self time by that number to estimate the self time per call.
Then from that you can start to estimate "cumulative time".
That's the time spent in a routine, plus time spent in the routines that it calls, and so on down to the bottom of the call tree.
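For concreteness, here is a toy example of the kind of program that instrumentation is aimed at (the routine names and workload are made up); the comments show how it would be compiled and the report produced, assuming GCC and gprof are installed:

/* toy.c - a made-up program to illustrate -pg instrumentation.
 * Build and profile:
 *   gcc -pg toy.c -o toy
 *   ./toy               (writes gmon.out in the current directory)
 *   gprof toy gmon.out  (prints self time and call counts per routine)
 */
#include <stdio.h>

static double B(double x)          /* callee: its call count and self time get recorded */
{
    double s = 0.0;
    for (int i = 1; i <= 1000; i++)
        s += x / i;
    return s;
}

static double A(void)              /* caller: the A -> B arc is counted at B's entry */
{
    double s = 0.0;
    for (int i = 0; i < 100000; i++)
        s += B((double)i);
    return s;
}

int main(void)
{
    printf("%f\n", A());
    return 0;
}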
This is all interesting technology, from 1982, but if your goal is to find ways to speed up your program, it has a lot of issues.
Related
I need a better way of profiling numerical code. Assume that I'm using GCC in Cygwin on 64 bit x86 and that I'm not going to purchase a commercial tool.
The situation is this. I have a single function running in one thread. There are no code dependencies or I/O beyond memory accesses, with the possible exception of some math libraries linked in. But for the most part, it's all table look-ups, index calculations, and numerical processing. I've cache aligned all arrays on the heap and stack. Due to the complexity of the algorithm(s), loop unrolling, and long macros, the assembly listing can become quite lengthy -- thousands of instructions.
I have been resorting to either the tic/toc timer in Matlab, the time utility in the bash shell, or the time stamp counter (rdtsc) read directly around the function. The problem is this: the variance of the timing (which might be as much as 20% of the runtime) is larger than the size of the improvements I'm making, so I have no way of knowing if the code is better or worse after a change. You might think then that it's time to give up. But I would disagree. If you are persistent, many incremental improvements can lead to a two or three times performance increase.
One problem I have had multiple times that is particularly maddening is that I make a change and the performance seems to improve consistently by say 20%. The next day, the gain is lost. Now it's possible I made what I thought was an innocuous change to the code and then completely forgot about it. But I'm wondering if it's possible something else is going on. Like maybe GCC doesn't yield a 100% deterministic output as I believe it does. Or maybe it's something simpler, like the OS moved my process to a busier core.
I have considered the following, but I don't know if any of these ideas are feasible or make any sense. If yes, I would like explicit instructions on how to implement a solution. The goal is to minimize the variance of the runtime so I can meaningfully compare different versions of optimized code.
Dedicate a core of my processor to run only my routine (see the sketch after this list).
Direct control over the cache(s) (load it up or clear it out).
Ensuring my dll or executable always loads to the same place in memory. My thinking here is that maybe the set-associativity of the cache interacts with the code/data location in RAM to alter performance on each run.
Some kind of cycle accurate emulator tool (not commercial).
Is it possible to have a degree of control over context switches? Or does it even matter? My thinking is the timing of the context switches is causing variability, maybe by causing the pipeline to be flushed at an inopportune time.
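Regarding the first idea, this is roughly what I had in mind (a sketch using the Win32 affinity and priority calls from the host process; whether it removes enough of the variance is exactly what I don't know):

/* Sketch: pin the benchmark process to one core and raise its priority.
 * This only addresses the "dedicate a core" idea; it does nothing about
 * the caches or the load address. Call pin_to_core(0) once before timing. */
#include <windows.h>
#include <stdio.h>

void pin_to_core(DWORD core)
{
    /* Restrict the whole process to a single core... */
    if (!SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)1 << core))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());

    /* ...and ask the scheduler to preempt it as rarely as possible. */
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
}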
In the past I have had success on RISC architectures by counting instructions in the assembly listing. This only works, of course, if the number of instructions is small. Some compilers (like TI's Code Composer for the C67x) will give you a detailed analysis of how it's keeping the ALU busy.
I haven't found the assembly listings produced by GCC/GAS to be particularly informative. With full optimization on, code is moved all over the place. There can be multiple location directives for a single block of code dispersed about the assembly listing. Further, even if I could understand how the assembly maps back into my original code, I'm not sure there's much correlation between instruction count and performance on a modern x86 machine anyway.
I made a weak attempt at using gcov for line-by-line profiling, but due to an incompatibility between the version of GCC I built and the MinGW compiler, it wouldn't work.
One last thing you can do is average over many, many trial runs, but that takes forever.
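For reference, here is the sort of rdtsc wrapper I mean, written to keep the minimum over many short runs rather than one long average (a sketch; run_kernel is a stand-in for the real routine):

/* Sketch: time one call with RDTSCP and keep the minimum over N runs. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>            /* __rdtscp */

static volatile double sink;      /* keeps the dummy work from being optimized away */

static void run_kernel(void)      /* stand-in for the routine under test */
{
    double s = 0.0;
    for (int i = 0; i < 100000; i++)
        s += i * 0.5;
    sink = s;
}

static uint64_t time_once(void)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux); /* RDTSCP waits for earlier instructions to finish */
    run_kernel();
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {   /* the minimum filters out OS interference */
        uint64_t t = time_once();
        if (t < best)
            best = t;
    }
    printf("best: %llu cycles\n", (unsigned long long)best);
    return 0;
}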
EDIT (RE: Call Stack Sampling)
The first question I have is, practically, how do I do this? In one of your PowerPoint slides, you showed using Visual Studio to pause the program. What I have is a DLL compiled by GCC with full optimizations in Cygwin. This is then called by a mex DLL compiled by Matlab using the VS2013 compiler.
The reason I use Matlab is because I can easily experiment with different parameters and visualize the results without having to write or compile any low level code. Further, I can compare my optimized DLL to the high level Matlab code to ensure my optimizations have not broken anything.
The reason I use GCC is that I have a lot more experience with it than with Microsoft's compiler. I'm familiar with many flags and extensions. Further, Microsoft has been reluctant, at least in the past, to maintain and update the native C compiler (C99). Finally, I've seen GCC kick the pants off commercial compilers, and I've looked at the assembly listing to see how it's actually done. So I have some intuition of how the compiler actually thinks.
Now, with regards to making guesses about what to fix. This isn't really the issue; it's more like making guesses about how to fix it. In this example, as is often the case in numerical algorithms, there is really no I/O (excluding memory). There are no function calls. There's virtually no abstraction at all. It's like I'm sitting on top of a piece of saran wrap. I can see the computer architecture below, and there's really nothing in-between. If I re-rolled up all the loops, I could probably fit the code on about one page or so, and I could almost count the resultant assembly instructions. Then I could do a rough comparison to the theoretical number of operations a single core is capable of doing to see how close to optimal I am. The trouble then is I lose the auto-vectorization and instruction level parallelization I got from unrolling. Unrolled, the assembly listing is too long to analyze in this way.
The point is that there really isn't much to this code. However, due to the incredible complexity of the compiler and modern computer architecture, there is quite a bit of optimization to be had even at this level. But I don't know how small changes are going to affect the output of the compiled code. Let me give a couple of examples.
This first one is somewhat vague, but I'm sure I've seen it happen a few times. You make a small change and get a 10% improvement. You make another small change and get another 10% improvement. You undo the first change and get another 10% improvement. Huh? Compiler optimizations are neither linear nor monotonic. It's possible the second change required an additional register, which broke the first change by forcing the compiler to alter its register allocation. Maybe the second optimization somehow occluded the compiler's ability to do other optimizations, which was restored by undoing the first one. Who knows? Unless the compiler is introspective enough to dump its full analysis at every level of abstraction, you'll never really know how you ended up with the final assembly.
Here is a more specific example which happened to me recently. I was hand-coding AVX intrinsics to speed up a filter operation. I thought I could unroll the outer loop to increase instruction-level parallelism. So I did, and the result was that the code was twice as slow: there were not enough 256-bit registers to go around, so the compiler was temporarily saving results on the stack, which killed performance.
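To give a flavor of the shape of the code, here is a made-up version of the idea, not my actual filter (compile with -mavx2 -mfma or -march=native; each input row needs at least n + TAPS - 1 elements):

/* Made-up AVX/FMA filter sketch illustrating the register-pressure problem. */
#include <immintrin.h>

#define TAPS 16

/* Outer loop unrolled by 4: four live 256-bit accumulators, plus the
 * coefficient broadcast and four loads per tap. x86-64 only has sixteen
 * ymm registers, so unrolled like this (and further) the compiler ran
 * out and spilled accumulators to the stack, and the code got slower. */
void filter_4rows(const float *in, const float *coef,
                  float *out, int n, int stride)
{
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 a0 = _mm256_setzero_ps();
        __m256 a1 = _mm256_setzero_ps();
        __m256 a2 = _mm256_setzero_ps();
        __m256 a3 = _mm256_setzero_ps();
        for (int k = 0; k < TAPS; k++) {
            __m256 c = _mm256_set1_ps(coef[k]);
            a0 = _mm256_fmadd_ps(_mm256_loadu_ps(in + 0 * stride + i + k), c, a0);
            a1 = _mm256_fmadd_ps(_mm256_loadu_ps(in + 1 * stride + i + k), c, a1);
            a2 = _mm256_fmadd_ps(_mm256_loadu_ps(in + 2 * stride + i + k), c, a2);
            a3 = _mm256_fmadd_ps(_mm256_loadu_ps(in + 3 * stride + i + k), c, a3);
        }
        _mm256_storeu_ps(out + 0 * stride + i, a0);
        _mm256_storeu_ps(out + 1 * stride + i, a1);
        _mm256_storeu_ps(out + 2 * stride + i, a2);
        _mm256_storeu_ps(out + 3 * stride + i, a3);
    }
}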
As I was alluding to in this post, which you commented on, it's best to tell the compiler what you want, but unfortunately, you often have no choice and are forced to hand tweak optimizations, usually via guess and check.
So I guess my question would be, in these scenarios (the code is effectively small until unrolled, each incremental performance change is small, and you're working at a very low level of abstraction), would it be better to have "precision of timing" or is call stack sampling better at telling me which code is superior?
I faced a similar problem some time ago, but that was on Linux, which made it easier to tweak. Basically, the noise introduced by the OS (so-called "OS jitter") was as big as 5-10% in SPEC2000 tests (I imagine it's much higher on Windows due to the much larger amount of background software).
I was able to bring the deviation below 1% with a combination of the following:
disable dynamic frequency scaling (better do this both in BIOS and in Linux kernel as not all kernel versions do this reliably)
disable memory prefetching and other fancy settings like "Turbo boost", etc. (BIOS, again)
disable hyperthreading
enable high-performance process scheduler in kernel
bind the process to a core to prevent thread migration (use core 0 - for some reason it was more reliable on my kernel, go figure; see the sketch after this list for doing it from inside the program)
boot to single-user mode (in which no services are running) - this isn't as easy in modern systemd-based distros
disable ASLR
disable network
drop OS pagecache
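If you prefer to do the pinning from inside the benchmark process rather than with taskset, here is a minimal sketch (Linux-specific; it only covers the core-binding and paging part of the list above, and mlockall needs root or CAP_IPC_LOCK):

/* Pin the current process to core 0 and lock its pages in RAM. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <stdio.h>

int pin_and_lock(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                        /* core 0, as suggested above */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");                  /* needs root or CAP_IPC_LOCK */
        return -1;
    }
    return 0;
}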
There may be more to it but 1% noise was good enough for me.
I might put detailed instructions to github later today if you need them.
-- EDIT --
I've published my benchmarking script and instructions here.
Am I right that what you're doing is making an educated guess of what to fix, fixing it, and then trying to measure to see if it made any difference?
I do it a different way, which works especially well as the code gets large.
Rather than guess (which I certainly can) I let the program tell me how the time is spent, by using this method.
If the method tells me that roughly 30% is spent doing such-and-so, I can concentrate on finding a better way to do that.
Then I can run it and just time it.
I don't need a lot of precision.
If it's better, that's great.
If it's worse, I can undo the change.
If it's about the same, I can say "Oh well, maybe it didn't save much, but let's do it all again to find another problem."
I need not worry.
If there's a way to speed up the program, this will pinpoint it.
And often the problem is not just a simple statement like "line or routine X spends Y% of the time", but "the reason it's doing that is Z in certain cases" and the actual fix may be elsewhere.
After fixing it, the process can be done again, because a different problem, which was small before, is now larger (as a percent, because the total has been reduced by fixing the first problem).
Repetition is the key, because each speedup factor multiplies all the previous, like compound interest.
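For example (made-up numbers): three rounds that remove 30%, 40%, and 20% of the remaining time give speedup factors of about 1.43x, 1.67x, and 1.25x, and those multiply to roughly a 3x overall speedup, even though no single round looked dramatic.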
When the program no longer points out things I can fix, I can be sure it is nearly optimal, or at least nobody else is likely to beat it.
And at no point in this process did I need to measure the time with much precision.
Afterwards, if I want to brag about it in a PowerPoint, maybe I'll do multiple timings to get a smaller standard error, but even then, what people really care about is the overall speedup factor, not the precision.
I have an assignment where I need to write a benchmark program to test the performance of any processor with two sorting algorithms (an iterative one and a recursive one). The thing is, my teacher told me I have to create three different programs (that is, 3 .c files): two with the sorting algorithms (both of them have to read integers from a text file separated by \n's and write the same numbers, sorted, to another text file), and a benchmarking program. In the benchmark program I need to calculate the MIPS (million instructions per second) with the formula MIPS = NI / (T * 10^6), where NI is the number of instructions and T is the time required to execute them. I then have to be able to estimate the time each algorithm will take on any processor from its MIPS rating, by solving that equation for T: EstimatedTime = NI / (MIPS * 10^6).
My question is... how exactly do I measure the performance of a program with another program? I have never done something like that. I mean, I guess I can use the TIME functions in C and measure the time to execute X number of lines and stuff, but I can do that only if all 3 functions (2 sorting algorithms and 1 benchmark function) are in the same program. I don't even know how to start.
Oh and btw, I have to calculate the number of instructions by cross compiling the sorting algorithms from C to MIPS (the asm language) and counting how many instructions were used.
Any guidelines would be appreciated... I currently have these functions:
readfile (to read text files with ints on them)
writefile
sorting algorithms
On a Linux system, you can use hardware performance counters: perf stat ./a.out gives an accurate count of cycles, instructions, cache misses, and branch mispredicts (other counters are available too, but those are the defaults).
This gives you the dynamic instruction count, counting instructions inside loops the number of times they actually ran.
Cross-compiling for MIPS and counting instructions would easily give you a static instruction count, but getting a dynamic count that way would require actually following how the asm works to figure out how many times each loop runs.
How you compile the several files and link them together depends on the compiler. With GCC for example it could be something as simple as
gcc -O3 -g3 -W -Wall -Wextra main.c sortalgo_1.c sortalgo_2.c [...] sortalgo_n.c -o sortingbenchmark
It's not the most common way to do it, but good enough for this assignment.
If you want to count the opcodes, it is probably better to compile the individual C files to assembly. Do the following for every C file whose assembler output you want to analyze:
gcc -S sortalgo_n.c
Don't forget to put your function declarations into a common header file and include it everywhere you use them!
For benchmarking: you know the number of ASM operations for every C operation and can, although it's not easy, map that count onto every line of the C code. Once you have that, all you have to do is increment a counter. E.g.: if a line of C code translates to 123 ASM opcodes, you increment the counter by 123.
You can use one global variable to do so. If you use more than one thread per sorting algorithm you need to take care that the additions are atomic (Either use _Atomic or mutexes or whatever your OS/compiler/libraries offer).
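Here is a minimal sketch of that counting-plus-timing idea; the per-line opcode counts below are invented (yours would come from the -S listing), and insertion_sort just stands in for whichever algorithm you are benchmarking:

/* Sketch: count "executed opcodes" per C line and compute MIPS.
 * MIPS = NI / (T * 10^6). The COUNT() values are made up. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t ni;                       /* dynamic instruction-count estimate */
#define COUNT(n_opcodes) (ni += (n_opcodes))

static void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++) {         COUNT(7);  /* e.g. 7 opcodes for this line */
        int key = a[i], j = i - 1;        COUNT(5);
        while (j >= 0 && a[j] > key) {    COUNT(6);
            a[j + 1] = a[j]; j--;         COUNT(4);
        }
        a[j + 1] = key;                   COUNT(3);
    }
}

int main(void)
{
    enum { N = 10000 };
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = N - i;                     /* worst case for insertion sort */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    insertion_sort(a, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("NI = %llu, T = %.4f s, MIPS = %.1f\n",
           (unsigned long long)ni, t, ni / (t * 1e6));
    return 0;
}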
BTW: this looks like a very exact way to measure the runtime, but not every ASM opcode takes the same number of cycles on a real-world CPU. No need to worry about that for this assignment, but you should keep it in mind for later.
How do I use ftrace() (or anything else) to trace a specific, user-defined function in the Linux kernel? I'm trying to create and run some microbenchmarks, so I'd like to have the time it takes certain functions to run. I've read through (at least as much as I can) the documentation, but a step in the right direction would be awesome.
I'm leaning towards ftrace(), but having issues getting it to work on Ubuntu 14.04.
Here are a couple of options you may have depending on the version of the kernel you are on:
Systemtap - this is usually the ideal way; check the examples that ship with stap, you may find one that needs only minimal modification.
Oprofile - an option if you are on an older kernel; stap gives better precision than oprofile.
debugfs with the stack tracer option - good for debugging stack overruns. To use it, mount debugfs and turn on the stack tracer with echo 1 > /proc/sys/kernel/stack_tracer_enabled.
strace - if you want to identify the system calls made by the user-space program, along with some performance numbers: strace -fc <program name>
Hope this helps!
Ftrace is a good option and has good documentation.
Use WARN_ON(); it will print a stack trace showing how the function was reached.
For timing, I think you should use the timestamps shown in the kernel log, or the jiffies counter.
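If you can edit the code around the function, here is a sketch using ktime_get_ns() rather than raw jiffies, for nanosecond resolution (my_target_func is a placeholder for whatever kernel function you want to measure):

/* Kernel-side sketch: time one call and print the result to the log. */
#include <linux/ktime.h>
#include <linux/printk.h>

extern void my_target_func(void);   /* placeholder for the function under test */

static void time_my_target_func(void)
{
    u64 t0, t1;

    t0 = ktime_get_ns();             /* monotonic time in nanoseconds */
    my_target_func();
    t1 = ktime_get_ns();

    pr_info("my_target_func took %llu ns\n", (unsigned long long)(t1 - t0));
}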
Systemtap will also be useful in your situation. Systemtap is a tool in which you write code in a scripting-like language. It is very powerful; if you only want the execution time of a particular function, ftrace is probably better, but if you need a more advanced tool to analyze, e.g., performance problems in kernel space, systemtap can be very helpful.
Please read more - what you want to do is covered in section 5.2, "Timing function execution times".
If the function's execution time is interesting because it makes subsidiary calls to slow/blocking functions, then statement-by-statement tracing could work for you, without too much distortion due to the "probe effect" overheads of the instrumentation itself.
probe kernel.statement("function_name@dir/file.c:*") { println(tid(), " ", gettimeofday_us(), " ", pn()) }
will give you a trace of each separate statement in the function_name. Deltas between adjacent statements are easily computed by hand or by a larger script. See also https://sourceware.org/systemtap/examples/#profiling/linetimes.stp
To get the precision that I needed (CPU cycles), I ended up using get_cycles(), which is essentially a wrapper for RDTSC (but portable). ftrace may still be beneficial in the future, but all I'm doing now is taking the difference between the start and end CPU cycle counts and using that as a benchmark.
Update: To avoid parallelization of instructions, I actually ended up wrapping RDTSCP instead. I couldn't use RDTSC + CPUID because that caused a lot of delays from hypercalls (I'm working in a VM).
Use systemtap and try this script:
https://github.com/openresty/stapxx#func-latency-distr
Everyone, I am running gprof to check the percentage of execution time under two different optimization levels (-g -pg vs. -O3 -pg).
I got the result that one function takes 68% of execution time with -O3, but only 9% in the -g version.
I am not sure how to find the reason behind this. I am thinking of comparing the two versions of the compiled code, but I am not sure of the command to do so.
Is there any other method to find out the reasons for this execution time difference?
You have to be careful interpreting gprof/profiling results when you're using optimization flags. Compiling with -O3 can really change the structure of your code, so that it's impossible for gprof to tell how much time is spent where.
In particular, function inlining enabled at the higher optimization levels means that some of your functions are completely replaced by inline code, so they don't appear to take any time at all. The time that would have been spent in those child functions is then attributed to the parent functions that call them, so it can look like the time spent in a given parent function actually increased.
I couldn't find a really good reference for this. Here's one old example:
http://gcc.gnu.org/ml/gcc/1998-04/msg00591.html
That being said, I would expect this kind of strange behavior when running gprof with -O3. I always do profiling with just -O1 optimization to minimize these kinds of effects.
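If you do need one particular function to stay visible to gprof at -O3, one option, at the cost of perturbing the optimizer a little, is GCC's noinline attribute (a sketch):

/* Keep this function out of line even at -O3 so gprof can attribute
 * time to it (GCC-specific attribute). */
__attribute__((noinline))
double hot_inner_loop(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}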
I think there's a fundamental flaw in your reasoning: the fact that the function takes 68% of execution time in the optimized version versus just 9% in the unoptimized version does not mean that the unoptimized version performs better.
I'm quite sure, instead, that the -O3 version performs better in absolute terms, but the optimizer did a much better job on the other functions, so, in proportion to the rest of the optimized code, the given subroutine looks slower - while actually being faster than, or at least as fast as, the unoptimized version.
Still, to check the differences in the emitted code directly, you can use the -S switch. Also, to see whether my idea is correct, you can roughly compare the CPU time taken by the function at -O0 vs -O3 by multiplying that percentage by the user time reported by a command like time (I'm also quite sure you can get a measure of the absolute time spent in a subroutine from gprof; IIRC it's even in the default output).
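For instance (made-up numbers): if the -O0 build runs in 20 s, 9% is about 1.8 s inside that function; if the -O3 build runs in 2.5 s, 68% is about 1.7 s - a much higher percentage, yet slightly less absolute time.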
I am using the Intel C Compiler for the IA-32 architecture.
When I compile my C program with the following options:
icl mytest.c /openmp /QxHost /fp:fast /fast
The test run takes 3.3s. Now I tried to use PGO, so I compiled with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-gen
I then run the executable with my sample input 2-3 times and compile again with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-use
Hoping it would take the collected information into account. It does tell me it's using the .dyn files, but the resulting executable is slower (3.85s) than without /Qprof-use, and this is on exactly the same data the training runs were performed on (which should be ideal for PGO).
I tried setting the OpenMP thread count to one, thinking multithreading might mess with the .dyn output, but the result is the same: it's slower than the simple compilation.
My question is: is this even theoretically possible, or am I messing up the PGO process somehow with the compiler options?
A 3.3-second floating-point application isn't going to see much benefit from profile-guided optimization. My guess is that you're doing some sort of raw data crunching, which, if you need raw FLOPs, is better suited to hand-coded assembly than to PGO.
PGO will not tell the compiler how to optimize your inner loop to remove branch delays and keep the pipeline full. It may tell it if your loop is likely to run only 5,000 times or if your floats satisfy some criteria.
PGO is used with training data that is statistically representative of the other data you want the program to run on. In other words, you train it on sample data so that the program can run other, similar data at a good clip. It doesn't necessarily optimize for the specific run at hand and, as you said, may even slow it down a bit in exchange for a possible net gain.
It really depends on your program but an OpenMP FP app is not what PGO is for. Like everything else it isn't a "magic bullet."