How to create a mini-benchmark program in C - c

I have an assignment where I need to make a benchmark program to test the performance of any processor with two sorting algorithms (an iterative one and a recursive one). The thing is my teacher told me I have to create three different programs (that is, 3 .c files), two with each sorting algorithm (both of them have to read integers from a text file separated with \n's and write the same numbers to another text file but sorted), and a benchmarking program. In the benchmark program I need to calculate the MIPs (million instructions per second) with the formula MIPs = NI/T*10^6, where NI is the number of instructions and T is the time required to execute those instructions. I have to be able to estimate the time each algorithm will take on any processor by calculating its MIPs and then solving that equation for T, like EstimatedTime = NI/MIPs*10^6.
My question is... how exactly do I measure the performance of a program with another program? I have never done something like that. I mean, I guess I can use the TIME functions in C and measure the time to execute X number of lines and stuff, but I can do that only if all 3 functions (2 sorting algorithms and 1 benchmark function) are in the same program. I don't even know how to start.
Oh and btw, I have to calculate the number of instructions by cross compiling the sorting algorithms from C to MIPS (the asm language) and counting how many instructions were used.
Any guidelines would be appreciated... I currently have these functions:
readfile (to read text files with ints on them)
writefile
sorting algorithms

On a Linux system, you can use hardware performance counters: perf stat ./a.out and get an accurate count of cycles, instructions, cache misses, and branch mispredicts. (other counters available, too, but those are the default ones).
This gives you the dynamic instruction count, counting instructions inside loops the number of times they actually ran.
Cross-compiling for MIPS and counting instructions would easily give you a static instruction count, but would require actually following how the asm works to figure out how many times each loop runs.

How you compile the several files and link them together depends on the compiler. With GCC for example it could be something as simple as
gcc -O3 -g3 -W -Wall -Wextra main.c sortalog1.c sortalgo_2.c [...] sortalgo_n.c -o sortingbenchmark
It's not the most common way to do it, but good enough for this assignment.
If you want to count the opcodes it is probably better to compile the individual c-files individually to ASM. Do the following for every C-file you want to analyze the assembler output:
gcc -c -S sortalgo_n.c
Don't forget to put your function declarations into a common header file and include it everywhere you use them!
For benchmarking: you do know the number of ASM-operations for every C-operation and can, although it's not easy, map that count to every line of the C code. If you have that, all you have to do is to increment a counter. E.g.: if a line of C-code translates to 123 ASM opcodes you increment the counter by 123.
You can use one global variable to do so. If you use more than one thread per sorting algorithm you need to take care that the additions are atomic (Either use _Atomic or mutexes or whatever your OS/compiler/libraries offer).
BTW: it looks like a very exact way to measure the runtime but not every ASM-opcode runs in the same number of cycles on the CPU in the real world. No need for bothering today but you should keep it in mind for tomorrow.

Related

Performance comparison of a program: C / Assembly

I have a code in C and in assembly (x86 on linux) and I would like to compare their speed, but I don't know how to do it.
In C, the time.h library allows us to know the execution time of the program. But I don't know how to do that in assembly.
I have found the instruction rdtsc which allows us to know the number of clock cycles between two pieces of code. But I have the impression that there is a huge noise on the returned value (maybe because of what is running on the pc?) I don't see then how to compare the speed of these two programs. The time observed in the command prompt is apparently not a reference...
How should I proceed ? Thanks
I have tried to substitute values that I got with the assembly programm with the values I got from an empty code in order to have an average value, but values are still incoherent

How to print data about the execution of my code ?

When programming in haskell we have the interpreter option :set +s. It prints some information about the code you ran. When on ghci, prints the time spent on running the code and the number of bytes used. when on hugs, prints the number of reductions made by the interpreter and the number of bytes used. How can I do the same thing in C ? I know how to print the time spent running my c code and how to print the number of clocks spent by the processor to run it. But what about the number of bytes and reductions ? I want to know a good way to compare two differents codes that do the same thing and compare which is the most efficient for me.
Thanks.
If you want to compare performance, just compare time and used memory. Allow both programs exploit the same number of processor cores, write equivalent programs in both language and run benchmarks. If you are using a Unix, time(1) is your friend.
Everything else is not relevant to performance. If a program performed 10x more functions calls than another one, but ran in half of the time, it is still the one having the better performance.
The benchmark game web site compares different language using time/space criteria. You may wish to follow the same spirit.
For more careful profiling of portions of the programs, rather than the whole program, you can either use a profiler (in C) or turn on the profiling options (in GHC Haskell). Criterion is also a popular Haskell library to benchmark Haskell programs. Profiling is typically useful to spot the "hot points" in the code: long-running loops, frequently called functions, etc. This is useful because it allows the programmer to know where optimization is needed. For instance, if a function cumulatively runs for 0.05s, obtaining a 10x speed increase on that is far less useful than a 5% optimization on a function cumulatively running for 20 minutes (0.045s vs 60s gain).

PGO slower than static optimization (intel compiler)

I am using Intel C Compiler for I-32A architecture.
When I compile my C program with the following options:
icl mytest.c /openmp /QxHost /fp:fast /fast
The test run takes 3.3s. Now I tried to use PGO, so I compiled with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-gen
I then run the executable with my sample input 2-3 times and compile again with:
icl mytest.c /openmp /QxHost /fp:fast /fast /Qprof-use
Hoping it will take into account collected information. It in fact tells me it's using the .dyn files but resulting executable is slower (3.85s) than without Qprof-use and this is on exactly the same data the runs were performed (should be perfect for PGO).
I tried setting openmp threads to one, thinking it might mess with .dyn output but the result is the same - it's slower than simple compilation.
My question is: is it even theoretically possible or I am messing up PGO process somehow with the compiler options ?
A 3.3-second floating-point application isn't going to see benefit from profile-guided optimization. From my guess, you're doing some sort of raw data crunching, which is better suited to hand-coded assembly if you need raw FLOPs than it is to PGO.
PGO will not tell the compiler how to optimize your inner loop to remove branch delays and keep the pipeline full. It may tell it if your loop is likely to run only 5,000 times or if your floats satisfy some criteria.
It is used with data that is statistically representative of other data you want it to run on. In other words you use it with data on a program that you want to be able to run other data with at a good clip. It doesn't necessarily optimize for the program at hand and, as you said, may even slow it down a bit for a possible net gain.
It really depends on your program but an OpenMP FP app is not what PGO is for. Like everything else it isn't a "magic bullet."

what does the -p and -g flag in compiler

I have been profiling a C code and to do so I compiled with -p and -g flags. So I was wandering what do these flags actually do and what overhead do they add to the binary?
Thanks
Assuming you are using GCC, you can get this kind of information from the GCC manual
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options
-p
Generate extra code to write profile information suitable for the
analysis program prof. You must use this option when compiling the
source files you want data about, and you must also use it when
linking.
-g
Produce debugging information in the operating system's native format
(stabs, COFF, XCOFF, or DWARF 2). GDB can work with this debugging
information.
On most systems that use stabs format, -g enables use of extra debugging information that only GDB can use; this extra information
makes debugging work better in GDB but will probably make other
debuggers crash or refuse to read the program. If you want to control
for certain whether to generate the extra information, use -gstabs+,
-gstabs, -gxcoff+, -gxcoff, or -gvms (see below).
GCC allows you to use -g with -O. The shortcuts taken by optimized code may occasionally produce surprising results: some variables you
declared may not exist at all; flow of control may briefly move where
you did not expect it; some statements may not be executed because
they compute constant results or their values were already at hand;
some statements may execute in different places because they were
moved out of loops.
Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have
bugs.
-p provides information for prof, and -pg provides information for gprof.
Let's look at the latter.
Here's an explanation of how gprof works,
but let me condense it here.
When a routine B is compiled with -pg, some code is inserted at the routine's entry point that looks up which routine is calling it, say A.
Then it increments a counter saying that A called B.
Then when the code is executed, two things are happening.
The first is that those counters are being incremented.
The second is that timer interrupts are occurring, and there is a counter for each routine, saying how many of those interrupts happened when the PC was in the routine.
The timer interrupts happen at a certain rate, like 100 times per second.
Then if, for example, 676 interrupts occurred in a routine, you can tell that its "self time" was about 6.76 seconds, spread over all the calls to it.
What the call counts allow you to do is add them up to tell how many times a routine was called, so you can divide that into its total self time to estimate how much self time per call.
Then from that you can start to estimate "cumulative time".
That's the time spent in a routine, plus time spent in the routines that it calls, and so on down to the bottom of the call tree.
This is all interesting technology, from 1982, but if your goal is to find ways to speed up your program, it has a lot of issues.

how to count cycles?

I'm trying to find the find the relative merits of 2 small functions in C. One that adds by loop, one that adds by explicit variables. The functions are irrelevant themselves, but I'd like someone to teach me how to count cycles so as to compare the algorithms. So f1 will take 10 cycles, while f2 will take 8. That's the kind of reasoning I would like to do. No performance measurements (e.g. gprof experiments) at this point, just good old instruction counting.
Is there a good way to do this? Are there tools? Documentation? I'm writing C, compiling with gcc on an x86 architecture.
http://icl.cs.utk.edu/papi/
PAPI_get_real_cyc(3) - return the total number of cycles since some arbitrary starting point
Assembler instruction rdtsc (Read Time-Stamp Counter) retun in EDX:EAX registers the current CPU ticks count, started at CPU reset. If your CPU runing at 3GHz then one tick is 1/3GHz.
EDIT:
Under MS windows the API call QueryPerformanceFrequency return the number of ticks per second.
Unfortunately timing the code is as error prone as visually counting instructions and clock cycles. Be it a debugger or other tool or re-compiling the code with a re-run 10000000 times and time it kind of thing, you change where things land in the cache line, the frequency of the cache hits and misses, etc. You can mitigate some of this by adding or removing some code upstream from the module of code being tested, (to cause a few instructions added and removed changing the alignment of your program and sometimes of your data).
With experience you can develop an eye for performance by looking at the disassembly (as well as the high level code). There is no substitute for timing the code, problem is timing the code is error prone. The experience comes from many experiements and trying to fully understand why adding or removing one instruction made no or dramatic differences. Why code added or removed in a completely different unrelated area of the module under test made huge performance differences on the module under test.
As GJ has written in another answer I also recommend using the "rdtsc" instruction (rather than calling some operating system function which looks right).
I've written quite a few answers on this topic. Rdtsc allows you to calculate the elapsed clock cycles in the code's "natural" execution environment rather than having to resort to calling it ten million times which may not be feasible as not all functions are black boxes.
If you want to calculate elapsed time you might want to shut off energy-saving on the CPUs. If it's only a matter of clock cycles this is not necessary.
If you are trying to compare the performance, the easiest way is to put your algorithm in a loop and run it 1000 or 1000000 times.
Once you are running it enough times that the small differences can be seen, run time ./my_program which will give you the amount of processor time that it used.
Do this a few times to get a sampling and compare the results.
Trying to count instructions won't help you on x86 architecture. This is because different instructions can take significantly different amounts of time to execute.
I would recommend using simulators. Take a look at PTLsim it will give you the number of cycles, other than that maybe you would like to take a look at some tools to count the number of times each assembly line is executing.
Use gcc -S your_program.c. -S tells gcc to generate the assembly listing, that will be named your_program.s.
There are plenty of high performance clocks around. QueryPerformanceCounter is microsofts. The general trick is to run the function 10s of thousands of time and time how long it takes. Then divide the time taken by the number of loops. You'll find that each loop takes a slightly different length of time so this testing over multiple passes is the only way to truly find out how long it takes.
This is not really a trivial question. Let me try to explain:
There are several tools on different OS to do exactly what you want, but those tools are usually part of a bigger environment. Every instruction is translated into a certain number of cycles, depending on the CPU the compiler ran on, and the CPU the program was executed.
I can't give you a definitive answer, because I do not have enough data to pass my judgement on, but I work for IBM in the database area and we use tools to measure cycles and instructures for our code and those traces are only valid for the actual CPU the program was compiled and was running on.
Depending on the internal structure of your CPU's piplining and on the effeciency of your compiler, the resulting code will most likely still have cache misses and other areas you have to worry about. (In that case you may want to look into FDPR...)
If you want to know how many cycles your program needs to run on your CPU (which was compiled with your compiler), you have to understand how the CPU works and how the compiler generarated the code.
I'm sorry, if the answer was not sufficient enough to solve your problem at hand. You said you are using gcc on an x86 arch. I would work with getting the assembly code mapped to your CPU.
I'm sure you will find some areas, where gcc could have done a better job...

Resources