Performance benchmarking of MPI programs in C - benchmarking

I am new to MPI. Can anyone suggest how to benchmark MPI programs in C? The cluster I am using runs Rocks 4.3 (Mars Hill).

You could put an MPI_Barrier call at the beginning of the program, have each process record the time since the epoch right after it, and then compare that with the time since the epoch recorded at the end of the run.

Here are some parts of the answer, focussing on execution time:
Familiarise yourself with the MPI_Wtime function (MPI_WTIME in Fortran).
Have a poke around the Top 500 and see what they do about benchmarking. This might spark some ideas.
Plan to compare execution time for: serial program, MPI program running on one processor, MPI program running on N processors (for a range of Ns). Much benchmarking of parallel programs is about assessing their parallel scalability.
There's lots more; refine your question and you might get more apposite answers.

Related

About Dijkstra omp

Recently I downloaded the source code of an OpenMP Dijkstra implementation from the internet.
But I found that the parallel run always takes longer than the single-thread run, whether I use two, four or eight threads.
Since I'm new to OpenMP I really want to figure out what is happening.
This is due to the overhead of setting up the threads. The execution time of the work itself is theoretically the same, but the system has to set up the threads that manage the work (even if there's only one). For little work, or for only one thread, this overhead makes your time-to-solution slower than the serial time-to-solution.
Alternatively, if you see the time increasing dramatically as you increase the thread count, you may be running on only one core, so that the 2, 4 or 8 threads are simply taking turns on it.
Finally, it's possible that the way you're implementing Dijkstra's method is largely serial. But without looking at your code it would be too hard to say.

Performance of System()

Would calling the C function system() affect the hardware counters if you are trying to see how the command you ran performed?
For example, let's say I'm using the Performance API (PAPI) and the program is a precompiled matrix multiplication application:
PAPI_start_counters();
system("./matmul");
PAPI_read_counters();
//Print out values
PAPI_stop_counters();
I am obviously missing a bit, but what I am trying to find out is whether it is possible, through the use of said counters, to get the performance of a program I'm running.
From my tests I get wild numbers like the ones below. They are obviously wrong; I just want to find out why.
Total Cycles =========== 140733358872510
Instructions Completed =========== 4203968
Floating Point Instructions =========== 0
Floating Point Operations =========== 4196867
Loads =========== 140733358872804
Stores =========== 4204037
Branches Taken =========== 15774436
system() is a very slow function in general. On Linux, it spawns /bin/sh (forking and executing a full shell process), which parses your command, and spawns the second program. Loading these two programs requires loading the code to memory, initializing all their libraries, executing startup code, etc. Only then will the program code actually start executing.
Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.
The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.
Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap only those instructions (or C statements) in counters that you want to measure.
For more information see:
Related discussion here
Intel® Performance Counter Monitor here
Performance measurements with x86 RDTSC instruction here
As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.
The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicates that the executable "matmul" is doing a lot of waiting for external processes (e.g. disk I/O) to complete. I do not know the specifics of the FP Instructions and FP Operations counters, but if PAPI reports those values differently it has a reason.
What is interesting is that the loads and cycles are obviously connected as well as instructions/fp ops and stores.
I would have to know about the internals of "matmul" in order to give you a better description.

Why does using gprof block the execution of a program?

I am using gprof to calculate the time spent in each function during the execution of my program.
Last week I noticed that when CPU usage reached 100%, the program could not even start!
The code ran for almost a day and nothing changed.
CPU usage reaching 100% is inevitable in some cases, especially when I want to stress my system and test the program while it uses the maximum amount of resources, with the help of the "stress" tool: http://weather.ou.edu/~apw/projects/stress/
I have read the thread :
Alternatives to gprof
and read Mike Dunlavey's response:
What about problems that are not so localized? Do those not matter?
Don't place expectations on gprof that were never claimed for it. It
is only a measurement tool, and only of CPU-bound operations.
and also Norman Ramsey's response that had the high score :
Valgrind has an instruction-count profiler with a very nice visualizer called KCacheGrind. As Mike Dunlavey recommends, Valgrind counts the fraction of instructions for which a procedure is live on the stack, although I'm sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light years ahead of gprof.
but as that thread was closed as not constructive, I was wondering whether this is a good direction to follow.
Thanks in advance
P.S. While using google search, I didn't find something relevant when asking questions like
"why gprof doesn't work when cpu reach 100 %"
Thanks in advance
All that 100% means is it's hung, and it's not doing I/O.
You're saying the program hangs when you run it with gprof, but not if you don't?
That's weird, but I wouldn't bother trying to figure it out.
As I've said over and over, I would just grab several stack samples manually.
Then the percent of time used by any routine is just the fraction of samples it appears on, more or less.
If you think you need high-precision measurements, try a stack-sampler like Zoom or OProfile.

Parallel Demonstration Program

An assignment that I've just now completed requires me to create a set of scripts that can configure random Ubuntu machines as nodes in an MPI computing cluster. This has all been done and the nodes can communicate with one another properly. However, I would now like to demonstrate the efficiency of said MPI cluster by throwing a parallel program at it. I'm just looking for a straight brute force calculation that can divide up work among the number of processes (=nodes) available: if one node takes 10 seconds to run the program, 4 nodes should only take about 2.5.
With that in mind I looked for prime calculation programs written in C. For any purists, the program is not actually part of my assignment as the course I'm taking is purely systems management. I just need anything that will show that my cluster is working. I have some programming experience but little in C and none with MPI. I've found quite a few sample programs but none of those seem to actually run in parallel. They do distribute all the steps among my nodes so if one node has a faster processor the overall time will go down, but adding additional nodes does nothing to speed up the calculation.
Am I doing something wrong? Are the programs that I've found simply not parallel? Do I need to learn C programming for MPI to write my own program? Are there any other parallel MPI programs that I can use to demonstrate my cluster at work?
EDIT
Thanks to the answers below I've managed to get several MPI scripts working, among which the sum of the first N natural numbers (which isn't very useful as it runs into data type limits), the counting and generating of prime numbers and the Monte Carlo calculation of Pi. Interestingly only the prime number programs realise a (sometimes dramatic) performance gain with multiple nodes/processes.
The issue that caused most of my initial problems with getting scripts working was rather obscure and apparently due to issues with hosts files on the nodes. Running mpiexec with the -disable-hostname-propagation parameter solved this problem, which may manifest itself in a variety of ways: MPI(R) barrier errors, TCP connect errors and other generic connection failures. I believe it may be necessary for all nodes in the cluster to know one another by hostname, which is not really an issue in classic Beowulf clusters that have DHCP/DNS running on the server node.
The usual proof of concept application in parallel programming is simple raytracing.
That being said, I don't think that raytracing is a good example to show off the power of OpenMPI. I'd put the emphasis on scatter/gather or even better scatter/reduce, because that's where MPI gets the true power :)
The most basic example for that would be calculating the sum over the first N integers. You'll need a master process that fills an array with the value ranges to sum over and scatters these ranges across the workers.
Then you'll need to do a reduction and check your result against the explicit formula, to get a free validation test.
If you're looking for a weaker spot of MPI, a parallel grep might work, where IO is the bottleneck.
EDIT
You'll have to keep in mind that MPI is based on a shared-nothing architecture where the nodes communicate using messages, and that the number of nodes is fixed. These two factors set a very tight frame for the programs that run on it. To make a long story short, this kind of parallelism is great for data-parallel applications, but sucks for task-parallel applications, because you can usually distribute data better than tasks if the number of nodes changes.
Also, MPI has no concept of implicit work-stealing. If a node is finished working, it just sits around waiting for the other nodes to finish. That means you'll have to figure out weakest-link handling yourself.
MPI is very customizable when it comes to performance details; there are numerous different variants of MPI_Send, for example. That leaves much room for performance tweaking, which is important for high performance computing, for which MPI was designed, but it mostly confuses "ordinary" programmers, leading to programs that actually get slower when run in parallel. Maybe your examples just suck :)
And on the scaleup / speedup problem, well...
I suggest that you read up on Amdahl's Law, and you'll see that as long as some part of the program stays serial, it's impossible to get linear speedup by just adding more nodes :)
I hope that helped. If you still have questions, feel free to drop a comment :)
EDIT2
Maybe the best scaling problem that integrates perfectly with MPI is the empirical estimation of Pi.
Imagine a quarter circle with radius 1 inside a square with sides of length 1; you can then estimate Pi by firing random points into the square and checking whether they land inside the quarter circle.
Note: this is equal to generating tuples (x,y) with x,y in [0, 1] and measuring how many of these have x² + y² <= 1.
Pi is then roughly equal to
4 * Points in Circle / total Points
In MPI you'd just have to gather the ratios generated by all processes, which is very little overhead and thus gives a perfect proof-of-concept problem for your cluster.
Like with any other computing paradigm, there are certain well established patterns in use with distributed memory programming. One such pattern is the "bag of jobs" or "controller/worker" (previously known as "master/slave", but now the name is considered politically incorrect). It is best suited for your case because:
under the right conditions it scales with the number of workers;
it is easy to implement;
it has built-in load balancing.
The basic premises are very simple. The "controller" process has a big table/queue of jobs and practically executes one big loop (possibly an infinite one). It listens for messages from "worker" processes and responds back. In the simplest case workers send only two types of messages: job requests or computed results. Consequently, the controller process sends two types of messages: job descriptions or termination requests.
And the canonical non-trivial example of this pattern is colouring the Mandelbrot set. Computing each pixel of the final image is done completely independent from the other pixels, so it scales very well even on clusters with high-latency slow network connects (e.g. GigE). In the extreme case each worker can compute a single pixel, but that would result in very high communication overhead, so it is better to split the image in small rectangles. One can find many ready-made MPI codes that colour the Mandelbrot set. For example this code uses row decomposition, i.e. a single job item is to fill one row of the final image. If the number of MPI processes is big, one would have to have fairly large image dimensions, otherwise the load won't balance well enough.
MPI also has mechanisms that allow spawning additional processes or attaching externally started jobs in client/server fashion. Implementing them is not rocket science, but still requires some understanding of advanced MPI concepts like intercommunicators, so I would skip that for now.

Can you ask for additional processors on the fly in MPI?

With MPI in C, you can do the following to run a program:
mpirun -np 5 program
where 5 is the number of processors to use and program is the program to run on these processors. Is it possible to request x processors like above but then whilst the program is running, if you decide you need y processors (let's say y>x), can you request more without restarting the program?
If the answer is yes, then how do you do this? If no then why not?
Many thanks.
This is possible, but it's not trivial. The application needs to have been coded to support this. Fundamentally, there's the issue that MPI provides various global communication and synchronization primitives, and it's not clear what to do with such operations when adding new parallelism - you wouldn't want new processes to non-deterministically deadlock or crash others, after all.
Here's some documentation on IBM's site - any MPI-2 implementation should conform to the same outline. Jonathan points out that the MPI Specification itself includes a pretty good example of doing this for a master-worker sort of problem.

Resources