Is there an analog of `mpirun -n` for a function/subroutine? - c

I'd like to benchmark OMP and MPI. I already have a function which is properly implemented as a serial function, as a function using openmp, and as a function using MPI.
Now, I would like to benchmark the implementations and see how it scales with the number of threads.
I could do it by hand in the following way:
$./my_serial
which returns the average execution time for some computations,
then $./my_OMP <number of threads>, where the number of threads is passed to the directive #pragma omp parallel for private(j) shared(a) num_threads(nthreads) (with j a loop variable and a an array of doubles); this also returns the average execution time.
I automated the part that calls my_OMP with increasing number of threads. Now, I would like to implement something similar using a function that is based on MPI. Again, by hand I would do:
$mpirun -np <number_of_threads> ./my_mpi,
which would return me the average computation time and then I would increase the number of threads.
So, instead of calling my_mpi several times by hand and noting down the return value, I'd like to automate this, i.e. I'd like to have a loop that increases the number of threads and stores the return values in an array.
How can I implement that?
I thought about running the program once with $mpirun -np MAX_NUMBER_OF_PARALLEL_THREADS ./my_mpi and then limiting the number of parallel threads to 1 for the serial and OMP part and increasing the limit of parallel threads in a loop for the actual MPI part, but I don't know how to do this.

You're misunderstanding how MPI works. MPI is process based, so mpirun -n 27 yourprogram starts 27 instances of your executable, and they communicate through OS means. Thus there is no "serial part": each process executes every single statement in your source.
What you could do is
give your timed routine a communicator argument
make a series of increasingly sized subcommunicators, and
call your timed routine with those subcommunicators.
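A minimal sketch of that idea (the routine name my_timed_routine and its signature are my assumptions, not from the question): start the job once with the maximum number of processes, then loop over subcommunicators of increasing size.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical: runs the benchmark on the processes in comm and returns
   the average execution time. Replace with the real timed routine. */
double my_timed_routine(MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    MPI_Barrier(comm);              /* placeholder "work" */
    return MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* launched as: mpirun -np MAX ./bench */

    for (int nprocs = 1; nprocs <= size; nprocs++) {
        /* Ranks 0..nprocs-1 join the subcommunicator; the rest get MPI_COMM_NULL. */
        int color = (rank < nprocs) ? 0 : MPI_UNDEFINED;
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);

        if (sub != MPI_COMM_NULL) {
            double t = my_timed_routine(sub);
            if (rank == 0)
                printf("%d processes: %f s\n", nprocs, t);
            MPI_Comm_free(&sub);
        }
        /* Keep the idle ranks in step before the next measurement. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}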

Related

OpenMP impact on performance

I am trying to parallelize a script using OpenMP, but when I measure its execution time (using omp_get_wtime) the results are pretty odd:
if I set the number of threads to 2 it measures 4935 us
setting it to 1 takes around 1083 us
and removing every openmp directive turns that into only 9 us
Here's the part of the script I'm talking about (this loop is nested inside another one)
for (j = (i-1); j >= 0; j--) {
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if (arreglo[j] > y) {
                arreglo[j+2] = arreglo[j];
            }
            else if (arreglo[j] > x) {
                if (!flag[1]) {
                    arreglo[j+2] = y;
                    flag[1] = 1;
                }
                arreglo[j+1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if (arreglo[j] <= x) {
                arreglo[j+1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1) { break; }
}
What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?
We are talking about a really short execution time, which can easily be affected by the environment used for the benchmark;
You are clearly using an input size that does not justify the overhead of the parallelism;
Your current design only allows for 2 threads, so there is no room for scaling;
Instead of using the single construct, you might as well statically divide those two code branches based upon the thread ID; you would save the overhead of the single construct (see the sketch after this list);
That last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
Furthermore, your code looks intrinsically sequential; with the current design it is clearly not suitable for parallelism.
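As an illustration of the thread-ID suggestion above, a hedged sketch using the original variable names (it needs #include <omp.h>; note that, unlike the two single constructs, which are separated by an implicit barrier, the two branches now run concurrently, which is only safe if they touch disjoint data):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        /* branch formerly in the first single construct */
        if (arreglo[j] > y) {
            arreglo[j+2] = arreglo[j];
        } else if (arreglo[j] > x) {
            if (!flag[1]) {
                arreglo[j+2] = y;
                flag[1] = 1;
            }
            arreglo[j+1] = arreglo[j];
        }
    } else {
        /* branch formerly in the second single construct */
        if (arreglo[j] <= x) {
            arreglo[j+1] = x;
            flag[0] = 1;
            a = 1;
        }
    }
}   /* implicit barrier at the end of the parallel region */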
if I set the number of threads to 2 it measures 4935 us, setting it to 1 takes around 1083 us, and removing every openmp directive turns that into only 9 us
With 2 threads you are paying all that synchronization overhead; with 1 thread you are paying the price of having OpenMP there at all. Finally, without the parallelization you removed all that overhead, hence the lower execution time.
Btw, you do not need to remove the OpenMP directives; just compile the code without the -fopenmp flag and the directives will be ignored.

In OpenMP, how do I ensure threads are synchronized before continuing?

I am using a #pragma omp barrier to ensure that all my parallel threads meet up at the same point before continuing (no fancy conditional branching code, just a straight loop), but I surmise that the barrier pragma does not actually guarantee synchronicity, just completion, as these are the results I am getting:
0: func() size: 64 Time: 0.000414 Start: 1522116688.801262 End: 1522116688.801676
1: func() size: 64 Time: 0.000828 Start: 1522116688.801263 End: 1522116688.802091
Thread 0 starts about a microsecond earlier than thread 1, giving it the somewhat unrealistic completion time of 0.414 msec; incidentally, in a single core/thread run the run time averages around 0.800 msec (please forgive me if my units are off, it is late).
My Question is: Is there a way to ensure in openMP that threads are all started at the same time? Or would I have to bring in another library like pthread in order to have this functionality?
The barrier statement in OpenMP, as in other languages, ensures no thread progresses until all threads reach the barrier.
It does not specify the order in which threads begin to execute again. As far as I know, manually scheduling threads is not possible in the OpenMP or Pthreads libraries (see comment below).
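If the goal is just to make every thread start its clock from (nearly) the same instant, you can place an explicit barrier immediately before taking the start timestamp. A minimal sketch (busy_work() is a placeholder for the asker's func(), not code from the question):

#include <omp.h>
#include <stdio.h>

/* Placeholder for the timed work in the question. */
static void busy_work(int size)
{
    volatile double s = 0.0;
    for (int i = 0; i < size * 100000; i++)
        s += i;
}

int main(void)
{
    const int size = 64;

    #pragma omp parallel
    {
        #pragma omp barrier                  /* everyone reaches this point...    */
        double start = omp_get_wtime();      /* ...then all clocks start together */
        busy_work(size);
        double end = omp_get_wtime();

        printf("%d: size: %d Time: %f Start: %f End: %f\n",
               omp_get_thread_num(), size, end - start, start, end);
    }
    return 0;
}

The threads will still finish at slightly different times, but the start times now differ only by the barrier release jitter.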

MPI calls are slow in an OpenMP section

I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_MAX_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
My MPI program creates 8 threads, 1 of which should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);
#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while (!stop)
        {
            // Vector Scatter with MPI Send/Recv
            // Check stop criteria
        }
    }

    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while (!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptional amount of time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried forcing MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE to see if I can convince MPI that it's already safe, but this had no effect. Is there something I'm missing?
2) My computation thread updates values used by the communications (it's actually a deliberate race condition, as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be that I'm facing cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional information: the vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation may not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecvs beneath the surface.
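Regarding theory 1: the thread support level is requested once, at initialization. A minimal sketch of asking for MPI_THREAD_MULTIPLE and checking what the library actually granted (this is generic MPI, not PETSc-specific, and does not by itself explain the slowdown):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);

    /* ... set up PETSc / the hybrid solver here ... */

    MPI_Finalize();
    return 0;
}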

OpenCL for loop execution model

I'm currently learning OpenCL and came across this code snippet:
int gti = get_global_id(0);
int ti = get_local_id(0);
int n = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
        float4 p2 = pblock[j]; /* Read a cached particle position */
        float4 d = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w*invr*invr*invr;
        a += f*d; /* Accumulate acceleration */
    }
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.
Here are my questions (mainly to do with the for loops):
How exactly are for loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work-group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work-items, with each work-item executing a part of the loop (i.e. work-item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?
In this code snippet, how do the outer and inner loops execute? Does OpenCL know that the outer loop divides the work among all the work-groups and that the inner loop tries to divide the work among the work-items within each work-group?
If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least an attempt is made to do so), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
I hope my questions are clear enough and I appreciate any input.
1) How exactly are for loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work-group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work-items, with each work-item executing a part of the loop (i.e. work-item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?
You are right: all work items run the same code, but note that they may not run it at the same pace; they run the same code only in the logical sense. In the hardware, the work items inside the same wave (AMD term) or warp (NVIDIA term) follow exactly the same footprint at the instruction level.
In terms of a loop, it is nothing more than a few branch operations at the assembly level. Threads from the same wave execute the branch instruction in parallel. If all work items meet the same condition, they still follow the same path and run in parallel. However, if they don't agree on the condition, there will typically be divergent execution. For example, in the code below:
if (condition is true)
    do_a();
else
    do_b();
Logically, the work items that meet the condition execute do_a(), while the others execute do_b(). In reality, however, the work items in a wave execute in exactly the same step in the hardware, so it is impossible for them to run different code in parallel. Instead, some work items are masked out while the wave executes do_a(); when that is finished, the wave moves on to do_b(), and the remaining work items are masked out. For either branch, only part of the work items are active.
Going back to the loop question: since the loop is a branch operation, if the loop condition is true only for some work items, the situation above occurs, in which some work items execute the code in the loop while the others are masked out. However, in your code:
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
The loop condition does not depend on the work item IDs, which means that all the work items will have exactly the same loop condition, so they will follow the same execution path and be running in parallel all the time.
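For contrast, a loop whose trip count depends on the work-item ID would diverge. In the sketch below (my example, with hypothetical buffers partial and data), work items in the same wave iterate a different number of times, so the faster ones sit masked out until the longest-running one finishes:

int lid = get_local_id(0);
for (int j = 0; j < lid; j++)        /* bound differs per work item: divergent */
    partial[lid] += data[j];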
2) In this code snippet, how do the outer and inner loops execute? Does OpenCL know that the outer loop divides the work among all the work-groups and that the inner loop tries to divide the work among the work-items within each work-group?
As described in answer to (1), since the loop conditions of outer and inner loops are the same for all work items, they always run in parallel.
In terms of workload distribution in OpenCL, it relies entirely on the developer to specify how to distribute the workload. OpenCL does not know anything about how to divide the workload among work-groups and work-items. You partition the workload by assigning different data and operations to each work item using the global work ID or the local work ID. For example,
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code has each work item fetch two values from consecutive memory locations and store the computed result back into consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least an attempt is made to do so), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without a specifier, which means they are private to the work item itself. On a GPU, these variables are first assigned to registers; however, registers are typically a very limited resource, so when they are used up the variables are placed in private memory instead, which is called register spilling (depending on the hardware, this may be implemented in different ways; e.g. on some platforms private memory is backed by global memory, so any register spilling causes a great performance degradation).
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.
Heterogeneous programming works on a work-distribution model, meaning each thread gets its own portion of the work and starts on it.
1.1) As you know, threads are organized into work-groups (or thread blocks), and in your case each thread in the work-group (or thread block) brings data from global memory into local memory.
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];   /* I assume pblock is local memory */
1.2) Now all threads in the thread block have the data they need in their local storage (so there is no need to go to global memory anymore).
1.3) Now comes the processing. If you look carefully at the for loop where the processing takes place,
for(int j=0; j<nt; j++) {
you see that it runs once for each thread in the block (nt, the local size). So this loop design makes sure that all threads process separate data elements.
1) A for loop is just another C statement to OpenCL, and every thread executes it as is; it is up to you how you divide the work. OpenCL will not do anything internally with your loop (see point 1.1).
2) OpenCL doesn't know anything about your code; it is you who divides the loops.
3) Same as in point 1: the inner loop is not divided among the threads. All threads execute it as is; the only difference is that each one points to the data it wants to process.
I guess this confusion arises because you jumped into the code before knowing much about thread blocks and local memory. I suggest you look at the initial version of this code, which does not use local memory at all.
How exactly are for-loops executed in OpenCL?
They can be unrolled automatically into pages of code, which can make them slower or faster to complete. The SALU is used for the loop counter, so when you nest loops there is more SALU pressure, and it becomes a bottleneck when more than 9-10 loops are nested (maybe some intelligent algorithm using the same counter for all loops would do the trick). So not doing only SALU work in the loop body, but adding some VALU instructions, is a plus.
They run in parallel in SIMD fashion, so all threads' loops are locked to each other unless there is branching or a memory operation. If one loop is adding something, all the other threads' loops are adding too, and if some finish sooner they wait for the last thread to finish. When they all finish, they continue to the next instruction (unless there is branching or a memory operation). If there is no local/global memory operation, you don't need synchronization. This is SIMD, not MIMD, so it is not efficient when the loops are not doing the same thing in all threads.
In this code snippet, how does the outer and inner loops execute?
nb and nt are constants and they are the same for all threads, so all threads do the same amount of work.
If the inner loop is divided among the work-items
That needs OpenCL 2.0, which has the ability for fine-grained control (and spawning kernels from within a kernel).
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
Look for "subgroup-level functions" and "region growing" titles.
All subgroup threads would have their own accumulators, which are then added together at the end using a "reduction" operation for speed.
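A rough illustration of that "per-thread accumulators combined by a reduction" idea, written as a classic OpenCL 1.x local-memory tree reduction rather than with the OpenCL 2.0 subgroup functions (my sketch, not code from the linked article; it assumes a power-of-two work-group size):

__kernel void reduce_sum(__global const float *in,
                         __global float *out,
                         __local float *scratch)
{
    int lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];        /* one private value per work item */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Pairwise tree reduction inside the work-group. */
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];      /* one partial sum per work-group */
}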

Pseudo Code for Producer Consumer Synchronization

I'm having some trouble writing pseudocode for a homework assignment in my operating systems class, in which we are programming in C.
You will be implementing a Producer-Consumer program with a bounded buffer queue of N elements, P producer threads and C consumer threads (N, P and C should be command line arguments to your program, along with three additional parameters, X, Ptime and Ctime, that are described below). Each producer thread should Enqueue X different numbers onto the queue (spin-waiting for Ptime*100,000 cycles in between each call to Enqueue). Each consumer thread should Dequeue P*X/C items from the queue (spin-waiting for Ctime*100,000 cycles in between each call to Dequeue). The main program should create/initialize the Bounded Buffer Queue, print a timestamp, spawn off C consumer threads & P producer threads, wait for all of the threads to finish and then print off another timestamp & the duration of execution.
My main difficulty is understanding what my professor means by spin-waiting for the variables times 100,000. I have bolded the section that is confusing me.
I understand a time stamp will be used to print the difference between each thread. We are using semaphores and implementing synchronization at the moment. Any suggestions on the above queries would be much appreciated.
I'm guessing it means busy-waiting; repeatedly checking the loop condition and consuming unnecessary CPU power in a tight loop:
while (current_time() <= wake_up_time);
One would ideally use something that suspends your thread until it's woken up externally, by the scheduler (so resources such as the CPU can be diverted elsewhere):
sleep(2 * 60 * 1000 ms);
or at least give up some CPU (i.e. not be so tight):
while (current_time() <= wake_up_time)
sleep(100 ms);
But I guess they don't want you to manually invoke the scheduler, i.e. hint to the OS (or your threading library) that it's a good time to make a context switch.
I'm not sure what cycles are; in assembly they might be CPU cycles but given that your question is tagged C, I'll bet that they're simply loop iterations:
for (int i=0; i<Ptime*100000; ++i); //spin-wait for Ptime*100,000 cycles
Though it's always safest to ask whoever issued the homework.
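One caveat worth adding (my note, not part of the original answer): with optimizations enabled, a compiler may remove an empty counting loop entirely, so making the counter volatile keeps the busy-wait in place (Ptime here is the command-line parameter from the assignment):

volatile long i;                      /* volatile stops the compiler from optimizing the loop away */
for (i = 0; i < (long)Ptime * 100000L; ++i)
    ;                                 /* spin-wait for Ptime*100,000 iterations */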
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, such as whether keyboard input is available or whether a lock is available.
So the assignment says: spin-wait for Ptime*100,000 cycles before producing the next element, and enqueue X different elements this way.
Similarly, each consumer thread should dequeue P*X/C items from the queue, spin-waiting for Ctime*100,000 cycles after each item it consumes.
I suspect that your professor is being a complete putz, by actually ASKING for the worst "busy waiting" technique in existence:
int n = pTime * 100000;
for ( int i=0; i<n; ++i) ; // waste some cycles.
I also suspect that he still uses a pterosaur thigh-bone as a walking stick, has a very nice (dry) cave, and a partner with a large bald patch.... O/S guys tend to be that way. It goes with the cool beards.
No wonder his thoroughly modern students misunderstand him. He needs to (re)learn how to grunt IN TUNE.
Cheers. Keith.
