MPI calls are slow in an OpenMP section - c

I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
My MPI program creates 8 threads, 1 of which should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);
#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while (!stop)
        {
            // Vector scatter with MPI Send/Recv
            // Check stop criteria
        }
    }
    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while (!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptional amount of time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried forcing MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE to see if I can convince MPI that it's already safe, but this had no effect. Is there something I'm missing?
2) My computation thread updates values used by the communications (it's actually a deliberate race condition - as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be that I'm facing cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional Information: The vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation may not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecvs beneath the surface.

Related

Is there an analog of `mpirun -n` for a function/subroutine?

I'd like to benchmark OpenMP and MPI. I already have a function that is properly implemented as a serial function, as a function using OpenMP, and as a function using MPI.
Now, I would like to benchmark the implementations and see how it scales with the number of threads.
I could do it by hand in the following way:
$./my_serial
which returns the average execution time for some computations,
then $./my_OMP <number of threads>, where the number of threads is passed to an expression #pragma omp parallel for private(j) shared(a) num_threads(nthreads), with j a loop variable and a an array of doubles; this also returns the average execution time.
I automated the part that calls my_OMP with increasing number of threads. Now, I would like to implement something similar using a function that is based on MPI. Again, by hand I would do:
$mpirun -np <number_of_threads> ./my_mpi,
which would return me the average computation time and then I would increase the number of threads.
So, instead of calling my_mpi several times by hand and noting down the return value, I'd like to automate this, i.e. I'd like to have a loop that increases the number of threads and stores the return values in an array.
How can I implement that?
I thought about running the program once with $mpirun -np MAX_NUMBER_OF_PARALLEL_THREADS ./my_mpi and then limiting the number of parallel threads to 1 for the serial and OMP parts and increasing the limit of parallel threads in a loop for the actual MPI part, but I don't know how to do this.
You're misunderstanding how MPI works. MPI is process based, so mpirun -n 27 yourprogram starts 27 instances of your executable, and they communicate through OS means. Thus there is no "serial part": each process executes every single statement in your source.
What you could do is:
- give your timed routine a communicator argument,
- make a series of increasingly sized subcommunicators, and
- call your timed routine with those subcommunicators.

OpenMP impact on performance

I am trying to parallelize a script using OpenMP, but when I measure its execution time (with omp_get_wtime) the results are pretty odd:
- setting the number of threads to 2 measures 4935 us,
- setting it to 1 takes around 1083 us,
- and removing every OpenMP directive turns that into only 9 us.
Here's the part of the script I'm talking about (this loop is nested inside another one):
for (j = (i - 1); j >= 0; j--) {
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if (arreglo[j] > y) {
                arreglo[j + 2] = arreglo[j];
            }
            else if (arreglo[j] > x) {
                if (!flag[1]) {
                    arreglo[j + 2] = y;
                    flag[1] = 1;
                }
                arreglo[j + 1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if (arreglo[j] <= x) {
                arreglo[j + 1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1) { break; }
}
What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?
- We are talking about a really short execution time, which can easily be affected by the environment used for the benchmark.
- You are clearly using an input size that does not justify the overhead of the parallelism.
- Your current design only allows for 2 threads, so there is no room for scaling.
- Instead of using the single construct, you might as well statically divide those two code branches based upon the thread ID; you would save the overhead of the single construct.
- That last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
Furthermore, your code just looks intrinsically sequential, and with the current design, the code is clearly not suitable for parallelism.
if i set the number of threads to 2 it measures 4935 us setting it to
1 takes around 1083 us and removing every openmp directive turns that
into only 9 us
With 2 threads you are paying all that synchronization overhead; with 1 thread you are paying the price of having the OpenMP runtime there. Finally, without the parallelization you removed all that overhead, hence the lower execution time.
By the way, you do not need to remove the OpenMP directives: just compile the code without the -fopenmp flag and the directives will be ignored.

C OpenMP : libgomp: Thread creation failed: Resource temporarily unavailable

I was trying to do a basic project and it seems like I have run out of threads? Do you guys know how I can fix the problem?
Here is the code:
int main()
{
    omp_set_num_threads(2150);
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}
and here is the global compiler setting I have written under "other compiler options" in Code::Blocks:
-fopenmp
I am getting the error of:
libgomp: Thread creation failed: Resource temporarily unavailable
I have seen similar threads on the site, but I have not found a solution yet.
Specs:
Intel i5 6400
2x8GB ram
Windows 10 64 bit
The problem is
omp_set_num_threads(2150);
The OS imposes a limit on the number of threads a process can create. This may be indirect, for example by limiting the stack size; creating 2150 threads exceeds that limit.
You mention that you have an Intel i5 6400, which is a quad-core chip. Try setting the number of threads to something more reasonable. In your case:
omp_set_num_threads(4);
For numerical processing, performance will likely suffer when using more than 4 threads on a 4-core system.

What is the equivalent of OpenMP Tasks in Pthreads in this recursion example?

I am studying parallel programming and I use the following OpenMP directive to parallelize a recursive function:
void recursiveFunction()
{
    // sequential code
    #pragma omp task
    {
        recursiveFunction(); // First instance
    }
    // The two instances are independent of each other,
    // so they allow an embarrassingly parallel strategy.
    recursiveFunction(); // Second instance
}
It works well enough, but I am having a hard time making an equivalent parallelization using only pthreads.
I was thinking something like this:
void recursiveFunction()
{
    // sequential code
    pthread_t thread;
    // First instance
    pthread_create(&thread, NULL, recursiveFunction, recFuncStructParameter);
    // Second instance
    recursiveFunction();
}
And... I am kind of lost here... I cannot grasp how to control the number of threads if, for example, I want only 16 threads to be created, and if all of them are busy, continue sequentially until one of them is freed, then go parallel again.
Could someone point me in the right direction? I have seen many examples that seem really complicated, but I have a feeling that in this particular case, which allows an embarrassingly parallel strategy, there is a simple approach that I am unable to spot...

In openMP how do I ensure threads are synchronized before continuing?

I am using a #pragma omp barrier to ensure that all my parallel threads meet up at the same point before continuing (no fancy conditionally branching code, just a straight loop), but I am surmising that the barrier pragma does not actually guarantee simultaneity, just completion, as these are the results I am getting:
0: func() size: 64 Time: 0.000414 Start: 1522116688.801262 End: 1522116688.801676
1: func() size: 64 Time: 0.000828 Start: 1522116688.801263 End: 1522116688.802091
thread 0 starts about a microsecond earlier than thread 1, giving it the somewhat unrealistic completion time of 0.414 msec; incidentally, in a single core/thread run the run time averages around 0.800 msec (please forgive me if my units are off, it is late).
My question is: is there a way in OpenMP to ensure that threads all start at the same time? Or would I have to bring in another library like pthreads to get this functionality?
The barrier statement in OpenMP, as in other languages, ensures that no thread progresses until all threads reach the barrier.
It does not specify the order in which threads begin to execute again. As far as I know, manually scheduling threads is not possible with OpenMP or the pthreads library.
