In OpenMP, how do I ensure threads are synchronized before continuing? - c

I am using a #pragma omp barrier to ensure that all my parallel threads meet up at the same point before continuing (no fancy conditional branching, just a straight loop), but I suspect that the barrier pragma does not actually guarantee simultaneity, just completion, because these are the results I am getting:
0: func() size: 64 Time: 0.000414 Start: 1522116688.801262 End: 1522116688.801676
1: func() size: 64 Time: 0.000828 Start: 1522116688.801263 End: 1522116688.802091
Thread 0 starts about a microsecond before thread 1, giving it the somewhat unrealistic completion time of 0.414 msec; incidentally, a single-core/single-thread run averages around 0.800 msec. (Please forgive me if my units are off, it is late.)
My question is: is there a way in OpenMP to ensure that threads all start at the same time? Or would I have to bring in another library like pthreads to get this functionality?

The barrier statement in OpenMP, as in other parallel programming models, ensures that no thread progresses until all threads have reached the barrier.
It does not specify the order in which threads resume execution afterwards. As far as I know, manually scheduling threads is not possible with either OpenMP or the pthreads library.
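A minimal sketch of the distinction, timing each thread's restart with omp_get_wtime():

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        /* ... per-thread setup work ... */
        #pragma omp barrier              /* all threads meet here...       */
        double start = omp_get_wtime();  /* ...but may resume in any order */
        printf("thread %d resumed at %f\n", omp_get_thread_num(), start);
    }
    return 0;
}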

Related

Is there an analog of `mpirun -n` for a function/subroutine?

I'd like to benchmark OpenMP and MPI. I already have a function that is properly implemented as a serial version, as an OpenMP version, and as an MPI version.
Now I would like to benchmark the implementations and see how each scales with the number of threads.
I could do it by hand in the following way:
$./my_serial
which returns the average execution time for some computations,
then $./my_OMP <number of threads>, where the number of threads is passed to #pragma omp parallel for private(j) shared(a) num_threads(nthreads), with j a loop variable and a an array of doubles; this also returns the average execution time.
I automated the part that calls my_OMP with an increasing number of threads. Now I would like to implement something similar using the MPI-based function. Again, by hand I would do:
$mpirun -np <number_of_threads> ./my_mpi,
which would return me the average computation time and then I would increase the number of threads.
So, instead of calling my_mpi several times by hand and noting down the return value, I'd like to automate this, i.e. I'd like to have a loop that increases the number of threads and stores the return values in an array.
How can I implement that?
I thought about running the program once with $mpirun -np MAX_NUMBER_OF_PARALLEL_THREADS ./my_mpi and then limiting the number of parallel threads to 1 for the serial and OMP part and increasing the limit of parallel threads in a loop for the actual MPI part, but I don't know how to do this.
You're misunderstanding how MPI works. MPI is process based, so mpirun -n 27 yourprogram starts 27 instances of your executable, and they communicate through OS means. Thus there is no "serial part": each process executes every single statement in your source.
What you could do is:
1. give your timed routine a communicator argument,
2. make a series of subcommunicators of increasing size, and
3. call your timed routine with those subcommunicators, as in the sketch below.
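A sketch of that approach with MPI_Comm_split; my_timed_routine here is a stand-in for your own MPI function:

#include <mpi.h>
#include <stdio.h>

double my_timed_routine(MPI_Comm comm)   /* stand-in for your implementation */
{
    double t0 = MPI_Wtime();
    int x = 1, sum;
    MPI_Allreduce(&x, &sum, 1, MPI_INT, MPI_SUM, comm);
    return MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int np = 1; np <= size; np++) {
        /* ranks below np join the subcommunicator; the rest sit out */
        int color = (rank < np) ? 0 : MPI_UNDEFINED;
        MPI_Comm subcomm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

        if (subcomm != MPI_COMM_NULL) {
            double t = my_timed_routine(subcomm);
            if (rank == 0)
                printf("np=%d time=%f\n", np, t);
            MPI_Comm_free(&subcomm);
        }
        MPI_Barrier(MPI_COMM_WORLD);   /* keep the iterations in lockstep */
    }

    MPI_Finalize();
    return 0;
}

Run it once with mpirun -np MAX ./bench and it collects one timing per subcommunicator size.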

OpenMP impact on performance

I am trying to parallelize a script using OpenMP, but when I measure its execution time (using omp_get_wtime()) the results are pretty odd:
if I set the number of threads to 2 it measures 4935 us,
setting it to 1 takes around 1083 us,
and removing every OpenMP directive turns that into only 9 us.
Here's the part of the script I'm talking about (this loop is nested inside another one)
for (j = i - 1; j >= 0; j--) {
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if (arreglo[j] > y) {
                arreglo[j+2] = arreglo[j];
            }
            else if (arreglo[j] > x) {
                if (!flag[1]) {
                    arreglo[j+2] = y;
                    flag[1] = 1;
                }
                arreglo[j+1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if (arreglo[j] <= x) {
                arreglo[j+1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1) { break; }
}
What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?
We are talking about a really short execution time, which can easily be affected by the benchmarking environment.
You are clearly using an input size that does not justify the overhead of the parallelism.
Your current design only allows for 2 threads, so there is no room for scaling.
Instead of using the single construct, you might as well statically divide those two code branches based upon the thread ID (see the sketch below); you would save the overhead of the single construct.
That last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
Furthermore, your code looks intrinsically sequential, and with the current design it is clearly not suitable for parallelism.
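A minimal sketch of that thread-ID split, assuming two threads and reusing the question's variables (omp.h is needed for omp_get_thread_num()):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        /* what the first single did */
        if (arreglo[j] > y) {
            arreglo[j+2] = arreglo[j];
        }
        else if (arreglo[j] > x) {
            if (!flag[1]) {
                arreglo[j+2] = y;
                flag[1] = 1;
            }
            arreglo[j+1] = arreglo[j];
        }
    }
    else {
        /* what the second single did */
        if (arreglo[j] <= x) {
            arreglo[j+1] = x;
            flag[0] = 1;
            a = 1;
        }
    }
}   /* the implicit barrier of the parallel region is enough */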
if i set the number of threads to 2 it measures 4935 us setting it to
1 takes around 1083 us and removing every openmp directive turns that
into only 9 us
With 2 threads you are paying all that synchronization overhead; with 1 thread you are paying the price of having the OpenMP runtime there. Finally, without the parallelization you removed all that overhead, hence the lower execution time.
By the way, you do not need to remove the OpenMP directives: just compile the code without the -fopenmp flag and the directives will be ignored.

MPI calls are slow in an OpenMP section

I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
My MPI program creates 8 threads, 1 of which should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);
#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while (!stop)
        {
            // Vector scatter with MPI Send/Recv
            // Check stop criteria
        }
    }
    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while (!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptional amount of time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried forcing MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE to see if I can convince MPI that it's already safe, but this had no effect. Is there something I'm missing?
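For reference, the thread-support level is requested at initialization; mine looks roughly like this sketch:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);  /* or SINGLE/MULTIPLE */
if (provided < MPI_THREAD_FUNNELED)
    fprintf(stderr, "MPI library only provides thread level %d\n", provided);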
2) My computation thread updates values used by the communications (it's actually a deliberate race condition, as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be that I'm facing cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional information: the vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation may not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecvs beneath the surface.
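For reference, the scatter pair looks roughly like this (the identifiers are placeholders); PETSc splits the operation into Begin/End precisely so that independent local work can overlap the communication:

VecScatterBegin(ctx, x_global, x_local, INSERT_VALUES, SCATTER_FORWARD);
/* ... independent local computation could go here ... */
VecScatterEnd(ctx, x_global, x_local, INSERT_VALUES, SCATTER_FORWARD);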

How to block a thread with nops or a low-power state instead of switching it off the processor

I'm writing a user-space program in which I want to "block" a thread at some point. If I use a mutex-like function, the thread would be switched off the processor. What I want is to keep the thread on the processor, without any context switch involved, in a low-power state or looping on nop operations. And at some later time, another thread could "wake" it up, "unblocking" it from the nop or low-power state to continue execution. Is there any function or library I can use?
It's a good and valid question. Spin-loops like the one you describe can use the pause instruction to give the paired hyper-thread more resources and to enable power-saving optimizations.
E.g.
while (condition) _mm_pause();  /* _mm_pause() is declared in <immintrin.h> */
If evaluating the condition consumes more resources than necessary, repeat the pause several times. E.g. tbb::spin_mutex uses a backoff algorithm in which each failed condition check doubles the number of pause iterations before the next evaluation.
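A minimal C sketch of that backoff idea on x86 (flag here is a hypothetical atomic that the waking thread sets to 1):

#include <immintrin.h>   /* _mm_pause() */
#include <stdatomic.h>

void spin_wait(atomic_int *flag)
{
    int backoff = 1;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        for (int i = 0; i < backoff; i++)
            _mm_pause();             /* keep the core in a low-power spin */
        if (backoff < 1024)
            backoff *= 2;            /* double the pause count after each failed check */
    }
}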

Pseudocode for Producer-Consumer Synchronization

I'm having some trouble writing pseudocode for a homework assignment in my operating systems class, in which we are programming in C.
You will be implementing a Producer-Consumer program with a bounded buffer queue of N elements, P producer threads and C consumer threads
(N, P and C should be command line arguments to your program, along with three additional parameters, X, Ptime and Ctime, that are described below). Each
Producer thread should Enqueue X different numbers onto the queue (spin-waiting for Ptime*100,000 cycles in between each call to Enqueue). Each Consumer thread
should Dequeue P*X/C items from the queue (spin-waiting for Ctime*100,000 cycles
in between each call to Dequeue). The main program should create/initialize the
Bounded Buffer Queue, print a timestamp, spawn off C consumer threads & P
producer threads, wait for all of the threads to finish and then print off another
timestamp & the duration of execution.
My main difficulty is understanding what my professor means by spin-waiting for Ptime or Ctime times 100,000 cycles (the parenthesized parts of the quote above).
I understand a timestamp will be used to print the duration of execution. We are using semaphores and implementing synchronization at the moment. Any suggestions on the above would be much appreciated.
I'm guessing it means busy-waiting; repeatedly checking the loop condition and consuming unnecessary CPU power in a tight loop:
while (current_time() <= wake_up_time);
One would ideally use something that suspends your thread until it's woken up externally, by the scheduler (so resources such as the CPU can be diverted elsewhere):
sleep(2 * 60 * 1000 ms); /* pseudocode; real APIs differ (e.g. POSIX sleep() takes seconds) */
or at least give up some CPU (i.e. not be so tight):
while (current_time() <= wake_up_time)
    sleep(100 ms); /* pseudocode: yield the CPU between checks */
But I guess they don't want you to manually invoke the scheduler by hinting to the OS (or your threading library) that it's a good time to make a context switch.
I'm not sure what cycles are; in assembly they might be CPU cycles but given that your question is tagged C, I'll bet that they're simply loop iterations:
for (volatile int i = 0; i < Ptime*100000; ++i); /* spin-wait for Ptime*100,000 cycles; volatile keeps the compiler from deleting the empty loop */
Though it's always safest to ask whoever issued the homework.
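As for the queue itself, the classic bounded-buffer pattern with POSIX semaphores looks roughly like this sketch (N would come from the command line in the real assignment; thread creation and the spin-waits are left out):

#include <semaphore.h>
#include <pthread.h>

#define N 16                 /* hypothetical fixed capacity for the sketch */

int buffer[N];
int in = 0, out = 0;
sem_t empty, full;           /* init once: sem_init(&empty, 0, N); sem_init(&full, 0, 0); */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void enqueue(int item)
{
    sem_wait(&empty);        /* block while the buffer is full */
    pthread_mutex_lock(&lock);
    buffer[in] = item;
    in = (in + 1) % N;
    pthread_mutex_unlock(&lock);
    sem_post(&full);         /* one more item available */
}

int dequeue(void)
{
    sem_wait(&full);         /* block while the buffer is empty */
    pthread_mutex_lock(&lock);
    int item = buffer[out];
    out = (out + 1) % N;
    pthread_mutex_unlock(&lock);
    sem_post(&empty);        /* one more free slot */
    return item;
}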
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, such as whether keyboard input is available or whether a lock is available.
So the assignment says each producer thread should spin-wait for Ptime*100,000 cycles before each Enqueue, enqueueing X different elements in total.
Similarly, each consumer thread should dequeue P*X/C items from the queue, spin-waiting for Ctime*100,000 cycles after each item it consumes.
I suspect that your professor is being a complete putz - by actually ASKING for the worst "busy waiting" technique in existence:
int n = pTime * 100000;
for (volatile int i = 0; i < n; ++i) ; // waste some cycles (volatile, so the compiler doesn't delete the loop)
I also suspect that he still uses a pterosaur thigh-bone as a walking stick, has a very nice (dry) cave, and a partner with a large bald patch.... O/S guys tend to be that way. It goes with the cool beards.
No wonder his thoroughly modern students misunderstand him. He needs to (re)learn how to grunt IN TUNE.
Cheers. Keith.
