OpenMP impact on performance - C

I am trying to parallelize a program using OpenMP, but when I measure its execution time (using omp_get_wtime) the results are pretty odd:
if I set the number of threads to 2, it measures 4935 µs;
setting it to 1 takes around 1083 µs;
and removing every OpenMP directive turns that into only 9 µs.
Here is the part of the program I am talking about (this loop is nested inside another one):
for (j = i - 1; j >= 0; j--) {
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if (arreglo[j] > y) {
                arreglo[j + 2] = arreglo[j];
            }
            else if (arreglo[j] > x) {
                if (!flag[1]) {
                    arreglo[j + 2] = y;
                    flag[1] = 1;
                }
                arreglo[j + 1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if (arreglo[j] <= x) {
                arreglo[j + 1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1) { break; }
}
What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?

We are talking about a really short execution time, which can easily be affected by the environment used for the benchmark.
You are clearly using an input size that does not justify the overhead of the parallelism.
Your current design only allows for 2 threads, so there is no room for scaling.
Instead of using the single construct, you might as well statically divide those two code branches based on the thread ID; you would save the overhead of the single construct (a sketch of that split follows below).
That last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
Furthermore, your code looks intrinsically sequential, and with the current design it is clearly not suitable for parallelism.
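For example, a minimal sketch of that thread-ID split, reusing the variables from the question (arreglo, flag, x, y, j, a) and assuming <omp.h> is included; illustrative only, not a drop-in replacement:
#pragma omp parallel num_threads(2)
{
    int tid = omp_get_thread_num();   /* requires <omp.h> */

    if (tid == 0) {
        /* first branch, previously the first single construct */
        if (arreglo[j] > y) {
            arreglo[j + 2] = arreglo[j];
        } else if (arreglo[j] > x) {
            if (!flag[1]) {
                arreglo[j + 2] = y;
                flag[1] = 1;
            }
            arreglo[j + 1] = arreglo[j];
        }
    } else {
        /* second branch, previously the second single construct */
        if (arreglo[j] <= x) {
            arreglo[j + 1] = x;
            flag[0] = 1;
            a = 1;
        }
    }
}   /* implicit barrier at the end of the parallel region */
Since arreglo[j] > x and arreglo[j] <= x are mutually exclusive, the two branches never write arreglo[j + 1] in the same iteration, so the split does not introduce a race.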
if I set the number of threads to 2 it measures 4935 µs, setting it to 1 takes around 1083 µs, and removing every OpenMP directive turns that into only 9 µs
With 2 threads you are paying all that synchronization overhead; with 1 thread you are paying the price of having the OpenMP machinery there. Finally, without the parallelization you removed all that overhead, hence the lower execution time.
By the way, you do not need to remove the OpenMP directives: just compile the code without the -fopenmp flag and the directives will be ignored.

Related

Is there an analog of `mpirun -n` for a function/subroutine?

I'd like to benchmark OpenMP and MPI. I already have a function which is properly implemented as a serial function, as a function using OpenMP, and as a function using MPI.
Now, I would like to benchmark the implementations and see how it scales with the number of threads.
I could do it by hand in the following way:
$./my_serial
which returns the average execution time for some computations,
then $./my_OMP <number of threads>, where I pass the number of threads to a directive #pragma omp parallel for private(j) shared(a) num_threads(nthreads) (with j a running variable and a an array of doubles), which also returns the average execution time.
I automated the part that calls my_OMP with an increasing number of threads. Now I would like to implement something similar using a function that is based on MPI. Again, by hand I would do:
$mpirun -np <number_of_threads> ./my_mpi,
which would return the average computation time, and then I would increase the number of threads.
So, instead of calling my_mpi several times by hand and noting down the return value, I'd like to automate this, i.e. I'd like to have a loop that increases the number of threads and stores the return values in an array.
How can I implement that?
I thought about running the program once with $mpirun -np MAX_NUMBER_OF_PARALLEL_THREADS ./my_mpi and then limiting the number of parallel threads to 1 for the serial and OpenMP parts and increasing the limit of parallel threads in a loop for the actual MPI part, but I don't know how to do this.
You're misunderstanding how MPI works. MPI is process based, so mpirun -n 27 yourprogram starts 27 instances of your executable, and they communicate through OS means. Thus there is no "serial part": each process executes every single statement in your source.
What you could do is:
give your timed routine a communicator argument,
make a series of subcommunicators of increasing size, and
call your timed routine with those subcommunicators, as sketched below.
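A minimal sketch of that approach; timed_routine and its dummy body are placeholders of my own, and error handling is omitted:
#include <mpi.h>
#include <stdio.h>

/* Stand-in for the routine being benchmarked: it must only use the
 * communicator it is given. Here it just does a trivial reduction. */
void timed_routine(MPI_Comm comm)
{
    int rank, sum;
    MPI_Comm_rank(comm, &rank);
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, comm);
}

int main(int argc, char **argv)
{
    int world_rank, world_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    for (int n = 1; n <= world_size; n++) {
        /* Ranks 0..n-1 join the subcommunicator; the rest sit out. */
        int color = (world_rank < n) ? 0 : MPI_UNDEFINED;
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);

        if (sub != MPI_COMM_NULL) {
            double t0 = MPI_Wtime();
            timed_routine(sub);
            double t1 = MPI_Wtime();
            if (world_rank == 0)
                printf("%d processes: %f s\n", n, t1 - t0);
            MPI_Comm_free(&sub);
        }
        MPI_Barrier(MPI_COMM_WORLD);   /* keep the runs separated */
    }

    MPI_Finalize();
    return 0;
}
Run once with mpirun -np MAX ./program and it times the routine on 1, 2, ..., MAX processes within a single job.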

In openMP how do I ensure threads are synchronized before continuing?

I am using a #pragma omp barrier to ensure that all my parallel threads meet up at the same point before continuing (no fancy conditionally branching code, just a straight loop), but I am surmising that the barrier pragma does not actually guarantee synchronicity, just completion, as these are the results I am getting:
0: func() size: 64 Time: 0.000414 Start: 1522116688.801262 End: 1522116688.801676
1: func() size: 64 Time: 0.000828 Start: 1522116688.801263 End: 1522116688.802091
Thread 0 starts about a microsecond earlier than thread 1, giving it the somewhat unrealistic completion time of 0.414 ms; incidentally, in a single core/thread run the run time averages around 0.800 ms (please forgive me if my units are off, it is late).
My question is: is there a way in OpenMP to ensure that the threads all start at the same time? Or would I have to bring in another library like pthreads in order to have this functionality?
The barrier statement in OpenMP, as in other languages, ensures that no thread progresses until all threads have reached the barrier.
It does not specify the order in which threads begin executing again. As far as I know, manually scheduling threads is not possible in OpenMP or the pthreads library.
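For illustration, a small sketch (my own, not the asker's code) that times each thread with omp_get_wtime after a barrier: the barrier guarantees that every thread has arrived before any proceeds, but not that they resume in lock-step afterwards:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();

        /* Every thread waits here until all threads have arrived... */
        #pragma omp barrier

        /* ...but the exact instant (and order) at which each thread
         * resumes afterwards is up to the runtime and the OS scheduler. */
        double start = omp_get_wtime();
        double s = 0.0;
        for (int i = 0; i < 1000000; i++)
            s += i * 1e-6;                       /* dummy work */
        double end = omp_get_wtime();

        printf("%d: Time: %f Start: %f End: %f (s=%g)\n",
               tid, end - start, start, end, s);
    }
    return 0;
}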

MPI calls are slow in an OpenMP section

I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_MAX_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
My MPI program creates 8 threads, 1 of which should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);
#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while (!stop)
        {
            // Vector scatter with MPI Send/Recv
            // Check stop criteria
        }
    }
    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while (!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptionally long time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried forcing MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE to see if I can convince MPI that it's already safe, but this had no effect (see the initialisation sketch after this question). Is there something I'm missing?
2) My computation thread updates values used by the communications (it's actually a deliberate race condition, as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be that I'm facing cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional information: the vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation may not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecv underneath.
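For reference, the thread-support level is normally requested at initialisation rather than forced afterwards; a minimal plain-MPI sketch (independent of the PETSc setup above, where PetscInitialize usually takes care of initialising MPI):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full thread support; the library reports what it can give. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "warning: MPI only provides thread level %d\n",
                provided);

    /* ... hybrid MPI + OpenMP work would go here ... */

    MPI_Finalize();
    return 0;
}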

Cost of a locked merge operation increases with the number of threads

I've written a program that executes some calculations and then merges the results.
I've used multi-threading to do the calculations in parallel.
During the merge phase, each thread locks the global array, appends its own partial result to it, and does some extra work to eliminate duplicates.
I tested it and found that the cost of merging increases with the number of threads, at an unexpected rate:
2 threads: 40,116,084 µs
6 threads: 511,791,532 µs
Why does this happen as the number of threads increases, and how can I change it?
--------------------------------------------------------------------------------
Update: the code is actually very simple; here is the pseudo-code:
typedef struct my_object {
    long   no;
    int    count;
    double value;
    // something else
} my_object_t;

static my_object_t **global_result_array;   // about ten thousand entries
static pthread_mutex_t global_lock;

void *thread_function(void *arg)
{
    my_object_t **local_result;
    int local_result_number;
    int i;
    my_object_t *ptr;

    for (;;) {
        if (exit_condition) { return NULL; }

        if (merge_condition) {
            // start time point to log
            pthread_mutex_lock(&global_lock);
            for (i = local_result_number - 1; i >= 0; i--) {   // note: i--, not i++
                ptr = local_result[i];
                if (NULL == global_result_array[ptr->no]) {
                    global_result_array[ptr->no] = ptr;                     // step 4
                } else {
                    global_result_array[ptr->no]->count += ptr->count;      // step 5
                    global_result_array[ptr->no]->value += ptr->value;      // step 6
                }
            }
            pthread_mutex_unlock(&global_lock);   // end time point to log
        } else {
            // do some calculation and produce the partial, thread-local result,
            // namely local_result and local_result_number
        }
    }
}
As shown above, the only difference between the two-thread and six-thread runs lies in steps 5 and 6; I counted on the order of hundreds of millions of executions of steps 5 and 6. Everything else is the same.
So, from my point of view, the merge operation is very light: whether using 2 threads or 6, each thread needs to take the lock and do the merge exclusively.
Another surprising thing: when using six threads, the cost of step 4 exploded, and that is the root cause of the explosion of the total cost.
By the way: the test server has two CPUs, each with four cores.
There are various reasons for the behaviour you are seeing:
More threads mean more locks and more time spent blocked waiting on other threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up from threads is better if the data sets are largely exclusive.
Unless your system has as many processors/cores as the number of threads, they cannot all run concurrently. You can set the maximum concurrency hint using pthread_setconcurrency (a small sketch follows below).
Context switching is an overhead, hence the difference. If your computer had 6 cores it would be faster; otherwise you need more context switches for the threads.
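For completeness, a tiny sketch of that call; note that pthread_setconcurrency is only a hint, and many modern implementations (e.g. NPTL on Linux) effectively ignore it:
#define _XOPEN_SOURCE 700   /* exposes pthread_setconcurrency on glibc */
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    /* Hint that we expect roughly 6 runnable threads. */
    int err = pthread_setconcurrency(6);
    if (err != 0)
        fprintf(stderr, "pthread_setconcurrency: error %d\n", err);

    printf("concurrency hint now: %d\n", pthread_getconcurrency());
    return 0;
}
Compile with -pthread.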
This is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to make such a huge discrepancy. You seem to have succeeded :(
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication (locks etc.) is less than the time gained from the concurrent operations.
If, for example, you find that you are merging successively smaller data sections (e.g. with a merge sort), you are effectively maximising the time wasted on inter-thread comms and cache thrashing. This is why multi-threaded merge sorts are frequently started with an in-place sort once the data has been divided into chunks smaller than the L1 cache.
'Each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking and releasing them for successive short periods, is a very bad plan. Locking the global array once serializes the threads, leaving you with effectively one thread plus locking overhead; continually locking/releasing it leaves you with one thread and far, far too much inter-thread comms.
Once the operations get so short that the returns diminish to the point of uselessness, you are better off queueing those operations to one thread that finishes off the job on its own (one variant of that idea is sketched below).
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery.
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(
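As an illustration of handing the merge to a single thread, here is a minimal sketch with invented names and dummy data (not the asker's code): each worker fills a private buffer, and the main thread merges everything once after joining, so no lock is ever contended during the computation:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 6
#define NLOCAL   1000

typedef struct { long no; int count; double value; } item_t;
typedef struct { item_t items[NLOCAL]; int n; } local_result_t;

/* Each worker produces only thread-local results; no shared state. */
static void *worker(void *arg)
{
    local_result_t *local = arg;
    for (int i = 0; i < NLOCAL; i++) {
        local->items[i].no    = i;            /* dummy computation */
        local->items[i].count = 1;
        local->items[i].value = i * 0.5;
    }
    local->n = NLOCAL;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    static local_result_t locals[NTHREADS];
    static item_t *global[NLOCAL];            /* zero-initialised */

    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, &locals[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    /* Single-threaded merge: no mutex needed at all. */
    for (int t = 0; t < NTHREADS; t++) {
        for (int i = 0; i < locals[t].n; i++) {
            item_t *p = &locals[t].items[i];
            if (global[p->no] == NULL) {
                global[p->no] = p;
            } else {
                global[p->no]->count += p->count;
                global[p->no]->value += p->value;
            }
        }
    }

    printf("merged entry 0: count=%d value=%f\n",
           global[0]->count, global[0]->value);
    return 0;
}
If results must be merged periodically rather than at the end, the same idea applies: have the workers hand whole batches to one merger thread, so the lock is only held for the hand-off, not for the merge itself.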

Choose OpenMP pragma according to condition

I have some code that I want to optimise, which should run with a variety of thread counts. After running some tests using different scheduling techniques on a for loop that I have, I came to the conclusion that what suits best is to perform dynamic scheduling when I have only one thread and guided scheduling otherwise. Is that even possible in OpenMP?
To be more precise, I want to be able to do something like the following:
if(omp_get_max_threads()>1)
    #pragma omp parallel for .... schedule(guided)
else
    #pragma omp parallel for .... schedule(dynamic)
for(.....){
    ...
}
If anyone can help me, I would appreciate it. The other solution would be to write the for loop twice and use an if condition, but I want to avoid that if it is possible.
A possible solution is to copy the loop into an if statement and to "extract" the loop body into a function, to avoid breaking the DRY principle. Then there will be only one place where you have to change this code if you need to change it in the future:
void foo(....)
{
    ...
}

if (omp_get_max_threads() > 1)
{
    #pragma omp parallel for .... schedule(guided)
    for (.....)
        foo(....);
}
else
{
    #pragma omp parallel for .... schedule(dynamic)
    for (.....)
        foo(....);
}
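Another option, not part of the original answer and offered only as a suggestion: keep a single copy of the loop with schedule(runtime) and choose the actual schedule beforehand with omp_set_schedule (foo and n are placeholders standing in for the real loop body and bound):
#include <omp.h>

void foo(int i);   /* the extracted loop body */

void run_loop(int n)
{
    /* Select the schedule at run time; schedule(runtime) then uses it. */
    if (omp_get_max_threads() > 1)
        omp_set_schedule(omp_sched_guided, 0);    /* 0 = default chunk size */
    else
        omp_set_schedule(omp_sched_dynamic, 0);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        foo(i);
}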
