I have some code that I want to optimise, and it should run with a varying number of threads. After running some tests using different scheduling techniques on a for loop that I have, I came to the conclusion that what suits best is dynamic scheduling when I have only one thread and guided otherwise. Is that even possible in OpenMP?
To be more precise, I want to be able to do something like the following:
if(omp_get_max_threads() > 1)
    #pragma omp parallel for .... schedule(guided)
else
    #pragma omp parallel for .... schedule(dynamic)
for(.....){
    ...
}
If anyone can help me, I would appreciate it. The other solution would be to write the for loop twice inside an if condition, but I want to avoid that if possible.
A possible solution is to copy the loop into an if statement and "extract" the loop body into a function, to avoid breaking the DRY principle. Then there is only one place where you have to change this code if you need to change it in the future:
void foo(....)
{
    ...
}

if(omp_get_max_threads() > 1)
{
    #pragma omp parallel for .... schedule(guided)
    for (.....)
        foo(....);
}
else
{
    #pragma omp parallel for .... schedule(dynamic)
    for (.....)
        foo(....);
}
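For concreteness, a minimal sketch of that pattern with the placeholders filled in by a hypothetical array-processing loop (foo, process, data and n are invented here for illustration):

#include <omp.h>

// Hypothetical loop body, extracted so both variants share it.
static void foo(double *data, int i)
{
    data[i] *= 2.0;  // stand-in for the real work
}

void process(double *data, int n)
{
    int i;
    if (omp_get_max_threads() > 1) {
        #pragma omp parallel for schedule(guided)
        for (i = 0; i < n; i++)
            foo(data, i);
    } else {
        #pragma omp parallel for schedule(dynamic)
        for (i = 0; i < n; i++)
            foo(data, i);
    }
}

If your compiler supports OpenMP 3.0, another option is schedule(runtime), which defers the choice to omp_set_schedule() or the OMP_SCHEDULE environment variable, so the loop only has to be written once.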
I am trying to parallelize a script using OpenMP, but when I measure its execution time (using omp_get_wtime) the results are pretty odd:
- with the number of threads set to 2 it measures 4935 µs,
- setting it to 1 takes around 1083 µs,
- and removing every OpenMP directive turns that into only 9 µs.
Here's the part of the script I'm talking about (this loop is nested inside another one):
for(j = (i-1); j >= 0; j--){
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if(arreglo[j] > y){
                arreglo[j+2] = arreglo[j];
            }
            else if(arreglo[j] > x){
                if(!flag[1]){
                    arreglo[j+2] = y;
                    flag[1] = 1;
                }
                arreglo[j+1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if(arreglo[j] <= x){
                arreglo[j+1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1){ break; }
}
What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?
- We are talking about a really short execution time, which can easily be affected by the environment used for the benchmark;
- you are clearly using an input size that does not justify the overhead of the parallelism;
- your current design only allows for 2 threads, so there is no room for scaling;
- instead of using the single construct, you might as well statically divide the two code branches based upon the thread ID (see the sketch after this list), saving the overhead of the single construct;
- that last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
Furthermore, your code looks intrinsically sequential, and with the current design it is clearly not suitable for parallelism.
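A minimal sketch of the thread-ID idea from the list above (it reuses the question's variables; the surrounding loop and declarations are assumed):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        // work previously inside the first single construct
        if (arreglo[j] > y) {
            arreglo[j+2] = arreglo[j];
        } else if (arreglo[j] > x) {
            if (!flag[1]) {
                arreglo[j+2] = y;
                flag[1] = 1;
            }
            arreglo[j+1] = arreglo[j];
        }
    } else {
        // work previously inside the second single construct
        if (arreglo[j] <= x) {
            arreglo[j+1] = x;
            flag[0] = 1;
            a = 1;
        }
    }
}

This removes the two single constructs, though it does not change the fundamental problem: the workload is far too small to amortize the cost of the parallel region.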
if I set the number of threads to 2 it measures 4935 µs, setting it to 1 takes around 1083 µs, and removing every OpenMP directive turns that into only 9 µs
With 2 threads you are paying all that synchronization overhead; with 1 thread you are paying the price of having the OpenMP runtime there. Finally, without the parallelization you removed all that overhead, hence the lower execution time.
By the way, you do not need to remove the OpenMP directives: just compile the code without the -fopenmp flag and the directives will be ignored.
I am studying parallel programming and I use the following OpenMP directive to parallelize a recursive function:
void recursiveFunction()
{
    // sequential code

    #pragma omp task
    {
        recursiveFunction(); // First instance
    }
    // The two instances are independent from each other,
    // so they allow an embarrassingly parallel strategy.
    recursiveFunction(); // Second instance
}
It works well enough, but I am having a hard time trying to write an equivalent parallelization using only pthreads.
I was thinking something like this:
void recursiveFunction()
{
    // sequential code

    pthread_t thread;
    // First instance
    pthread_create(&thread, NULL, recursiveFunction, recFuncStructParameter);
    // Second instance
    recursiveFunction();
}
And... I am kind of lost here... I cannot grasp how to control the number of threads: if, for example, I want only 16 threads to be created, and all of them are "busy", the code should continue sequentially until one of them is freed, then go parallel again.
Could someone point me in the right direction? I have seen many examples which seem really complicated, but I have a feeling that in this particular case, which allows an embarrassingly parallel strategy, there is a simple approach that I am unable to pin down...
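For what it's worth, one common pattern is a shared counter of live threads: reserve a slot before spawning, and fall back to a plain sequential call when the cap is reached. A minimal sketch assuming C11 atomics (thread_entry, active_threads and MAX_THREADS are names invented here; the void* parameter stands in for your recFuncStructParameter):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 16

void recursiveFunction(void *arg);

static atomic_int active_threads = 1;  // the main thread counts as one

// Adapter so recursiveFunction can be a pthread start routine; it
// releases the thread slot once its subtree of calls is finished.
static void *thread_entry(void *arg)
{
    recursiveFunction(arg);
    atomic_fetch_sub(&active_threads, 1);
    return NULL;
}

void recursiveFunction(void *arg)
{
    // sequential code

    pthread_t thread;
    int spawned = 0;

    // Try to reserve a slot; give it back if the cap was exceeded
    // or thread creation failed.
    if (atomic_fetch_add(&active_threads, 1) < MAX_THREADS) {
        spawned = (pthread_create(&thread, NULL, thread_entry, arg) == 0);
        if (!spawned)
            atomic_fetch_sub(&active_threads, 1);
    } else {
        atomic_fetch_sub(&active_threads, 1);
    }

    if (!spawned)
        recursiveFunction(arg);  // no free thread: first instance runs sequentially

    recursiveFunction(arg);      // second instance always runs in this thread

    if (spawned)
        pthread_join(thread, NULL);
}

When all 16 slots are busy the recursion simply proceeds sequentially in the calling thread, which is exactly the fallback behaviour you describe. A proper thread pool with a work queue would reuse threads instead of creating new ones, but that is considerably more code.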
Hi, I am new to parallel programming, and while reading about it I came across a code template in C. Can you please explain what these lines mean, line by line?
#include <omp.h>

main () {
    int var1, var2, var3;

    Serial code
    .
    .
    .

    Beginning of parallel section. Fork a team of threads.
    Specify variable scoping

    #pragma omp parallel private(var1, var2) shared(var3)
    {
        Parallel section executed by all threads
        .
        .
        .
        All threads join master thread and disband
    }

    Resume serial code
    .
    .
    .
}
First, I'd like to say this is a bad place to ask a question, because you've obviously not done any research into the matter yourself (especially as this template is pretty self-explanatory).
I'll explain it briefly for you, however:
#include <omp.h>
Allows access to the OpenMP runtime library so you can use its functions.
main () {
Please tell me you understand this line. It's very bad practice not to give main a return type, however; you should really write int main.
int var1, var2, var3;
Declares three integers.
Where Serial code is written, just read normal code, all of it executing on one thread/processor.
#pragma omp parallel private(var1, var2) shared(var3)
This line is perhaps the most important. It basically says that the code in the next set of { } can be executed in parallel. private(var1, var2) means that each thread gets its own copy of these variables (i.e. thread 1's var1 is not the same as thread 2's var1), and shared(var3) means that var3 is the same on all threads (if it changes on thread 1, it also changes on thread 2).
The code executes in parallel until that closing } is reached, at which point execution returns to normal serial operation on one thread.
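If it helps, here is a minimal runnable sketch of private versus shared (not the original template; var2 is dropped for brevity):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int var1 = 0, var3 = 0;

    #pragma omp parallel private(var1) shared(var3)
    {
        var1 = omp_get_thread_num();  // each thread writes its own copy
        #pragma omp atomic
        var3 += 1;                    // all threads update one shared counter
    }

    printf("var3 = %d (one increment per thread)\n", var3);
    return 0;
}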
You should really read some basic OpenMP tutorials, which you can find anywhere on the internet with a simple Google search.
I hope this gets you started though.
I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
My MPI program creates 8 threads, 1 of which should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);

#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while(!stop)
        {
            // Vector scatter with MPI Send/Recv
            // Check stop criteria
        }
    }

    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while(!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptional amount of time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried forcing MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE (requested at initialization; a sketch follows these two points) to see if I can convince MPI that it's already safe, but this had no effect. Is there something I'm missing?
2) My computation thread updates values used by the communications (it's actually a deliberate race condition, as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be that I'm facing cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
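For reference, since I mention the threading levels in (1): the level is requested when MPI is initialized, along these lines (plain MPI shown; PETSc is left out of the sketch):

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    // Ask for full thread support; 'provided' reports what the library grants.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // the MPI library cannot guarantee thread safety at this level
    }
    /* ... rest of the program ... */
    MPI_Finalize();
    return 0;
}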
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional information: the vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation may not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecvs beneath the surface.
I have the following code:
#pragma omp parallel sections private(x,y,cpsrcptr) firstprivate(srcptr) lastprivate(srcptr)
{
    #pragma omp section
    {
        //stuff
    }
    #pragma omp section
    {
        //stuff
    }
}
According to the Zoom profiler, two threads are created, one thread executes both sections, and the other simply blocks!
Has anyone encountered anything like this before? (And yes, I do have a dual core machine).
I guess I don't know too much about profilers yet, but one problem I've run into is forgetting to pass the compiler flag that enables OpenMP support.
Alternatively, what if you just created a simple application to try to verify the threads?
#pragma omp parallel num_threads(2)
{
    #pragma omp critical
    std::cout << "hello from thread: " << omp_get_thread_num() << std::endl;
}
Maybe see if that works?
No, I can't say that I have encountered anything quite like this before. I have encountered a variety of problems with OpenMP codes though.
I can't see anything immediately wrong with your code snippet. When you use the Zoom profiler it affects the execution of the program. Have you checked that, outside the profiler, the program runs the sections on different threads? If you have more sections, do they all run on the same thread or do they run on different threads? If you only have two sections, add some dummy ones while you test this.
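For example, a quick test along those lines, with dummy sections that just report which thread ran them (a sketch; compile with OpenMP enabled):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        { printf("section 1 on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { printf("section 2 on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { printf("section 3 on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { printf("section 4 on thread %d\n", omp_get_thread_num()); }
    }
    return 0;
}

If every line reports thread 0, the sections are being serialized, which points at the compiler flags or the runtime rather than your code.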