I have the following piece of code (note: this is to help me understand the concept, so it is example code, not something I intend to run):
List ml; //my_list
Element *e;
#pragma omp parallel
#pragma omp single
{
    for(e=ml->first;e;e=e->next)
        #pragma omp task
        process(e); //random function
}
It was mentioned that e will always have the same value, and I am trying to understand why that is and what that value will be.
My try/reasoning:
The value of e changes inside the omp single region, and these changes don't get carried into the task (if I am not mistaken, no value from the single region gets carried into the task unless I use something like firstprivate(e)). Moreover, the value e takes will be random, since we didn't initialize e to anything outside the single and parallel regions, and that is what e will take as its value; if we had initialized it outside to some value x, for example, then e would always be x.
Any help correcting or verifying my reasoning would be appreciated.
I have added some comments to your example to hopefully make the behavior a bit clearer.
List ml; //my_list
Element *e;
#pragma omp parallel // e is shared for the parallel region
#pragma omp single
{
    for(e=ml->first;e;e=e->next)
        #pragma omp task // since e is shared, all tasks will see the "same" e
        process(e);
}
What happens is this (indicated by the comments above): you're declaring e outside of the scope of the parallel constructs. As per the OpenMP specification, the variable will be shared across all the executing threads. The single construct then restricts execution to a single thread of the team (e is still shared across all the threads, see https://www.openmp.org/spec-html/5.1/openmpsu113.html#x148-1600002.21.1).
When the picked thread encounters the task construct, the OpenMP specification mandates that the task created from it inherits the data-sharing attribute of the e variable (shared), so all created tasks will see the same variable, and the picked thread may overwrite e while it executes the for loop.
That's where the firstprivate(e) comes in:
List ml; //my_list
Element *e;
#pragma omp parallel // e is shared for the parallel region
#pragma omp single
{
    for(e=ml->first;e;e=e->next)
        #pragma omp task firstprivate(e) // task now receives a private "copy" of e
        process(e);
}
Here, the created tasks will have a private copy of e that is initialized with the current value of e as the picked thread progresses through the for loop.
Another way to fix this would be this:
List ml; //my_list
#pragma omp parallel
#pragma omp single
{
    Element *e; // e is thread-private inside the parallel region
    for(e=ml->first;e;e=e->next)
        #pragma omp task // task now receives a private "copy" of e w/o firstprivate
        process(e);
}
In this example, the OpenMP specification mandates that the variable be treated as if you had specified firstprivate(e) (see https://www.openmp.org/spec-html/5.1/openmpsu113.html#x148-1610002.21.1.1).
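If you want to experiment with the two fixes, here is a minimal, self-contained sketch of the firstprivate version; the List/Element definitions, the value field and the list setup are invented purely so the example compiles and runs:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct Element { int value; struct Element *next; } Element; // made-up payload
typedef struct List    { Element *first; } List;                     // made-up list type

static void process(Element *e) {            // stand-in for the "random function"
    printf("thread %d processes element %d\n", omp_get_thread_num(), e->value);
}

int main(void) {
    List ml = { NULL };                       // build a small list 4 -> 3 -> 2 -> 1 -> 0
    for (int i = 0; i < 5; i++) {
        Element *n = malloc(sizeof *n);
        n->value = i;
        n->next  = ml.first;
        ml.first = n;
    }

    Element *e;
    #pragma omp parallel
    #pragma omp single
    {
        for (e = ml.first; e; e = e->next) {
            #pragma omp task firstprivate(e)  // remove firstprivate(e) to observe the shared-e problem
            process(e);
        }
    }
    return 0;
}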
Related
The result oftentimes looks wrong, because the 'bmmin' value after the parallelization seems to be wrong, or something like that...
#pragma omp parallel private(thread_id, bmmin, r, t, am, b, bm)
{
    thread_id=omp_get_thread_num();
    bmmin=INFINITY;
    for (i=0; i<nel; i++) {
        am=a[i]+ldigitse*j;
        b=roundl((lval-am)/ldigits0);
        bm=fabsl(am+b*ldigits0-lval);
        if (bm<bmmin) {
            bmmin=bm;
            t[0]=(int)b;
            r=ldigits[0]*t[0];
            for (l=1; l<ndig; l++) {
                t[l]=(*s)[i][l-1];
                r=r+ldigits[l]*t[l];
            };
            t[ndig]=j;
            r=r+ldigits[ndig]*t[ndig];
        };
    };
    // bmmin result looks almost same in many threads, why?
    printf("Thread %d: r=%Lg, bmmin=%Lg, bmmin_glob=%Lg\n",thread_id,powl(10,r),bmmin,bmmin_glob);
    #pragma omp critical
    if (bmmin<bmmin_glob) {
        printf("Thread %d - giving minimum r=%9Lg!\n",thread_id,powl(10,r));
        bmmin_glob=bmmin;
        r_glob=r;
        for (i=0; i<=ndig; i++) {
            t_glob[i]=t[i];
        };
    };
};
When running the code, it outputs as:
Initializing the table of the logarithmic constants...
Calculation started for k from 0 to 38...
j,k=-19,0
Thread 7: r=2.57008e+30, bmmin=2.96034e-05, bmmin_glob=inf
Thread 7 - giving minimum r=2.57008e+30!
Thread 1: r=3.74482e+16, bmmin=2.96034e-05, bmmin_glob=inf
Thread 6: r=3.74482e+16, bmmin=2.96034e-05, bmmin_glob=inf
Thread 3: r=3.1399, bmmin=0.000234018, bmmin_glob=inf
Thread 2: r=3.74482e+16, bmmin=2.96034e-05, bmmin_glob=inf
Thread 5: r=3.1399, bmmin=0.000234018, bmmin_glob=inf
Thread 4: r=392.801, bmmin=0.000113243, bmmin_glob=inf
Thread 0: r=3.14138, bmmin=2.96034e-05, bmmin_glob=2.96034e-05
Result: 2.57008e+30
Exponents: 2^129*3^-13*5^16*7^-19
j,k=-18,1
with a lot of cases that have bmmin=2.96034e-05, even though the r-value shows a lot of variation.
bmmin result looks almost same in many threads, why?
This is because bmmin is defined as a private variable in the parallel section of the code. In fact the same applies to thread_id and other variables like r. A private variable is a variable defined in and accessible only from each thread. If you want to make the result of each thread accessible to the main thread, then you need to store the value of each thread in an array. Alternatively, you can use OpenMP reductions.
[...] looks like the 'i' values are out of the range of the for loop
Variables are implicitly shared by default in parallel sections. This means i is shared by default, so there is a race condition on i. You need to make it private, or declare it inside the parallel section so that each thread has its own copy.
Note that an omp parallel section on its own does not share the work between threads: every thread executes the whole loop. You need to either use a parallel for or split the work yourself (e.g. dividing the nel iterations so that each thread computes a part of the loop, if that is what you want).
Besides this, #pragma omp critical does nothing outside a parallel section. It may be clearer to use two directives here: a #pragma omp parallel one to create the team and a #pragma omp for one to share the loop iterations among it.
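Putting these points together on a simplified stand-in problem (the val[] data below is made up and only mimics the bmmin/r/t bookkeeping, not your actual computation), a hedged sketch could look like this:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    double val[N];
    for (int i = 0; i < N; i++) val[i] = fabs(sin(i * 0.1) - 0.5);  // made-up data

    double best_glob = INFINITY;   // shared global minimum (plays the role of bmmin_glob)
    int    best_i_glob = -1;       // shared payload (plays the role of r_glob / t_glob)

    #pragma omp parallel
    {
        double best = INFINITY;    // per-thread minimum (like bmmin)
        int    best_i = -1;        // per-thread payload (like r and t[])

        #pragma omp for            // shares the iterations; i is private to each thread here
        for (int i = 0; i < N; i++) {
            if (val[i] < best) { best = val[i]; best_i = i; }
        }

        #pragma omp critical       // combine the per-thread results into the global ones
        if (best < best_glob) { best_glob = best; best_i_glob = best_i; }
    }

    printf("min = %g at i = %d\n", best_glob, best_i_glob);
    return 0;
}
Each thread keeps its own minimum over its share of the iterations, and the critical section only runs once per thread to merge the results.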
I am trying to optimize a code about image processing with OpenMP. I am still learning it and I came to the classic parallel for loop which is slower than my single threaded implementation.
The for loop I tried to parallelize has only 50 iterations, and I saw that this could explain why using OpenMP in this case could be useless due to the overhead of the OpenMP operations.
But is it possible to evaluate these costs? What are the cost differences between the shared / private / firstprivate (...) clauses?
This is the for loop I want to parallelize:
#pragma omp parallel for \
    shared(img_in, img_height, img_width, w, update_gauss, MoGInitparameters, nb_motion_bloc) \
    firstprivate(img_fg, gaussStruct) \
    private(ContrastHisto, j, x, y) \
    schedule(dynamic, 1) \
    num_threads(max_threads)
for(i = 0; i<h; i++){ // h=50
    for(j = 0; j<w; j++){
        ComputeContrastHistogram_generic1D_rect(img_in, H_DESC_STEP*i, W_DESC_STEP*j, ContrastHisto, H_DESC_SIZE, W_DESC_SIZE, 2, img_height, img_width);
        x = i;
        y = w - 1 - j;
        img_fg->data[y][x] = 1-MatchMoG_GaussianInt(ContrastHisto, gaussStruct->gauss[i*w+j], update_gauss, MoGInitparameters);
        #pragma omp critical
        {
            nb_motion_bloc += img_fg->data[y][x];
            nb_motion_bloc += img_fg->data[y][x];
        }
    }
}
Maybe I am making some mistakes, but if that's the case please tell me why!
Several points that don't address your specific questions.
Try to avoid #pragma omp critical. In this case, you can remove it altogether by adding a reduction(+:nb_motion_bloc) clause to the omp parallel for line and using nb_motion_bloc += 2*img_fg->data[y][x];.
Depending on how much work each iteration has to do, short loops incur (much) more overhead than they're worth.
Now to the questions. If you don't change any of the variables, don't bother classifying them as shared/private/firstprivate. If they are supposed to be used by each thread and then discarded, you can use a construct like this:
#pragma omp parallel
{
int x, y;
#pragma omp for
for(i = 0; i<h; i++)
{
...
}
}
If the workload is balanced, then consider using schedule(static).
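Putting the reduction and schedule(static) suggestions together on a toy stand-in (pixel_flag below is invented and merely plays the role of img_fg->data[y][x]):
#include <stdio.h>

int main(void) {
    enum { H = 50, W = 64 };       // H mirrors the h=50 of the original loop
    int nb_motion_bloc = 0;

    #pragma omp parallel for reduction(+:nb_motion_bloc) schedule(static)
    for (int i = 0; i < H; i++) {
        for (int j = 0; j < W; j++) {
            int pixel_flag = (i + j) % 2;       // stand-in for img_fg->data[y][x]
            nb_motion_bloc += 2 * pixel_flag;   // the 2* folds the two additions into one, no critical needed
        }
    }

    printf("nb_motion_bloc = %d\n", nb_motion_bloc);   // same result as the serial version
    return 0;
}
The reduction clause gives every thread a private accumulator and sums them at the end, so the critical section disappears entirely.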
As to the differences between shared, private and firstprivate, from Wikipedia (a short example follows these definitions):
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
default: allows the programmer to state that the default data scoping within a parallel region will be either shared, or none for C/C++, or shared, firstprivate, private, or none for Fortran. The none option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.
firstprivate: like private except initialized to original value.
lastprivate: like private except original value is updated after construct.
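Here is the short example promised above; it is only meant to make the firstprivate/lastprivate definitions concrete (compile with something like gcc -fopenmp):
#include <stdio.h>

int main(void) {
    int b = 100;   // will be firstprivate: each thread's copy starts at 100
    int c = 0;     // will be lastprivate: gets the value from the last iteration
    #pragma omp parallel for firstprivate(b) lastprivate(c)
    for (int i = 0; i < 4; i++) {
        int a = b + i;   // a is a plain block-local, so it is private to each thread anyway
        c = a;           // the sequentially last iteration (i == 3) determines the final c
    }
    printf("c = %d\n", c);   // prints c = 103
    return 0;
}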
I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead of calling #pragma omp parallel for[...] for each loop (which creates all the threads and destroys them right after), I would like to create all of them at the beginning and delete them at the end of the program in order to avoid overhead.
I want something like:
#pragma omp parallel
{
    #pragma omp for[...]
    for(...)

    #pragma omp for[...]
    for(...)
}
The problem is that I have some parts that have to be executed by only one thread, but inside a loop which contains loops that have to be executed in parallel... This is how it looks:
//has to be executed by only one thread
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
    //some stuff
    //loops which have to be parallelized
    #pragma omp parallel for private(b,c) schedule(static) collapse(2)
    for (b=0 ; b<8 ; b++)
        for(c=0 ; c<10 ; c++)
        {
            //some other stuff
        }
    //end of the parallel zone
    //stuff to be executed by only one thread
}
(The loop boundaries are quite small in my example. In my program the number of iterations can go up to 20,000...)
One of my first ideas was to do something like this:
//has to be executed by only one thread
#pragma omp parallel //creating all the threads at the beginning
{
    #pragma omp master //or single
    {
        int a=0,b=0,c=0;
        for(a ; a<5 ; a++)
        {
            //some stuff
            //loops which have to be parallelized
            #pragma omp for private(b,c) schedule(static) collapse(2)
            for (b=0 ; b<8 ; b++)
                for(c=0 ; c<10 ; c++)
                {
                    //some other stuff
                }
            //end of the parallel zone
            //stuff to be executed by only one thread
        }
    }
} //deleting all the threads
It doesn't compile; I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.
Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first OpenMP section and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage from using one big parallel region instead of several smaller ones.
Intel's runtime (which is open source) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel section), then they'll put themselves to sleep. If they do sleep, it will take a bit longer to start them up for the next parallel section, but this depends on the time between regions, not on the syntax.
In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
    #pragma omp for[...]
    for(...)

    #pragma omp single
    { ... }

    #pragma omp for[...]
    for(...)
}
The single directive ensures that the block of code it encloses is only executed by one thread. Unlike the master directive, single also implies a barrier at exit; you can change this behaviour with the nowait clause.
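As a concrete (and hedged) way of mapping your outer-loop structure onto that outline: let every thread execute the outer loop redundantly, put the serial parts in a single block, and share the inner loops with a for construct. The printf below is just a stand-in for your serial work:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                     // the team is created once, here
    {
        for (int a = 0; a < 5; a++)          // every thread runs the outer loop redundantly
        {
            #pragma omp single               // the serial part: exactly one thread executes it
            printf("iteration %d prepared by thread %d\n", a, omp_get_thread_num());
            // implicit barrier here, so all threads see the serial work before continuing

            #pragma omp for collapse(2) schedule(static)
            for (int b = 0; b < 8; b++)
                for (int c = 0; c < 10; c++)
                {
                    // some other stuff, split among the threads
                }
            // implicit barrier at the end of the for construct
        }
    }
    return 0;
}
Because every thread executes the outer loop the same number of times, all of them encounter the single and for constructs in the same order, which is what the nesting rules require.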
I have this parallel for loop
struct p
{
    int n;
    double *l;
} p;

int i;
#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
    DoSomething(p, i);
}
Now, it is possible that inside DoSomething(), p.n is increased because new elements are added to p.l. I'd like to process these elements in a parallel fashion as well. The OpenMP manual states that parallel for can't be used with lists, so DoSomething() currently adds the new elements of p.l to another list, which is processed sequentially and then joined back with p.l. I don't like this workaround. Does anyone know a cleaner way to do this?
A construct to support dynamic execution was added to OpenMP 3.0 and it is the task construct. Tasks are added to a queue and then executed as concurrently as possible. A sample code would look like this:
#pragma omp parallel private(i)
{
    #pragma omp single
    for (i = 0; i < p.n; ++i)
    {
        #pragma omp task
        DoSomething(p, i);
    }
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i. Each DoSomething() call is converted into a task that will later execute inside an idle thread. There is a problem though: if one of the tasks adds new values to p.l, this might happen after the creator thread has already exited the for loop. This could be fixed using task synchronisation constructs and an outer loop, like this:
#pragma omp single
{
    i = 0;
    while (i < p.n)
    {
        for (; i < p.n; ++i)
        {
            #pragma omp task
            DoSomething(p, i);
        }
        #pragma omp taskwait
        #pragma omp flush
    }
}
The taskwait construct makes the thread wait until all queued tasks have been executed. If new elements were added to the list, the condition of the while becomes true again and a new round of task creation happens. The flush construct is supposed to synchronise the memory view between threads, e.g. to update optimised register variables with the value from the shared storage.
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.
Why am I getting this error, and what should I do?
error: firstprivate variable 'j' is private in outer context
void foo() {
    int i;
    int j = 10;
    #pragma omp for firstprivate(j)
    for (i = 0; i < 10; i++)
        printf("%d\n", j);
}
It works if you use the pragma
#pragma omp parallel for firstprivate(j)
Note that omp for and omp parallel for aren't the same thing: the latter is shorthand for an omp for inside an omp parallel.
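Spelled out, the combined directive behaves roughly like the sketch below; written this way, j is shared in the enclosing parallel region, so the firstprivate clause on the worksharing construct is allowed:
#include <stdio.h>

void foo() {
    int i;
    int j = 10;
    #pragma omp parallel                  // j is shared here: it is declared outside the region
    {
        #pragma omp for firstprivate(j)   // fine: j is shared, not private, in the outer context
        for (i = 0; i < 10; i++)
            printf("%d\n", j);
    }
}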
I deleted my first answer because I missed something and it was incorrect. The error is correct because of a restriction in the OpenMP V3.0 spec (and previous versions), section 2.9.3.4 firstprivate clause, Restrictions bullet 2:
• A list item that is private within a parallel region must not appear in a firstprivate clause on a worksharing construct if any of the worksharing regions arising from the worksharing construct ever bind to any of the parallel regions arising from the parallel construct.
The problem is that the implementation wouldn't know which thread's private value to use among the threads that are to execute the worksharing region. If it is a new parallel region instead, then each encountering thread creates its own new region, and the firstprivate copies are initialized from the private copy of the thread creating that region.
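For contrast, here is a sketch of the situation the restriction actually forbids: j is explicitly private in the enclosing parallel region, so a compiler should reject the firstprivate clause with a diagnostic similar to the one above (bar is just a hypothetical name):
#include <stdio.h>

void bar() {
    int i;
    int j = 10;
    #pragma omp parallel private(j)        // j is private within the parallel region
    {
        #pragma omp for firstprivate(j)    // violates the restriction quoted above
        for (i = 0; i < 10; i++)
            printf("%d\n", j);
    }
}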