Why does the result of #pragma omp parallel look like this?

#include <stdio.h>

int main()
{
    int a = 0, b = 0;
    #pragma omp parallel num_threads(16)
    {
        // #pragma omp single
        a++;
        // #pragma omp critical
        b++;
    }
    printf("single: %d -- critical: %d\n", a, b);
}
Why is my output single: 5 -- critical: 5 here?
And why is the output single: 3 -- critical: 3 with num_threads(4)?
I should not write code like this, right? If the threads are racing here (I guess), why is the result consistent?

You have race conditions, so the values of a and b are undefined. To correct this you can:
a) use reduction (the best solution; a complete sketch follows after this list):
#pragma omp parallel num_threads(16) reduction(+:a,b)
b) use atomic operations (generally it is less efficient):
#pragma omp atomic
a++;
#pragma omp atomic
b++;
c) use critical section (i.e. locks, which is the worst solution):
#pragma omp critical
{
a++;
b++;
}
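
For reference, here is a minimal self-contained sketch of option (a) applied to the program from the question (the counter names and thread count are taken from it):

#include <stdio.h>

int main(void)
{
    int a = 0, b = 0;
    // Each thread increments its own private copies of a and b (initialized
    // to 0 for the "+" operator); the copies are summed into the originals
    // at the end of the parallel region.
    #pragma omp parallel num_threads(16) reduction(+:a,b)
    {
        a++;
        b++;
    }
    // With 16 threads this reliably prints 16 for both counters.
    printf("a: %d -- b: %d\n", a, b);
    return 0;
}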

That's just luck, helped by the inherent timing of the operations. Since this is the first parallel region in your program, the threads are created rather than reused. That occupies the main thread for a while and causes the threads to start roughly sequentially, so the chance of overlapping increments is lower. Try adding a #pragma omp barrier between the two increments (see the sketch below) to make the race condition more apparent.
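
A minimal variant of the original program with that barrier added; the barrier lines the threads up after the first increment, so the concurrent b++ updates collide far more often and the lost updates become visible:

#include <stdio.h>

int main(void)
{
    int a = 0, b = 0;
    #pragma omp parallel num_threads(16)
    {
        a++;                 // still a data race on a
        #pragma omp barrier  // every thread waits here before continuing
        b++;                 // the races on b now happen close together
    }
    printf("a: %d -- b: %d\n", a, b);  // both values remain unpredictable
    return 0;
}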

Related

Using omp_get_num_threads() inside the parallel section

I would like to set the number of threads in OpenMP. When I use
omp_set_num_threads(2);
printf("nthread = %d\n", omp_get_num_threads());
#pragma omp parallel for
...
I see nthread = 1. As explained here, the reported number of threads belongs to the serial section, where it is always 1. However, when I move the call to the line after the #pragma, I get a compilation error saying that a for loop is expected after the #pragma. So how can I fix that?
Well, yes: omp parallel for expects a loop on the next line, so you can only call omp_get_num_threads inside that loop. Outside a parallel region you can call omp_get_max_threads to get the maximum number of threads that will be spawned; that is what you are looking for.
int max_threads = omp_get_max_threads();

#pragma omp parallel for
for (...) {
    int current_threads = omp_get_num_threads();
    assert(current_threads == max_threads);
}

#pragma omp parallel
{
    int current_threads = omp_get_num_threads();
    #pragma omp for
    for (...) {
        ...
    }
}
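
A compilable variant of the second pattern, with a made-up trivial loop body, in case it helps to see it end to end:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(2);                          // as in the question

    // outside a parallel region: the number of threads that will be used
    printf("max threads = %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // inside the region: the actual team size (normally prints 2 once)
        #pragma omp single
        printf("team size = %d\n", omp_get_num_threads());

        #pragma omp for
        for (int i = 0; i < 100; i++) {
            /* work */
        }
    }
    return 0;
}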

OpenMP nested parallelism with sections

I have the following situation: I have a big outer for loop that essentially contains a function foo(). Within foo(), there are bar1() and bar2(), which can be carried out concurrently, and bar3(), which needs to be performed after bar1() and bar2() are done. I have parallelized the big outer loop and put bar1() and bar2() in sections. I assume that each outer-loop thread will generate its own section threads; is this correct?
If the assumption above is correct, how do I get bar3() to run only after the threads carrying out bar1() and bar2() have finished? If I use critical, it will block all threads, including those of the outer for loop. If I use single, there's no guarantee that bar1() and bar2() will have finished.
If the assumption above is not correct, how do I force the outer-loop threads to reuse threads for bar1() and bar2() and not generate new threads every time?
Note that temp is a variable whose init and clear are expensive, so I pull them outside the for loop. This further complicates matters because both bar1() and bar2() need some kind of temp variable. Ideally, temp should be initialized and cleared once for each thread that is created, but I'm not sure how to force that for the threads generated for the sections. (Without the sections pragma, it works fine in the parallel block.)
main(){
    #pragma omp parallel private(temp)
    init(temp);
    #pragma omp for schedule(static)
    for (i=0; i<100000; i++) {
        foo(temp);
    }
    clear(temp);
}

foo() {
    init(x); init(y);
    #pragma omp sections
    {
        { bar1(x,temp); }
        #pragma omp section
        { bar2(y,temp); }
    }
    bar3(x,y,temp);
}
I believe that simply parallelizing the for loop should give you enough parallelism to saturate the CPU. But if you really want to run the two functions in parallel, the following code should work.
main(){
    #pragma omp parallel private(temp)
    {
        init(temp);
        #pragma omp for schedule(static)
        for (i=0; i<100000; i++) {
            foo(temp);
        }
        clear(temp);
    }
}

foo() {
    init(x); init(y);
    #pragma omp task
    bar1(x,temp);
    bar2(y,temp);
    #pragma omp taskwait
    bar3(x,y,temp);
}
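
One caveat with the task version (this part is an assumption, since the question doesn't show how x and temp are declared): variables that are private in the enclosing context become firstprivate in a task by default, so if bar1() needs to write its results back into x or temp you may have to spell out the data-sharing clauses, roughly like this:

foo() {
    init(x); init(y);
    /* shared(x, temp) is a hypothetical adjustment: it lets the task work on
       the caller's x and temp instead of firstprivate copies */
    #pragma omp task shared(x, temp)
    bar1(x, temp);        /* may run on another thread of the team */
    bar2(y, temp);        /* runs immediately on the encountering thread */
    #pragma omp taskwait  /* wait for bar1() before combining the results */
    bar3(x, y, temp);
}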

How to nest parallel loops in a sequential loop with OpenMP

I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead of calling #pragma omp parallel for [...] for each loop (which would create all the threads and destroy them right after), I would like to create all of them at the beginning and delete them at the end of the program in order to avoid the overhead.
I want something like:
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp for[...]
for(...)
}
The problem is that some parts have to be executed by only one thread, but they are inside a loop that contains loops which have to be executed in parallel... This is how it looks:
// has to be executed by only one thread
int a=0, b=0, c=0;
for (a; a<5; a++)
{
    // some stuff
    // loops which have to be parallelized
    #pragma omp parallel for private(b,c) schedule(static) collapse(2)
    for (b=0; b<8; b++)
        for (c=0; c<10; c++)
        {
            // some other stuff
        }
    // end of the parallel zone
    // stuff to be executed by only one thread
}
(The loop bounds are quite small in my example; in my program the number of iterations can go up to 20,000...)
One of my first ideas was to do something like this:
// has to be executed by only one thread
#pragma omp parallel // creating all the threads at the beginning
{
    #pragma omp master // or single
    {
        int a=0, b=0, c=0;
        for (a; a<5; a++)
        {
            // some stuff
            // loops which have to be parallelized
            #pragma omp for private(b,c) schedule(static) collapse(2)
            for (b=0; b<8; b++)
                for (c=0; c<10; c++)
                {
                    // some other stuff
                }
            // end of the parallel zone
            // stuff to be executed by only one thread
        }
    }
} // deleting all the threads
It doesn't compile; I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.
Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first OpenMP section and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage from using one big parallel region instead of several smaller ones.
Intel's runtime (which is open source and can be found here) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel section), then put themselves to sleep. If they do sleep, it takes a bit longer to wake them up for the next parallel section, but this depends on the time between regions, not on the syntax.
In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp single
{ ... }
#pragma omp for[...]
for(...)
}
The single directive ensures that the block of code it encloses is executed by only one thread. Unlike the master directive, single also implies a barrier on exit; you can change this behaviour with the nowait clause.
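
A minimal compilable sketch of that pattern (the arrays and the scale variable are made up for illustration): one parallel region, two work-shared loops, and a single block in between for the part that must run on one thread only.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    static double x[N], y[N];
    double scale = 0.0;

    #pragma omp parallel
    {
        // first work-shared loop: all threads take a share of the iterations
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            x[i] = i * 0.5;

        // serial part: exactly one thread executes this block, and the
        // implied barrier makes the other threads wait for it
        #pragma omp single
        scale = x[N - 1];

        // second work-shared loop uses the value computed in the single block
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            y[i] = x[i] / scale;
    }

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}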

Trouble with nested loops and openmp

I am having trouble applying openmp to a nested loop like this:
#pragma omp parallel shared(S2,nthreads,chunk) private(a,b,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("\nNumber of threads = %d\n", nthreads);
    }
    #pragma omp for schedule(dynamic,chunk)
    for (a=0; a<NREC; a++) {
        for (b=0; b<NLIG; b++) {
            S2 = S2 + cos(1+sin(atan(sin(sqrt(a*2+b*5)+cos(a)+sqrt(b)))));
        }
    } // end for a
} /* end of parallel section */
When I compare the serial version with the OpenMP version, the latter gives weird results. Even when I remove #pragma omp for, the results from OpenMP are not correct. Do you know why, or can you point to a good tutorial that is explicit about double loops and OpenMP?
This is a classic example of a race condition. Each of your OpenMP threads is accessing and updating a shared value at the same time, and there's no guarantee that updates won't get lost (at best) or that the resulting answer won't be gibberish (at worst).
The thing with race conditions is that they depend sensitively on timing; in a smaller case (e.g., with smaller NREC and NLIG) you might sometimes miss it, but in a larger case it'll eventually always come up.
The reason you get wrong answers without the #pragma omp for is that as soon as you enter the parallel region, all of your OpenMP threads start; and unless you use something like an omp for (a so-called work-sharing construct) to split up the work, each thread does everything in the parallel section - so all the threads compute the same entire sum, all updating S2 simultaneously.
You have to be careful with OpenMP threads updating shared variables. OpenMP has atomic operations that allow you to safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order that it's hard to see what's going on, so I've changed your sum somewhat). In mysumallatomic, each thread updates S2 as before, but this time it's done safely:
#include <omp.h>
#include <math.h>
#include <stdio.h>

double mysumorig() {
    double S2 = 0;
    int a, b;
    for (a=0; a<128; a++) {
        for (b=0; b<128; b++) {
            S2 = S2 + a*b;
        }
    }
    return S2;
}

double mysumallatomic() {
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for (int a=0; a<128; a++) {
        for (int b=0; b<128; b++) {
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

double mysumonceatomic() {
    double S2 = 0.;
    #pragma omp parallel shared(S2)
    {
        double mysum = 0.;
        #pragma omp for
        for (int a=0; a<128; a++) {
            for (int b=0; b<128; b++) {
                mysum += (double)a*b;
            }
        }
        #pragma omp atomic
        S2 += mysum;
    }
    return S2;
}

int main() {
    printf("(Serial) S2 = %f\n", mysumorig());
    printf("(All Atomic) S2 = %f\n", mysumallatomic());
    printf("(Atomic Once) S2 = %f\n", mysumonceatomic());
    return 0;
}
However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent concurrent operations on the variable S2!), so a better approach is to let each thread do its summation privately and perform the atomic operation only once after both loops rather than 128*128 times; that's the mysumonceatomic() routine, which incurs the synchronization overhead only once per thread rather than ~16k times in total.
But this is such a common operation that there's no need to implement it yourself. You can use OpenMP's built-in functionality for reduction operations (a reduction is an operation like computing the sum of a list, or finding the min or max of a list, which can be done one element at a time by looking only at the result so far and the next element), as suggested by @ejd. OpenMP will work and is faster (its optimized implementation is much faster than what you can build yourself from other OpenMP operations).
As you can see, either approach works:
$ ./foo
(Serial) S2 = 66064384.000000
(All Atomic) S2 = 66064384.000000
(Atomic Once) S2 = 66064384.000000
The problem isn't with double loops but with variable S2. Try putting a reduction clause on your for directive:
#pragma omp for schedule(dynamic,chunk) reduction(+:S2)
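For completeness, here is roughly what the reduction version of the routine above would look like (the function name mysumreduction is made up; the 128x128 bounds match the earlier examples):

double mysumreduction() {
    double S2 = 0.;
    // Each thread accumulates into a private copy of S2; the copies are
    // combined with "+" when the parallel loop finishes.
    #pragma omp parallel for reduction(+:S2)
    for (int a = 0; a < 128; a++) {
        for (int b = 0; b < 128; b++) {
            S2 += (double)a * b;
        }
    }
    return S2;   // 66064384.0, same as the serial version
}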

How to use omp barrier in a while loop with an unequal number of iterations per thread

I'm trying to implement the list-ranking problem (also known as shortcutting) with OpenMP to get the prefix sums of the array W.
I don't know if I'm using the flush pragma correctly.
And I get a warning when compiling: "barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

main(int argc, char *argv[])
{
    int Q[9]={1,2,3,4,5,6,7,8,0};
    int W[8]={1,2,3,4,5,6,7,8};
    int i, j=6, id;

    printf("Before:\n");
    for (j=0; j<8; j++)
        printf("%d", W[j]);
    printf("\n");

    #pragma omp parallel for shared(Q,W) private(id) num_threads(7)
    for (i=6; i>=0; i--)
    {
        id = omp_get_thread_num();
        while ((Q[i] != 0) && (Q[Q[i]] != 0))
        {
            #pragma omp flush(W)
            W[i] = W[i] + W[Q[i]];
            #pragma omp flush(W)
            printf("Am %d \t W[%d]= %d", id, i, W[i]);
            #pragma omp barrier
            #pragma omp flush(Q)
            Q[i] = Q[Q[i]];
            #pragma omp flush(Q)
            printf("Am %d \n Q[%d]= %d", id, i, Q[i]);
        }
    }

    printf("Result:\n");
    for (j=0; j<8; j++)
        printf("%d \t", W[j]);
    printf("\n");
}
Please help!
You can't use a barrier inside an omp parallel for; you can pretty much only use a barrier directly inside an omp parallel region.
The reason is that the iterations of a parallel for are divided among the threads, so there is no guarantee that every thread reaches the barrier, or reaches it the same number of times; that can deadlock, which is why OpenMP forbids a barrier closely nested inside a work-sharing region.
I didn't look up the algorithm here, but two reasonable choices are to refactor into two parallel for loops, one after the other, split at the point where the barrier is, or to refactor your algorithm to use a plain #pragma omp parallel region.
I looked up the list-ranking algorithm; you will be well served by finding an implementation of prefix sum or scan if you must use OpenMP (see the sketch below).
-Rick
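
For what it's worth, here is a minimal sketch of a two-pass prefix sum over a plain array with OpenMP. It is not the linked-list ranking from the question, just the scan primitive the answer refers to; the array name W, its contents, and the MAX_THREADS bound are placeholders.

#include <stdio.h>
#include <omp.h>

#define N 8
#define MAX_THREADS 64

int main(void)
{
    int W[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int sums[MAX_THREADS + 1] = {0};     // block totals; sums[0] stays 0 (assumes <= MAX_THREADS threads)

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int lo  = (int)((long)N * tid / nth);        // this thread's block of W
        int hi  = (int)((long)N * (tid + 1) / nth);

        // Pass 1: inclusive scan within the block, remember the block total.
        for (int i = lo + 1; i < hi; i++)
            W[i] += W[i - 1];
        if (hi > lo)
            sums[tid + 1] = W[hi - 1];

        #pragma omp barrier          // all block totals are now available

        // One thread turns the block totals into per-block offsets.
        #pragma omp single
        for (int t = 1; t <= nth; t++)
            sums[t] += sums[t - 1];

        // Pass 2: add the offset of all preceding blocks to this block.
        for (int i = lo; i < hi; i++)
            W[i] += sums[tid];
    }

    for (int i = 0; i < N; i++)
        printf("%d ", W[i]);         // prints 1 3 6 10 15 21 28 36
    printf("\n");
    return 0;
}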
