Reordered output despite critical section - c

I'm trying to adapt this Pascal's triangle program into a parallel program using OpenMP. I used the for directive to parallelize the loop in the printPas function, and put the conditional statements inside a critical section so only one thread can print at a time, but it seems like I'm still getting a data race because my output is really inconsistent.
#include <stdio.h>

#ifndef N
#define N 2
#endif

unsigned int t1[2*N+1], t2[2*N+1];
unsigned int *e=t1, *r=t2;
int l = 0;

//the problem is here in this function
void printPas() {
    #pragma omp parallel for private(l)
    for (l=0; l<2*N+1; l++) {
        #pragma omp critical
        if (e[l]==0)
            printf(" ");
        else
            printf("%6u", e[l]);
    }
    printf("\n");
}

void update() {
    r[0] = e[1];
    #pragma omp parallel for
    for (int u=1; u<2*N; u++)
        r[u] = e[u-1]+e[u+1];
    r[2*N] = e[2*N-1];
    unsigned int *tmp = e; e=r; r=tmp;
}

int main() {
    e[N] = 1;
    for (int i=0; i<N; i++) {
        printPas();
        update();
    }
    printPas();
}

Your critical section causes the prints to run sequentially, so the code takes longer with 'critical' than it would if you hadn't tried to parallelise it at all.
More importantly, a critical section only guarantees mutual exclusion, not ordering: with different threads doing the printing, you have no idea which one will enter the critical section first, so the loop iterations will not print in the order you would hope.
I suggest either removing the parallel directive ("#pragma omp parallel for private(l)"), or removing the 'critical' and accepting that the prints will come out in a different order every time.
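If you take the first suggestion, printPas just goes back to a plain serial loop. A minimal sketch (same body as your code, only the OpenMP directives removed and the loop index made local):

void printPas() {
    for (int l = 0; l < 2*N+1; l++) {
        if (e[l] == 0)
            printf(" ");
        else
            printf("%6u", e[l]);
    }
    printf("\n");
}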

Related

Is a function without loop parallelizable?

Considering the code below, can we consider it parallel even though there are no loops?
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 1;
        a = 0;
    }
    return 0;
}
Direct Answer:
Yes, here, the section of your code,
int a = 1;
a = 0;
runs in parallel, P times, where P is the number of threads in the team (by default usually the number of cores on your machine).
For example on a four core machine, the following code (with the relevant imports),
int main(void) {
    #pragma omp parallel
    {
        printf("Thread number %d\n", omp_get_thread_num());
    }
    return 0;
}
would output:
Thread number 0
Thread number 1
Thread number 2
Thread number 3
Note that when running in parallel, there is no guarantee on the order of the output, so the output could just as likely be something like:
Thread number 1
Thread number 2
Thread number 0
Thread number 3
Additionally, if you wanted to specify the number of threads used in the parallel region, instead of #pragma omp parallel you could write, #pragma omp parallel num_threads(4).
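For example, a minimal sketch of the thread-number program above pinned to four threads (the output order is still not guaranteed):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // request exactly four threads for this parallel region
    #pragma omp parallel num_threads(4)
    {
        printf("Thread number %d\n", omp_get_thread_num());
    }
    return 0;
}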
Further Explanation:
If you are still confused, it may be helpful to better understand the difference between parallel for loops and parallel code regions.
#pragma omp parallel tells the compiler that the following structured block is to be executed by a team of threads in parallel. It also guarantees that all code within the parallel region has finished executing before execution continues with the subsequent code.
In the following (toy) example, the programmer is guaranteed that after the parallel region, the array will have all entries set to zero.
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();
#pragma omp parallel num_threads(P)
{
    const int chunk = 128 / P;                        // assumes P divides 128 evenly
    const int local_start = omp_get_thread_num() * chunk;
    const int local_end = local_start + chunk;
    for (int i = local_start; i < local_end; ++i) {
        arr[i] = 0;
    }
}
// any code from here onward is guaranteed that arr contains all zeros!
Ignoring differences in scheduling, this task could equivalently be accomplished using a parallel for loop as follows:
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();
#pragma omp parallel for num_threads(P)
for (int i = 0; i < 128; ++i) {
    arr[i] = 0;
}
// any code from here onward is guaranteed that arr contains all zeros!
Essentially, #pragma omp parallel enables you to describe regions of code that can execute in parallel - this can be much more flexible than a parallel for loop. In contrast, #pragma omp parallel for should generally be used to parallelize loops with independent iterations.
I can further elaborate on the differences in performance, if you would like.

Is it beneficial to parallelize variable declaration?

I wonder whether it is beneficial, when writing a parallel program, to move variable declarations into the parallel section. Amdahl's law says that the larger the parallel portion of the program, the better, but I don't see the point of parallelising variable declarations and return statements. For example, this is the normal parallel code:
#include <omp.h>

int main(void) {
    int a = 0;
    int b[5];
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;
        }
    }
    return 0;
}
Will it be beneficial regarding Amdahl's law to write this (so 100% of the program is parallel):
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 0;
        int b[5];
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;
        }
        return 0;
    }
}
These codes are not equivalent: in the first case, a and b are shared variables (since shared is the default data-sharing attribute), while in the second case they are thread-private variables that do not exist beyond the scope of the parallel region.
Besides, the return statement within the parallel region in the second piece of code is illegal and will cause a compilation error.
As seen for instance in this OpenMP 4.0 reference card:
An OpenMP executable directive applies to the succeeding structured
block or an OpenMP construct. Each directive starts with #pragma omp.
The remainder of the directive follows the conventions of the C and
C++ standards for compiler directives. A structured-block is a single
statement or a compound statement with a single entry at the top and a
single exit at the bottom.
A block that contains the return statement is not a structured-block since it does not have a single exit at the bottom (i.e. the closing brace } is not the only exit since return is another one). It may not legally follow the #pragma omp parallel directive.
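For illustration, here is a minimal sketch of the second version made legal by moving the return outside the structured block. As explained above, a and b then become private to each thread, so this is still not equivalent to the first version:

#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 0;     // private: each thread has its own copy
        int b[5];      // private: each thread has its own copy
        #pragma omp for
        for (int i = 0; i < 5; ++i) {
            b[i] = a;  // each thread fills only its share of its own private b
        }
    }                  // single exit at the bottom: the closing brace
    return 0;          // the return must stay outside the parallel region
}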

Counting does not work properly in OpenMP

I have the function
void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */

    #pragma omp parallel for
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        omp_set_num_threads(nThreads);
        while (n > 1)
        {
            isodd = n%2;
            if (isodd)
                n = 3*n+1;
            else
                n/=2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}
It works as I wish when run serially (i.e. compiling without OpenMP, or commenting out #pragma omp parallel for and omp_set_num_threads(nThreads);). However, the parallel version produces wrong results, and I think it is because the counter variable needs to be set to zero at the beginning of each loop iteration and perhaps another thread can work with a non-zeroed counter value. But even if I use #pragma omp parallel for private(counter), the problem still occurs. What am I missing?
I compile the program as C89.
Inside your OpenMP parallel region, you assign values to the scalar variables counter, n and isodd. They therefore cannot simply be shared, as they are by default; you need to pay extra attention to them.
A quick analysis shows that their values are only meaningful inside the parallel region, and only for the current thread, so it becomes clear that they need to be declared private.
Adding a private(counter, n, isodd) clause to your #pragma omp parallel for directive should fix the issue.
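Putting that together, a minimal sketch of the fixed function (the omp_set_num_threads call is also hoisted out of the loop here, since calling it inside the region does not change the thread count of a region that is already running):

void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */

    omp_set_num_threads(nThreads);  /* set the thread count before the parallel region */
    #pragma omp parallel for private(counter, n, isodd)
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        while (n > 1)
        {
            isodd = n%2;
            if (isodd)
                n = 3*n+1;
            else
                n/=2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}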

Partially parallel loops using openmp tasks

Prerequisites:
parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
parallel constructs: OpenMP tasks
compiler: gcc 4.9.x (supports OpenMP 4.0)
Input:
C code with loops
the loop has a cross-iteration data dependency: iteration “i+1” needs data from iteration “i” (only this kind of dependency, nothing else)
the loop body can be partially dependent
the loop cannot be split in two loops; the loop body should remain solid
anything reasonable can be added to the loop or to the loop-body function definition
Code sample:
(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);          // independent, does not change “config”
    *value = process(conf, *value);  // dependent, takes prev., produce next
    return;
}

int main()
{
    int N = 100;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}
Need to:
parallelise the loop using omp tasks (omp for / omp sections cannot be used)
“prepare” functions should be executed in parallel with other “prepare” or “process” functions
“process” functions should be ordered according to the data dependency
What has been proposed and implemented:
define an integer flag
assign the number of the first iteration to it
every iteration, when it needs data, waits for the flag to equal its own iteration number
update the flag value when the data for the next iteration is ready
Like this:
(A reminder: the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config);          // independent, do not change “config”
    while (*parSync != iteration)    // wait for previous to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value);  // dependent, takes prev., produce next
    *parSync = iteration + 1;        // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
        loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}
What happened:
It fails. :)
The reason is that an OpenMP task occupies an OpenMP thread.
For example, if we define 5 OpenMP threads (as in the code above), the for loop generates 100 tasks.
The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.
If the task with i=0 is not among the started tasks (which happens from time to time), the executing tasks wait forever, occupy their threads forever, and the task with i=0 never gets started.
What's next?
I have no other ideas for how to implement the required mode of computation.
Current solution
Thanks to #parallelgeek below for the idea.
int main()
{
    int N = 10;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads()-1;  // reserve 1 thread for task generation
        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }
            #pragma omp atomic update
            workingTasks++;

            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);

                #pragma omp atomic update
                workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
AFAIK volatile doesn't prevent hardware reordering, which is why you could end up with a mess in memory: the data may not have been written yet while the flag is already seen by the consuming thread as set.
That's why a little piece of advice: use C11 atomics instead, in order to ensure visibility of the data. As far as I can see, gcc 4.9 supports C11 (C11Status in GCC).
You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and start generating the subsequent group (after the tasks in the first group have been generated) only once any of the running tasks has finished. That way you maintain the invariant that at any time only K tasks are running, scheduled on K threads.
Inter-task dependencies could also be handled using atomic flags from C11.
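As a minimal sketch of that advice (assuming a C11 compiler, e.g. gcc 4.9 with -std=gnu11), the volatile flag from the question could be replaced with an atomic_int; prepare and process are the same externally defined functions as above:

#include <stdatomic.h>

int prepare(const char* config);   /* as in the question */
int process(int conf, int value);  /* as in the question */

void loopFunc(const char* config, int* value, atomic_int *parSync, int iteration)
{
    int conf = prepare(config);      /* independent part, may run in parallel */

    /* wait until the previous iteration has published its result */
    while (atomic_load_explicit(parSync, memory_order_acquire) != iteration)
    {
        #pragma omp taskyield
    }

    *value = process(conf, *value);  /* dependent part, strictly ordered */

    /* publish readiness for iteration i+1 */
    atomic_store_explicit(parSync, iteration + 1, memory_order_release);
}

In main, parallelSync would correspondingly be declared as atomic_int instead of volatile int.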

Why does the compiler ignore OpenMP pragmas?

In the following C code I am using OpenMP in a nested loop. Since a race condition occurs, I want to perform atomic operations at the end:
double mysumallatomic() {
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}
The thing is that #pragma omp atomic has no effect on the program's behaviour: even if I remove it, nothing happens. Even if I change it to #pragma oh_my_god, I get no error!
I wonder what is going wrong here, whether I can tell the compiler to be stricter when checking omp pragmas, and why I do not get an error when I make that last change.
PS: For compilation I use:
gcc-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
PS2: New code that gives me the same problem, based on gillespie's idea:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

#define NRACK 64
#define NSTARS 1024

double mysumallatomic_serial(float rocks[NRACK][3], float moon[NSTARS][3],
                             float qr[NRACK], float ql[NSTARS]) {
    int j,i;
    float temp_div=0.,temp_sqrt=0.;
    float difx,dify,difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    for(j=0; j<NRACK; j++){
        for(i=0; i<NSTARS;i++){
            difx=rocks[j][0]-moon[i][0];
            dify=rocks[j][1]-moon[i][1];
            difz=rocks[j][2]-moon[i][2];
            mod2x=difx*difx;
            mod2y=dify*dify;
            mod2z=difz*difz;
            temp_sqrt=sqrt(mod2x+mod2y+mod2z);
            temp_div=1/temp_sqrt;
            S2 += ql[i]*temp_div*qr[j];
        }
    }
    return S2;
}

double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3],
                      float qr[NRACK], float ql[NSTARS]) {
    float temp_div=0.,temp_sqrt=0.;
    float difx,dify,difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for(int j=0; j<NRACK; j++){
        for(int i=0; i<NSTARS;i++){
            difx=rocks[j][0]-moon[i][0];
            dify=rocks[j][1]-moon[i][1];
            difz=rocks[j][2]-moon[i][2];
            mod2x=difx*difx;
            mod2y=dify*dify;
            mod2z=difz*difz;
            temp_sqrt=sqrt(mod2x+mod2y+mod2z);
            temp_div=1/temp_sqrt;
            float myterm=ql[i]*temp_div*qr[j];
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

int main(int argc, char *argv[]) {
    float rocks[NRACK][3], moon[NSTARS][3];
    float qr[NRACK], ql[NSTARS];
    int i,j;
    for(j=0;j<NRACK;j++){
        rocks[j][0]=j;
        rocks[j][1]=j+1;
        rocks[j][2]=j+2;
        qr[j] = j*1e-4+1e-3;
        //qr[j] = 1;
    }
    for(i=0;i<NSTARS;i++){
        moon[i][0]=12000+i;
        moon[i][1]=12000+i+1;
        moon[i][2]=12000+i+2;
        ql[i] = i*1e-3 +1e-2 ;
        //ql[i] = 1 ;
    }
    printf(" serial: %f\n", mysumallatomic_serial(rocks,moon,qr,ql));
    printf(" openmp: %f\n", mysumallatomic(rocks,moon,qr,ql));
    return(0);
}
Using the flag -Wall highlights pragma errors. For example, when I misspell atomic I get the following warning:
main.c:15: warning: ignoring #pragma omp atomic1
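For instance, a sketch of the question's compile line with -Wall added:

gcc-4.2 -Wall -fopenmp main.c functions.c -o main_elec_gcc.exe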
I'm sure you know, but just in case: your example should be handled with a reduction.
When you use omp parallel, the default is for all variables to be shared. This is not what you want in your case: for example, each thread computes a different value of difx, so it must be private. Instead, your loop should be:
#pragma omp parallel for default(none),\
    private(difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, temp_div, i, j),\
    shared(rocks, moon, ql, qr), reduction(+:S2)
for(j=0; j<NRACK; j++){
    for(i=0; i<NSTARS;i++){
        difx=rocks[j][0]-moon[i][0];
        dify=rocks[j][1]-moon[i][1];
        difz=rocks[j][2]-moon[i][2];
        mod2x=difx*difx;
        mod2y=dify*dify;
        mod2z=difz*difz;
        temp_sqrt=sqrt(mod2x+mod2y+mod2z);
        temp_div=1/temp_sqrt;
        S2 += ql[i]*temp_div*qr[j];
    }
}
I know this is an old post, but I think the problem is the order of the parameters of gcc: -fopenmp should be at the end of the compilation line.
First, depending on the implementation, reduction might be better than using atomic. I would try both and time them to see for sure.
Second, if you leave off the atomic, you may or may not see the problem (wrong result) associated with the race. It is all about timing, which from one run to the next can be quite different. I have seen cases where the result was wrong only once in 150,000 runs and others where it has been wrong all the time.
Third, the idea behind pragmas is that the user doesn't need to know about them if they have no effect. Besides that, the philosophy in Unix (and its derivatives) is to stay quiet unless there is a problem. That said, many implementations have some sort of flag so the user can get more information about what is happening. You can try -Wall with gcc; at the very least it should flag the oh_my_god pragma as being ignored.
You have
#pragma omp parallel for shared(S2)
for(int a=0; a<128; a++){
....
So the parallelization applies only to the outer for loop.
If you want to have the atomic or the reduction, you have to do:
#pragma omp parallel
{
    #pragma omp for  /* note: shared() is not a valid clause on omp for; S2 is already shared here by default */
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }  // end of second for
    }  // end of 1st for
}  // end of parallel code
return S2;
}  // end of function
Otherwise, everything after the # is simply ignored.
