Is the following code a valid OpenMP parallel program? - c

I'm studying for an exam at the moment, and one of the practice questions is to write a piece of code that computes the parallel sum of all elements of an array on an SMP computer. I've written a few OpenMP programs before that were much larger but didn't really take advantage of the clauses and directives. Now that I've come across the reduction clause, I'm wondering: is the following piece a parallel program, and how simple can the program be made while still retaining parallelisation?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char ** argv){
    int n = atoi(argv[1]);
    double * X;
    X = malloc(n * sizeof(double));
    for(int i = 0; i < n; i++){ X[i] = 2.0; }

    int i = 0;
    double sum = 0.0;
    omp_set_num_threads(atoi(argv[2]));

    #pragma omp parallel for private(i), shared(X) reduction(+: sum)
    for(i = 0; i < n; i++){
        sum += X[i];
    }

    printf("Sum is : %0.2f\n", sum);
    return 0;
}

I came across the reduction clause so I was wondering if the
following piece is a parallel program
From the OpenMP standard, on #pragma omp parallel:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region. The thread that encountered
the parallel construct becomes the master thread of the new team, with
a thread number of zero for the duration of the new parallel region.
All threads in the new team, including the master thread, execute the
region. Once the team is created, the number of threads in the team
remains constant for the duration of that parallel region.
So yes, it is a parallel program, as long as the number of threads that you explicitly set with:
omp_set_num_threads(atoi(argv[2]));
is greater than 1.
In:
#pragma omp parallel for private(i), shared(X) reduction(+: sum)
The private(i) can be omitted: since i is used as the index of the parallel loop, OpenMP makes it private automatically.
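For illustration, a minimal sketch of the simplified loop (same X, n and sum as in the code above; the loop index is implicitly private and X is shared by default, so only the reduction clause is needed):

    /* Minimal sketch: only the reduction clause is strictly required here. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++) {
        sum += X[i];
    }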

Why not just run it and test? You can run something like this on your laptop/desktop as well.
Yes, the reduction operation in this code is parallel (sum computation).
However, setting elements of array X to 2.0 is sequential. You would want a #pragma omp parallel for over for(int i = 0; i < n; i++){ X[i] = 2.0; } in order to make the program truly parallel. Otherwise, Amdahl's law will apply.
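A minimal sketch of that parallelised initialisation, assuming the same X and n as above:

    /* Sketch: parallelise the initialisation too, so the only serial parts
       left are the allocation and the final printf. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        X[i] = 2.0;
    }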

Related

Why should I use a reduction rather than an atomic variable?

Assume we want to count something in an OpenMP loop. Compare the reduction
int counter = 0;
#pragma omp for reduction( + : counter )
for (...) {
    ...
    counter++;
}
with the atomic increment
int counter = 0;
#pragma omp for
for (...) {
    ...
    #pragma omp atomic
    counter++;
}
The atomic access provides the result immediately, while a reduction only attains its correct value at the end of the loop. For instance, reductions do not allow this:
int t = counter;
if (t % 1000 == 0) {
    printf ("%dk iterations\n", t/1000);
}
thus providing less functionality.
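For instance, a sketch of the progress-reporting pattern that only the atomic variant supports (using atomic capture to read back the incremented value; the loop bound n is just an assumption for illustration):

    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* ... per-iteration work ... */
        int t;
        #pragma omp atomic capture
        t = ++counter;                 /* increment and read back atomically */
        if (t % 1000 == 0) {
            printf("%dk iterations\n", t / 1000);
        }
    }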
Why would I ever use a reduction instead of atomic access to a counter?
Short answer:
Performance
Long Answer:
Because an atomic variable comes with a price, and this price is synchronization.
In order to ensure that there are no race conditions (i.e. two threads modifying the same variable at the same moment), threads must synchronize, which effectively means that you lose parallelism: the threads are serialized.
Reduction on the other hand is a general operation that can be carried out in parallel using parallel reduction algorithms.
Read this and this article for more info about parallel reduction algorithms.
Addendum: Getting a sense of how a parallel reduction works
Imagine a scenario where you have 4 threads and you want to reduce an 8-element array A. You could do this in 3 steps:
Step 0. Threads with index i<4 compute A[i]=A[i]+A[i+4].
Step 1. Threads with index i<2 compute A[i]=A[i]+A[i+2].
Step 2. The thread with index i<1 (i.e. thread 0) computes A[0]=A[0]+A[1].
At the end of this process the result of the reduction is in the first element of A, i.e. A[0].
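A sketch of this tree reduction written out by hand, assuming a power-of-two element count (8 here); in real code the reduction clause does all of this for you:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double A[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* 8 elements to reduce */
        int n = 8;

        /* At each step, the first half of the remaining range accumulates the
           second half, halving the active range until one value is left. */
        for (int stride = n / 2; stride >= 1; stride /= 2) {
            #pragma omp parallel for
            for (int i = 0; i < stride; i++) {
                A[i] = A[i] + A[i + stride];
            }
        }

        printf("sum = %g\n", A[0]);   /* prints 36 */
        return 0;
    }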
Performance is the key point.
Consider the following program
#include <stdio.h>
#include <omp.h>
#define N 1000000
int a[N], sum;

int main(){
    double begin, end;

    begin = omp_get_wtime();
    for(int i = 0; i < N; i++)
        sum += a[i];
    end = omp_get_wtime();
    printf("serial %g\t", end - begin);

    begin = omp_get_wtime();
    #pragma omp parallel for
    for(int i = 0; i < N; i++){
        #pragma omp atomic
        sum += a[i];
    }
    end = omp_get_wtime();
    printf("atomic %g\t", end - begin);

    begin = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < N; i++)
        sum += a[i];
    end = omp_get_wtime();
    printf("reduction %g\n", end - begin);
}
When executed (gcc -O3 -fopenmp), it gives :
serial 0.00491182 atomic 0.0786559 reduction 0.001103
So, approximately, atomic = 20x serial = 80x reduction.
The "reduction" properly exploits the parallelism, and on a 4-core computer we can get a 3-6x performance boost vs "serial".
Now, "atomic" is 20 times slower than "serial". Not only does the serialization of memory accesses disable parallelism, as explained in the previous answer, but all memory accesses are performed by atomic operations. These operations require at least 20-50 cycles on modern computers and will dramatically slow down your performance if used intensively.

Using OpenMP in for loop results in incorrect output [duplicate]

I am trying to use OpenMP to add the numbers in an array. The following is my code:
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i = 0; i < snum; i++){
    input[i] = i+1;
}
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++)
{
    int* tmpsum = input + i;
    sum += *tmpsum;
}
This does not produce the right result for sum. What's wrong?
Your code currently has a race condition, which is why the result is incorrect. To illustrate why this is, let's use a simple example:
You are running on 2 threads and the array is int input[4] = {1, 2, 3, 4};. You initialize sum to 0 correctly and are ready to start the loop. In the first iteration of your loop, thread 0 and thread 1 read sum from memory as 0, and then add their respective element to sum, and write it back to memory. However, this means that thread 0 is trying to write sum = 1 to memory (the first element is 1, and sum = 0 + 1 = 1), while thread 1 is trying to write sum = 2 to memory (the second element is 2, and sum = 0 + 2 = 2). The end result of this code depends on which one of the threads finishes last, and therefore writes to memory last, which is a race condition. Not only that, but in this particular case, neither of the answers that the code could produce are correct! There are several ways to get around this; I'll detail three basic ones below:
#pragma omp critical:
In OpenMP, there is what is called a critical directive. This restricts the code so that only one thread can do something at a time. For example, your for-loop can be written:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    #pragma omp critical
    sum += *tmpsum;
}
This eliminates the race condition as only one thread accesses and writes to sum at a time. However, the critical directive is very very bad for performance, and will likely kill a large portion (if not all) of the gains you get from using OpenMP in the first place.
#pragma omp atomic:
The atomic directive is very similar to the critical directive. The major difference is that, while the critical directive applies to anything that you would like to do one thread at a time, the atomic directive only applies to memory read/write operations. As all we are doing in this code example is reading and writing to sum, this directive will work perfectly:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    #pragma omp atomic
    sum += *tmpsum;
}
The performance of atomic is generally significantly better than that of critical. However, it is still not the best option in your particular case.
reduction:
The method you should use, and the method that has already been suggested by others, is reduction. You can do this by changing the for-loop to:
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    sum += *tmpsum;
}
The reduction command tells OpenMP that, while the loop is running, you want each thread to keep track of its own sum variable, and add them all up at the end of the loop. This is the most efficient method as your entire loop now runs in parallel, with the only overhead being right at the end of the loop, when the sum values of each of the threads need to be added up.
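To make that concrete, here is a sketch of roughly what the reduction does behind the scenes, written out with the same sum, input and snum as above (the actual runtime implementation may differ):

    int sum = 0;
    #pragma omp parallel
    {
        int local_sum = 0;                /* private partial sum per thread */
        #pragma omp for schedule(static)
        for (int i = 0; i < snum; i++) {
            local_sum += input[i];
        }
        #pragma omp atomic                /* one combine per thread, not per element */
        sum += local_sum;
    }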
Use the reduction clause (description at MSDN).
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i = 0; i < snum; i++){
    input[i] = i+1;
}
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i = 0; i < snum; i++)
{
    sum += input[i];
}

OpenMP summations explanation

This is my first time using OpenMP and I feel I have a core misunderstanding in the implementation of the following:
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, n;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100;
    result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    //pragma statement for omp here
    for (int i = 0; i < n; i++)
        result = result + (a[i] * b[i]);

    printf("Final result= %f\n", result);
}
The program is designed to calculate a dot product which involves a summation.
In this question, the person answering suggests using reduction to implement the parallel summation in the for loop with #pragma omp parallel for reduction(+:result). However, when experimenting I get the same answer as if I just used #pragma omp parallel for, which I naively assumed to be correct, but it left me with a feeling of unease because I could not find any documentation saying this isn't correct. An explanation of why I am probably wrong would be helpful.
Using #pragma omp parallel for reduction(+:result) is correct. #pragma omp parallel for is wrong. The latter means that all threads write to result in an unprotected way. It is a classical race condition. In practice, you may very well get the same result by coincidence, for example because the hardware works atomically, the OS doesn't schedule threads on different cores or just pure luck. Don't be fooled, the code is still wrong.
Unfortunately, you cannot prove a code to be correct just by showing that it produces a correct result sometimes. Likewise, you cannot show two codes to be equal by having them produce the same result a few times in testing. In other words, an incorrect code does not always reveal itself easily.
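For reference, a sketch of how the correct pragma slots into the dot-product loop above (same result, a, b and n as in the question):

    /* Each thread accumulates into its own private copy of result;
       the copies are summed once at the end of the loop. */
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++)
        result = result + (a[i] * b[i]);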

How to declare arrays in omp pragma

I'm modifying an existing library from single-threaded to multi-threaded. I have code like the one provided below. I can't understand how to declare the arrays x, y, array1 and array2. Which of them should I declare as shared or threadprivate? Do I need to use flush? If yes, in which case?
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for
for(i = 0; i < 100; i++)
{
    y[i] = i*i-3*i-10*random();
    x[i] = myfunc(i, y[i]);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    array1[array2[j]] += z[j] + j;
    return array1[j];
}
The problem I see in your code is in this line
array1[array2[j]] += z[j]+j;
This means that array1 can potentially be modified by whichever j index. And j in the context of the function myfunc() corresponds to index i at the upper level. The trouble is that i is the index over which the loop is parallelised; therefore, array1 can be modified concurrently at any moment by any thread.
The crucial question now is to know if array2 can have the same value for different indexes:
If you are sure that for whatever j1 != j2 you have array2[j1] != array2[j2], then your code is trivially parallelisable.
If there are values j1 != j2 for which you have array2[j1] == array2[j2], then you have dependencies across iterations for array1 and the code is no longer (simply and/or effectively) parallelisable.
So let's assume we are in the former case, then the OpenMP directives you have already in the code are sufficient:
i needs to be private but is implicitly already so as it is the index of the parallelised loop;
x and y should be shared (which they are by default) since their access index is the one that is distributed in parallel (namely i), so their parallel updates do not overlap;
array2 is only accessed in read mode so it's a no brainer shared (which it is by default again);
array1 is read and written, but due to our initial assumption, there are no possible collisions between threads as their sets of indexes to access it are disjoint. Therefore, the default shared qualifier works just fine.
But now, if we are in the case where array2 allows for non-disjoint sets of indexes for accessing array1, we will have to preserve the ordering of these accesses / updates of array1. This can be done with the ordered clause / directive. And since we still want the parallelisation to be (somewhat) effective, we will have to add a schedule(static,1) clause to the parallel directive. For more details about this, please refer to this great answer. Your code would now look like this:
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for schedule(static,1) ordered
for(i = 0; i < 100; i++)
{
    y[i] = i*i-3*i-10*random();
    x[i] = myfunc(i, y[i]);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    int tmp = z[j] + j;
    #pragma omp ordered
    array1[array2[j]] += tmp;
    return array1[j];
}
This would (I think) work and, in terms of parallelism, not be too bad (for a limited number of threads), but it has a big (enormous) flaw: it generates tons of false sharing while updating x and y. Therefore, it might be more advantageous to use per-thread copies of these and to only update the global arrays at the end. The central part of the code snippet would then look something like this (not tested at all):
int nbth;
#pragma omp parallel
#pragma omp single
nbth = omp_get_num_threads();

int *xm = malloc(1000000*nbth*sizeof(int));
int *ym = malloc(1000000*nbth*sizeof(int));

#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int *xx = xm + 1000000*tid;
    int *yy = ym + 1000000*tid;

    #pragma omp for schedule(static,1) ordered
    for(i = 0; i < 100; i++)
    {
        yy[i] = i*i-3*i-10*random();
        xx[i] = myfunc(i, yy[i]);
    }

    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        int j;
        x[i] = 0;
        y[i] = 0;
        for (j = 0; j < nbth; j++)
        {
            x[i] += xm[j*1000000+i];
            y[i] += ym[j*1000000+i];
        }
    }
}
free(xm);
free(ym);
This will avoid the false sharing, but will increase the number of memory accesses and the overhead of parallelisation. So it might not be very beneficial after all. You'll have to see it for yourself in your actual code.
BTW, the fact that i only loops until 100 looks suspicious to me when the corresponding arrays are declared to be 1000000 long. If 100 is truly the correct size for the loop, then probably the parallelisation isn't worth it anyway...
EDIT:
As Jim Cownie pointed out in a comment, I missed the call to random() as a source of dependency across iterations, preventing proper parallelisation. I'm not sure how relevant this is in the context of your actual code (I doubt you truly fill your y array with random data), but in case you do, you'll have to change this part in order to do it in parallel (otherwise, the serialisation needed to generate the random number series will just kill whatever gain comes from parallelisation). But generating non-correlated pseudo-random series in parallel is not as simple as it sounds. You can use rand_r() instead of random() as a thread-safe alternative for the RNG and initialise its seed per thread to different values. However, you're not sure that one thread's series won't collide with another thread's too soon (with a thread starting to generate the very same series as another one after a while, messing up your expected asymptotic behaviour).
As I'm pretty sure you're not truly interested in that, I won't develop it any further here (it is a whole question all by itself), but I will just use the (not so good) rand_r() trick. If you want more details on a possible alternative for generating good parallel random series, just ask another question.
In the case where no problem comes from array2 (disjoint sets of indexes), the code would become:
// global variable
unsigned int seed;
#pragma omp threadprivate(seed)

// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread

// then the parallelised loop
#pragma omp parallel for
for(i = 0; i < 100; i++)
{
    y[i] = i*i-3*i-10*rand_r(&seed);
    x[i] = myfunc(i, y[i]);
}
Then the other case would have to use the same trick in addition to what has already been described. But again, keep in mind that this isn't good enough for serious RNG-based computation (like Monte Carlo methods). It does the job if all you want is to generate some values for testing purposes, but it won't pass any serious statistical quality test.

