OpenMP - dot product - C

I am implementing a parallel dot product in OpenMP. I have this code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define SIZE 1000

int main (int argc, char *argv[]) {
    float u[SIZE], v[SIZE], dp, dpp;
    int i, j, tid;
    dp = 0.0;

    for (i = 0; i < SIZE; i++) {
        u[i] = 1.0*(i+1);
        v[i] = 1.0*(i+2);
    }

    printf("\n values of u and v:\n");
    for (i = 0; i < SIZE; i++) {
        printf(" u[%d]= %.1f\t v[%d]= %.1f\n", i, u[i], i, v[i]);
    }

    #pragma omp parallel shared(u,v,dp,dpp) private(tid,i)
    {
        tid = omp_get_thread_num();
        #pragma omp for private(i)
        for (i = 0; i < SIZE; i++) {
            dpp += u[i]*v[i];
            printf("thread: %d\n", tid);
        }
        #pragma omp critical
        {
            dp = dpp;
            printf("thread %d\n", tid);
        }
    }
    printf("\n dot product is %f\n", dp);
}
I am compiling it with:
pgcc -B -Mconcur -Minfo -o prog prog.c
and the output I get in the console is:
33, Loop not parallelized: innermost
39, Loop not vectorized/parallelized: contains call
48, Loop not vectorized/parallelized: contains call
What am I doing wrong? From my point of view, everything looks OK.

First of all, a simple 1,000-element dot product does not have enough computational cost to justify multi-threading: you will pay far more in communication and synchronization overhead than you will gain in performance, so it is not worth it.
Secondly, it looks like you are computing the full dot product in each thread rather than dividing the computation across the threads and combining the results at the end.
Here is an example of how to do vector dot products from https://computing.llnl.gov/tutorials/openMP/#SHARED
#include <stdio.h>
#include <omp.h>

int main() {
    int i, n, chunk;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100;
    chunk = 10;
    result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    #pragma omp parallel for \
        default(shared) private(i) \
        schedule(static,chunk) \
        reduction(+:result)
    for (i = 0; i < n; i++)
        result += (a[i] * b[i]);

    printf("Final result= %f\n", result);
    return 0;
}
Basically, OpenMP is good for doing coarse-grained parallelism when you have large, expensive loops. In general, when you are doing parallel programming, the larger the "chunks" of computation you can do before re-synchronizing, the better. Especially as the number of cores grows, the communication and synchronization costs will grow. Pretend that each synchronization (grabbing a new index or chunk of indexes to execute, entering a critical section, etc.) costs you 10ms, or 1M instructions to get a better idea of when/where/how to parallelize your code.
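To put the question's numbers in those terms: with SIZE set to 1000, the whole dot product is only about 2,000 floating-point operations (1,000 multiplies and 1,000 adds), so under that assumed 1M-instruction cost even a single synchronization dwarfs all of the useful work by several orders of magnitude.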

The problem is still the same as in your latest question. You are accumulating values in a variable, and you must tell OpenMP how to do that:
#pragma omp for reduction(+: dpp)
for (size_t i = 0; i < SIZE; i++) {
    dpp += u[i]*v[i];
}
Use a loop-local variable for the index and that is all you need; forget about the rest of the machinery you have around it. If you want to see what the compiler is doing with your code, compile with -S and check the assembler output. This can be very instructive, because you then learn what simple statements like that amount to when they are parallelized.
And don't use int for loop indices: sizes and counts of that kind are size_t.
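Putting the two answers together, a minimal corrected version of the whole program might look like the sketch below (untested; compile with OpenMP enabled, e.g. gcc -fopenmp prog.c -o prog):
#include <stdio.h>

#define SIZE 1000

int main(void) {
    float u[SIZE], v[SIZE];
    float dp = 0.0f;                 /* single shared accumulator */

    for (size_t i = 0; i < SIZE; i++) {
        u[i] = 1.0f*(i+1);
        v[i] = 1.0f*(i+2);
    }

    /* reduction(+:dp) gives each thread a private partial sum and
       combines the partial sums once at the end of the loop */
    #pragma omp parallel for reduction(+:dp)
    for (size_t i = 0; i < SIZE; i++)
        dp += u[i]*v[i];

    printf("dot product is %f\n", dp);
    return 0;
}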

Related

Reordered output despite critical section

I'm trying to adapt this Pascal's triangle program into a parallel program using OpenMP. I used the for directive to parallelize the printPas function's for loop, and put the conditional statements inside a critical section so only one thread can print at a time, but it seems like I'm still getting a data race because my output is really inconsistent.
#include <stdio.h>

#ifndef N
#define N 2
#endif

unsigned int t1[2*N+1], t2[2*N+1];
unsigned int *e = t1, *r = t2;
int l = 0;

//the problem is here in this function
void printPas() {
    #pragma omp parallel for private(l)
    for (l = 0; l < 2*N+1; l++) {
        #pragma omp critical
        if (e[l] == 0)
            printf(" ");
        else
            printf("%6u", e[l]);
    }
    printf("\n");
}

void update() {
    r[0] = e[1];
    #pragma omp parallel for
    for (int u = 1; u < 2*N; u++)
        r[u] = e[u-1] + e[u+1];
    r[2*N] = e[2*N-1];
    unsigned int *tmp = e; e = r; r = tmp;
}

int main() {
    e[N] = 1;
    for (int i = 0; i < N; i++) {
        printPas();
        update();
    }
    printPas();
}
Your critical section is causing the prints to run sequentially. Therefore, the code takes longer using 'critical' than it would if you didn't attempt to parallelise it.
Using different threads to print, you have no idea which one will access the critical section first. Therefore, the for-loop will not execute in the order that you would hope.
I suggest either removing the parallel directive ("#pragma omp parallel for private(l)"), or removing the 'critical' and accepting that the prints will come out in a different order every time.
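For illustration, the first suggestion amounts to nothing more than dropping the directive, so a single thread prints the row in order (a minimal sketch of the unparallelised printPas):
void printPas() {
    /* no "parallel for" here: the loop is tiny and the printing has to
       happen in order anyway, so there is nothing to gain from threading it */
    for (int i = 0; i < 2*N+1; i++) {
        if (e[i] == 0)
            printf(" ");
        else
            printf("%6u", e[i]);
    }
    printf("\n");
}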

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a pi approximation with OpenMP. I think my code works fine, with convincing output, and I am currently running it with 4 threads. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the threads in this code?
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

double sample_interval(double a, double b) {
    double x = ((double) rand())/((double) RAND_MAX);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] );   // convert command-line input to N = number of points
    int i;
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
    {
        srand(time(NULL));
        for (int i = 0; i < N; ++i)
        {
            x = sample_interval(-1., 1.);
            y = sample_interval(-1., 1.);
            z = ((x*x)+(y*y));
            if (z <= 1)
            {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter / (double)N;
    printf("%i %1.6e %1.6e\n ", N, 4.0 * counter / (double)N, fabs(4.0 * counter / (double)N - pi) / pi);
    return 0;
}
Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
which looks OK for now. Your suggestions on the race condition and the seed placement are most welcome.
There are a few problems in your code that I can see. The main one, from my standpoint, is that it isn't parallelized. Or more precisely, you didn't enable the OpenMP parallelism you introduced when compiling it. Here is how one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here: no #pragma omp parallel for, only a #pragma omp parallel). Therefore, since you set the number of threads to 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time to get a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for, for example; a sketch of that variant follows the compile command below).
But that is if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to use another random number generator, one that allows you to privatize its state. That could be rand_r() for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly exceed the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want your sequences in parallel to be uncorrelated between each-other. And just using a different seed value per thread doesn't guaranty that to you (although with a wide-enough RNG, you have a bit of headroom for that)
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per thread, based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the more often you call the function, the more likely you are to eventually have overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

typedef struct drand48_data RNGstate;

double sample_interval(double a, double b, RNGstate *state) {
    double x;
    drand48_r(state, &x);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] );   // convert command-line input to N = number of points
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;
    time_t ctime = time(NULL);

    #pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
    {
        RNGstate state;
        srand48_r(ctime + omp_get_thread_num(), &state);
        for (int i = 0; i < N; ++i) {
            x = sample_interval(-1., 1., &state);
            y = sample_interval(-1., 1., &state);
            z = ((x*x)+(y*y));
            if (z <= 1) {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter / (NumThreads * N);
    printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
    return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp
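For completeness, here is a rough sketch of the second option from the list above: keep N as the total number of samples by distributing the iterations with a worksharing loop, so the result converges to Pi without dividing by NumThreads (same RNG setup and variables as in the code above; untested):
#pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
{
    RNGstate state;
    srand48_r(ctime + omp_get_thread_num(), &state);
    /* the worksharing "for" splits the N iterations among the threads,
       so the total number of samples is N rather than NumThreads*N */
    #pragma omp for
    for (int i = 0; i < N; ++i) {
        x = sample_interval(-1., 1., &state);
        y = sample_interval(-1., 1., &state);
        z = (x*x) + (y*y);
        if (z <= 1) {
            counter++;
        }
    }
}
double approx_pi = 4.0 * counter / (double)N;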

Reductions in parallel in logarithmic time

Given n partial sums it's possible to sum all the partial sums in log2(n) parallel steps. For example, assume there are eight threads with eight partial sums: s0, s1, s2, s3, s4, s5, s6, s7. This could be reduced in log2(8) = 3 sequential steps like this:
thread0     thread1     thread2     thread3
s0 += s1    s2 += s3    s4 += s5    s6 += s7
s0 += s2    s4 += s6
s0 += s4
I would like to do this with OpenMP but I don't want to use OpenMP's reduction clause. I have come up with a solution but I think a better solution can be found maybe using OpenMP's task clause.
This is more general than scalar addition. Let me choose a more useful case: an array reduction (see here, here, and here for more about array reductions).
Let's say I want to do an array reduction on an array a. Here is some code which fills private arrays in parallel for each thread.
int bins = 20;
int a[bins];
int **at;   // array of pointers to arrays
for (int i = 0; i < bins; i++) a[i] = 0;

#pragma omp parallel
{
    #pragma omp single
    at = (int**)malloc(sizeof *at * omp_get_num_threads());
    at[omp_get_thread_num()] = (int*)malloc(sizeof **at * bins);
    int a_private[bins];
    //arbitrary function to fill the arrays for each thread
    for (int i = 0; i < bins; i++) at[omp_get_thread_num()][i] = i + omp_get_thread_num();
}
At this point I have an array of pointers to arrays, one per thread. Now I want to add all these arrays together and write the final sum to a. Here is the solution I came up with.
#pragma omp parallel
{
    int n = omp_get_num_threads();
    for (int m = 1; n > 1; m *= 2) {
        int c = n%2;
        n /= 2;
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int *p1 = at[2*i*m], *p2 = at[2*i*m+m];
            for (int j = 0; j < bins; j++) p1[j] += p2[j];
        }
        n += c;
    }
    #pragma omp single
    memcpy(a, at[0], sizeof *a * bins);
    free(at[omp_get_thread_num()]);
    #pragma omp single
    free(at);
}
Let me try and explain what this code does. Let's assume there are eight threads. Let's define the += operator to mean a sum over the array, e.g. s0 += s1 is
for(int i=0; i<bins; i++) s0[i] += s1[i]
then this code would do
n   thread0     thread1     thread2     thread3
4   s0 += s1    s2 += s3    s4 += s5    s6 += s7
2   s0 += s2    s4 += s6
1   s0 += s4
But this code is not as ideal as I would like.
One problem is that there are a few implicit barriers which require all the threads to sync. These barriers should not be necessary. The first barrier is between filling the arrays and doing the reduction. The second barrier is in the #pragma omp for declaration in the reduction. But I can't use the nowait clause with this method to remove the barrier.
Another problem is that several threads don't need to be used at all. For example, with eight threads the first step of the reduction only needs four threads, the second step two threads, and the last step only one thread. However, this method involves all eight threads in the reduction. Although the other threads don't do much anyway and should go right to the barrier and wait, so it's probably not much of an issue.
My instinct is that a better method can be found using the omp task clause. Unfortunately I have little experience with the task clause, and all my efforts so far to do a reduction better than what I have now have failed.
Can someone suggest a better solution to do the reduction in logarithmic time using e.g. OpenMP's task clause?
I found a method which solves the barrier problem and reduces asynchronously. The only remaining problem is that it still puts threads which don't participate in the reduction into a busy loop. This method uses something like a stack: pointers are pushed to the stack (but never popped) in critical sections (this was one of the keys, as critical sections don't have implicit barriers). The stack is operated on serially, but the reduction runs in parallel.
Here is a working example.
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void foo6() {
    int nthreads = 13;
    omp_set_num_threads(nthreads);
    int bins = 21;
    int a[bins];
    int **at;
    int m = 0;
    int nsums = 0;
    for (int i = 0; i < bins; i++) a[i] = 0;

    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int ithread = omp_get_thread_num();
        #pragma omp single
        at = (int**)malloc(sizeof *at * n * 2);
        int* a_private = (int*)malloc(sizeof *a_private * bins);

        //arbitrary fill function
        for (int i = 0; i < bins; i++) a_private[i] = i + omp_get_thread_num();

        #pragma omp critical (stack_section)
        at[nsums++] = a_private;

        while (nsums < 2*n-2) {
            int *p1, *p2;
            char pop = 0;
            #pragma omp critical (stack_section)
            if ((nsums-m) > 1) p1 = at[m], p2 = at[m+1], m += 2, pop = 1;
            if (pop) {
                for (int i = 0; i < bins; i++) p1[i] += p2[i];
                #pragma omp critical (stack_section)
                at[nsums++] = p1;
            }
        }

        #pragma omp barrier
        #pragma omp single
        memcpy(a, at[2*n-2], sizeof **at * bins);
        free(a_private);
        #pragma omp single
        free(at);
    }
    for (int i = 0; i < bins; i++) printf("%d ", a[i]); puts("");
    for (int i = 0; i < bins; i++) printf("%d ", (nthreads-1)*nthreads/2 + nthreads*i); puts("");
}

int main(void) {
    foo6();
}
int main(void) {
foo6();
}
I still feel a better method may be found using tasks, one which does not put the unused threads into a busy loop.
Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.
#include <assert.h>   /* for assert() used below */

void operation(int* p1, int* p2, size_t bins)
{
    for (size_t i = 0; i < bins; i++)
        p1[i] += p2[i];
}

void reduce(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1) {
        return;
    }
    int pivot = (begin + end) / 2;
    /* Moving the termination condition here will avoid very short tasks,
     * but make the code less nice. */
    #pragma omp task
    reduce(arrs, bins, begin, pivot);
    #pragma omp task
    reduce(arrs, bins, pivot, end);
    #pragma omp taskwait
    /* now begin and pivot contain the partial sums. */
    operation(arrs[begin], arrs[pivot], bins);
}

/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);
As far as I can tell, there are no unnecessary synchronizations and there is no weird polling on critical sections. It also works naturally with a data size different from your number of ranks. I find it very clean and easy to understand. So I do indeed think this is better than both of your solutions.
But let's look at how it performs in practice*. For that we can use Score-p and Vampir:
*bins=10000 so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo. gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing
The picture reveals what is happening on any thread within the application, along a horizontal time axis. The three implementations, from top to bottom:
omp for loop
omp critical "kind of tasking"
omp task
This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: does the core have the data in its own cache / in the memory of its own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in this simple comparison.
If the reduction operation becomes variable in runtime, then the task solution will become better than the synchronized for loop.
The critical solution has one interesting aspect: the passive threads are not waiting idly, they keep polling, so they are more likely to consume CPU resources. This can be bad for performance, e.g. in the case of turbo mode.
Remember that the task solution has more optimization potential, by avoiding spawning tasks that immediately return. How these solutions perform also depends highly on the specific OpenMP runtime. Intel's runtime seems to do much worse for tasks.
My recommendation is:
Implement the most maintainable solution with optimal algorithmic complexity
Measure which parts of the code actually matter for run-time
Analyze, based on actual measurements, what the bottleneck is. In my experience it is more about NUMA and scheduling than about some unnecessary barrier.
Perform the micro-optimizations based on your actual measurements
Linear solution
Here is the timeline for the linear proccess_data_v1 from this question.
OpenMP 4 Reduction
So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I do initialize the worker array with NULL and simply move the pointer the first time:
void meta_op(int** pp1, int* p2, size_t bins)
{
    if (*pp1 == NULL) {
        *pp1 = p2;
        return;
    }
    operation(*pp1, p2, bins);
}

// ...

// declare before parallel region as global
int* awork = NULL;

#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)

#pragma omp for reduction(merge : awork)
for (int t = 0; t < n; t++) {
    meta_op(&awork, at[t], bins);
}
Surprisingly, this doesn't look too good:
top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.
Both seem to implement the reduction serialized. I tried to look into gcc / libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to be wrapping the final merge in a GOMP_atomic_start/end - and that seems to be a global mutex. Similarly icc wraps the call to the operation in a kmpc_critical. I suppose there wasn't much optimization going into costly custom reduction operations. A traditional reduction can be done with a hardware-supported atomic operation.
Notice how each individual operation is faster because the input is cached locally, but due to the serialization the whole reduction is slower overall. Again, this is not a perfect comparison due to high variances, and the earlier screenshots were taken with a different gcc version. But the trend is clear, and I also have data on the cache effects.

Why does the compiler ignore OpenMP pragmas?

In the following C code I am using OpenMP in a nested loop. Since a race condition occurs, I want to perform an atomic operation at the end:
double mysumallatomic() {
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for (int a = 0; a < 128; a++) {
        for (int b = 0; b < 128; b++) {
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}
The thing is that #pragma omp atomic has no effect on the program's behaviour: even if I remove it, nothing changes. Even if I change it to #pragma oh_my_god, I get no error!
I wonder what is going wrong here, whether I can tell the compiler to be stricter when checking omp pragmas, and why I do not get an error when I make that last change.
PS: For compilation I use:
gcc-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
PS2: New code that gives me the same problem, based on gillespie's idea:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

#define NRACK 64
#define NSTARS 1024

double mysumallatomic_serial(float rocks[NRACK][3], float moon[NSTARS][3],
                             float qr[NRACK], float ql[NSTARS]) {
    int j, i;
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    for (j = 0; j < NRACK; j++) {
        for (i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            S2 += ql[i]*temp_div*qr[j];
        }
    }
    return S2;
}

double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3],
                      float qr[NRACK], float ql[NSTARS]) {
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            float myterm = ql[i]*temp_div*qr[j];
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

int main(int argc, char *argv[]) {
    float rocks[NRACK][3], moon[NSTARS][3];
    float qr[NRACK], ql[NSTARS];
    int i, j;
    for (j = 0; j < NRACK; j++) {
        rocks[j][0] = j;
        rocks[j][1] = j+1;
        rocks[j][2] = j+2;
        qr[j] = j*1e-4 + 1e-3;
        //qr[j] = 1;
    }
    for (i = 0; i < NSTARS; i++) {
        moon[i][0] = 12000 + i;
        moon[i][1] = 12000 + i + 1;
        moon[i][2] = 12000 + i + 2;
        ql[i] = i*1e-3 + 1e-2;
        //ql[i] = 1;
    }
    printf(" serial: %f\n", mysumallatomic_serial(rocks, moon, qr, ql));
    printf(" openmp: %f\n", mysumallatomic(rocks, moon, qr, ql));
    return(0);
}
Using the flag -Wall highlights pragma errors. For example, when I misspell atomic I get the following warning.
main.c:15: warning: ignoring #pragma omp atomic1
I'm sure you know, but just in case: your example should be handled with a reduction.
When you use omp parallel, the default is for all variables to be shared. This is not what you want in your case: for example, each thread needs its own value of difx. Instead, your loop should be:
#pragma omp parallel for default(none),\
    private(difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, temp_div, i, j),\
    shared(rocks, moon, ql, qr), reduction(+:S2)
for (j = 0; j < NRACK; j++) {
    for (i = 0; i < NSTARS; i++) {
        difx = rocks[j][0] - moon[i][0];
        dify = rocks[j][1] - moon[i][1];
        difz = rocks[j][2] - moon[i][2];
        mod2x = difx*difx;
        mod2y = dify*dify;
        mod2z = difz*difz;
        temp_sqrt = sqrt(mod2x + mod2y + mod2z);
        temp_div = 1/temp_sqrt;
        S2 += ql[i]*temp_div*qr[j];
    }
}
I know this is an old post, but I think the problem is the order of the parameters to gcc: -fopenmp should be at the end of the compilation line.
First, depending on the implementation, reduction might be better than using atomic. I would try both and time them to see for sure.
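For example, a minimal way to time the two variants could use omp_get_wtime(); here mysumallreduction is a hypothetical function containing the reduction version of the loop, and the arrays are those from the PS2 code in the question:
double t0 = omp_get_wtime();
double s1 = mysumallatomic(rocks, moon, qr, ql);      /* atomic version from the question */
double t1 = omp_get_wtime();
double s2 = mysumallreduction(rocks, moon, qr, ql);   /* hypothetical reduction version */
double t2 = omp_get_wtime();
printf("atomic:    %f s (S2 = %f)\n", t1 - t0, s1);
printf("reduction: %f s (S2 = %f)\n", t2 - t1, s2);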
Second, if you leave off the atomic, you may or may not see the problem (wrong result) associated with the race. It is all about timing, which from one run to the next can be quite different. I have seen cases where the result was wrong only once in 150,000 runs and others where it has been wrong all the time.
Third, the idea behind pragmas is that the user doesn't need to know about them if they have no effect. Besides that, the philosophy in Unix (and its derivatives) is to be quiet unless there is a problem. That said, many implementations have some sort of flag so the user can get more information about what is happening. You can try -Wall with gcc; at the least it should flag the oh_my_god pragma as being ignored.
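With the compile line from the question, that would be something like:
gcc-4.2 -fopenmp -Wall main.c functions.c -o main_elec_gcc.exe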
You have
#pragma omp parallel for shared(S2)
for(int a=0; a<128; a++){
....
So the parallelization applies only to that for loop. If you want to have the atomic or a reduction, you have to do:
#pragma omp parallel
{
    #pragma omp for shared(S2)
    for (int a = 0; a < 128; a++) {
        for (int b = 0; b < 128; b++) {
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }  // end of second for
    }      // end of 1st for
}          // end of parallel code
return S2;
}          // end of function
Otherwise everything after the # will be treated as a comment.

How to use omp barrier in a while loop with an unequal number of iterations per thread

I'm trying to implement the list-ranking problem (also known as shortcutting) with OpenMP to compute the prefix sums of the array W.
I don't know whether I am using the flush pragma correctly.
And I get a warning when compiling: "barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

main(int argc, char *argv[])
{
    int Q[9] = {1,2,3,4,5,6,7,8,0};
    int W[8] = {1,2,3,4,5,6,7,8};
    int i, j = 6, id;

    printf("Before:\n");
    for (j = 0; j < 8; j++)
        printf("%d", W[j]);
    printf("\n");

    #pragma omp parallel for shared(Q,W) private(id) num_threads(7)
    for (i = 6; i >= 0; i--)
    {
        id = omp_get_thread_num();
        while ((Q[i] != 0) && (Q[Q[i]] != 0))
        {
            #pragma omp flush(W)
            W[i] = W[i] + W[Q[i]];
            #pragma omp flush(W)
            printf("Am %d \t W[%d]= %d", id, i, W[i]);
            #pragma omp barrier
            #pragma omp flush(Q)
            Q[i] = Q[Q[i]];
            #pragma omp flush(Q)
            printf("Am %d \n Q[%d]= %d", id, i, Q[i]);
        };
    }

    printf("Result:\n");
    for (j = 0; j < 8; j++)
        printf("%d \t", W[j]);
    printf("\n");
}
Please help!
You can't use a barrier inside an omp parallel for; you can pretty much only use a barrier inside a plain omp parallel region.
The reason is that in a loop from 1 to N the iterations are divided among the threads, so not every thread is guaranteed to reach a barrier placed inside the loop body the same number of times, and forcing a synchronization on every iteration would have a negative performance impact if N is large anyway.
I didn't look up the algorithm here, but two reasonable choices are to refactor the code to use two parallel for loops, one after the other where the barrier currently is, or to refactor your algorithm to use a single #pragma omp parallel region; a sketch of the first option follows.
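As a rough sketch of that first option (two worksharing loops per pointer-jumping round, so the implicit barrier at the end of each omp for replaces the explicit barrier and the flushes), with Wnew and Qnew as temporaries introduced here; it mirrors the update rule from the question but is only a sketch, not a verified list-ranking implementation:
int Wnew[8], Qnew[9];   /* temporaries holding each round's new values */
#pragma omp parallel shared(Q, W, Wnew, Qnew)
for (int step = 0; step < 3; step++) {   /* 3 rounds cover 8 elements; extra rounds are no-ops */
    #pragma omp for
    for (int i = 0; i <= 6; i++) {       /* read only the old W and Q */
        if (Q[i] != 0 && Q[Q[i]] != 0) {
            Wnew[i] = W[i] + W[Q[i]];
            Qnew[i] = Q[Q[i]];
        } else {
            Wnew[i] = W[i];
            Qnew[i] = Q[i];
        }
    }                                    /* implicit barrier */
    #pragma omp for
    for (int i = 0; i <= 6; i++) {       /* publish the new values */
        W[i] = Wnew[i];
        Q[i] = Qnew[i];
    }                                    /* implicit barrier */
}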
I looked up the list ranking algorithm; you will be well served by finding an implementation of prefix sum or scan if you must use OpenMP.
-Rick
