Error while working with OpenMP: Parallel reduction calculation is invalid - c

While working with C and OpenMP to use parallel processing on a set of data I keep getting the following errors with my for loops.
Parallel reduction calculation is invalid!
Parallel atomic calculation is invalid!
The code is:
#pragma omp parallel for num_threads(numberOfThreads \
reduction(+:number_in_circle) shared(count)
for(count = 0; count < iterations; count++)
//calculate number in circle
# pragma omp parallel for num_threads(numberOfThreads) private(x, y,\
dist_sqrd) shared(count, number_in_circle, iterations)
for(count = 0; count < iterations; count++)
//calculate number_in_circle using atomic instruction to add to it.
Is there something wrong with my syntax or is it something wrong with the loop itself?

I'm not sure your copy of the OpenMP directives is 100% correct but there are definitely issues on the ones here:
#pragma omp parallel for num_threads(numberOfThreads \
reduction(+:number_in_circle) shared(count)
for(count = 0; count < iterations; count++)
num_threads(numberOfThreads misses the closing parenthesis
shared(count) is invalid since count is the index of the for loop you want to parallelise. Trying to define such index private is both stupid and explicitly forbidden by the OpenMP standard
This same remark goes for the second directive you cited.
Regarding the atomic and reduction clause errors, there isn't enough in your code snippet to give any advice.

Related

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be ran serially which calls the new_value() function:
for(int i = 0; i < MAX_VAL; i++)
new_value();
This function opens a parallel region on each call:
void new_value()
{
#pragma omp parallel default(shared)
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
}
Which works but there is a significant amount of overhead associated with the spawning and terminating of threads; I was wondering if anyone knew a way to spawn the threads (and attain thread_rank) before entering the loop without parallelising the loop?
There are several questions asking the same thing but they are either wrong or unanswered, examples of which include:
This question which asks a similar thing and the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' said in the answer comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing but the (unticked) answer is just to parallelise the outer-most loop running the loop 4000 * num_threads which is neither what the asker wanted nor what I want.
The answer to your second question is actually correct.
#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
new_value();
void new_value()
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
Is correct and exactly what you want. It has the same semantic as the code in your question. The difference is there is only one parallel region and that the loop variable i is now computed by the whole team. Note that the outer loop is not parallelized in a worksharing manner (omp parallel for).
So when this code is run, num_threads threads will execute the loop header once new_value and reach the omp for all with their private i == 0. They will share the work of the inner loop. Then they will wait until everyone completed the loop at an implicit barrier, increment their private i and repeat... I hope it is clear now that this is the same behavior with respect to the inner loop as before, with less thread management overhead.

Iteration variable error in for loop: C program, OpenMP

I am trying to execute a code in C using OpenMP. The following is the code
#pragma omp parallel \
reduction(+:array[length])
{
int start = 1, distance, nthreads;
nthreads = omp_get_num_threads();
printf("%d\n", nthreads);
#pragma omp for
for (distance = 1; distance < length; distance = distance + distance)
{
for (i = length - 1; i >= start; i--)
{
array[i] = array[i] + array[i - distance];
}
start *=2;
}
}
The compiler is throwing the following error
**error**: increment expression refers to iteration variable ‘distance’
#pragma omp for
I tried to browse about this error online, but didn't find much. Any help in decoding the error would be useful.
Also, should the reduction clause be present on top top next to #pragma omp parallel \ or after #pragma omp for.
The OpenMP loop work-sharing construct requires a so called canonical loop form. You can only increment the loop variable by a loop-invariant value. You have to restructure your loop, e.g. through use of sqrt / <<. Also note that your use of start is not correct. Compute start from the loop iteration.

how to parallelize this for-loop using reduction?

I am trying to make this for-loop parallelized by using Openmp, i recognized that there reduction in this loop so i added "#pragma omp parallel for reduction(+,ftab)",but it did not work and it gave me this error :
error: user defined reduction not found for ‘ftab’.
#pragma omp parallel for reduction(+:ftab)
for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1];
The operation you want to do is prefix sum. It can be done in parallel. A simple way is to use thrust::inclusive_scan with OpenMP or TBB backend.
thrust::inclusive_scan(thrust::omp::par, ftab, ftab + 65536, fab);
or
thrust::inclusive_scan(thrust::tbb::par, ftab, ftab + 65536, fab);
You could also implement it by yourself as referenced in the Wikipedia page.

how to avoid overhead of openMP in nested loops

I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp parallel for private(j) reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp barrier
#pragma omp for reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
#pragma omp single
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?
An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
You appear to be retracing the steps of Amdahl's Law: It speaks of parallel process vs it's own overhead. One thing that Amadhl found was regardless of how much parallelism you put into a program, it will always have to same speedup to begin with. Parallelism only starts to improve run time/performance when the program requires enough work to compensate the extra processing power.

How to avoid fork-join when calling cblas_sgemm in MKL?

The code is like this:
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
When the matrix is not very large, the fork-join cost is very obvious, especially when this is run on MIC. Besides, separate the mission by hand will cause some problem on MIC as MKL Performance on Intel Phi shows.
//separate the left and result matrix by hand.
//not a wise solution on MIC
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
If there is a technique that I can use code:
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
where cblas_sgemm uses the threads forked out of the for loop since MKL also uses OpenMP to create threads.
Sincerely, FatRabb1t.
You could do that by linking the sequential version of MKL, so that cblas_sgemm will not fork multiple threads to calculate the matrix.
On ther other hand you could use OpenMP parallel for to speed up your code.
#pragma omp parallel for
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
By this way, you fork-join the threads only once instead of loop_count times.
If you are using Intel compiler icc/icpc, you could link the sequential MKL with the compiler option -mkl=sequential instead of -mkl.
If you are using other compilers such as gcc, you could use MKL link line advisor to help you generate the desired link line options.
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Resources