I am trying to parallelize this for loop with OpenMP. I recognized that there is a reduction in the loop, so I added "#pragma omp parallel for reduction(+:ftab)", but it did not work and gave me this error:
error: user defined reduction not found for ‘ftab’.
#pragma omp parallel for reduction(+:ftab)
for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1];

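The operation you want is a prefix sum (an inclusive scan), not a reduction: every ftab[i] depends on the already updated ftab[i-1], so the iterations cannot simply be handed out to different threads. A prefix sum can still be computed in parallel. A simple way is to use thrust::inclusive_scan with the OpenMP or TBB backend.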
thrust::inclusive_scan(thrust::omp::par, ftab, ftab + 65536, ftab);
thrust::inclusive_scan(thrust::tbb::par, ftab, ftab + 65536, ftab);
You could also implement it by yourself as referenced in the Wikipedia page.
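If you want to try the do-it-yourself route, here is a minimal sketch of the usual two-pass blocked scan in plain C with OpenMP. It is not what Thrust does internally, just one common way to structure it. The function name prefix_sum and the element type long are my assumptions, since the question does not show how ftab is declared.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* In-place inclusive prefix sum: a[i] becomes a[0] + ... + a[i].
   Hypothetical helper; the element type is an assumption. */
void prefix_sum(long *a, size_t n)
{
    assert(a != NULL || n == 0);

    /* block_total[t + 1] will hold the sum of the block owned by thread t. */
    size_t slots = (size_t)omp_get_max_threads() + 1;
    long *block_total = calloc(slots, sizeof *block_total);
    if (block_total == NULL) {
        /* Out of memory: fall back to the plain serial scan. */
        for (size_t i = 1; i < n; i++) a[i] += a[i - 1];
        return;
    }

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int nt = omp_get_num_threads();
        size_t begin = n * (size_t)t / (size_t)nt;
        size_t end = n * (size_t)(t + 1) / (size_t)nt;

        /* Pass 1: every thread scans its own contiguous block. */
        long running = 0;
        for (size_t i = begin; i < end; i++) {
            running += a[i];
            a[i] = running;
        }
        block_total[t + 1] = running;

        /* Wait until all block totals are written. */
        #pragma omp barrier

        /* One thread turns the block totals into block offsets. */
        #pragma omp single
        for (int k = 1; k <= nt; k++)
            block_total[k] += block_total[k - 1];
        /* (implicit barrier at the end of single) */

        /* Pass 2: add the sum of all preceding blocks. */
        long offset = block_total[t];
        for (size_t i = begin; i < end; i++)
            a[i] += offset;
    }
    free(block_total);
}
```

For the loop in the question, which runs i from 1 to 65536 inclusive, the call would be prefix_sum(ftab, 65537). With only 65537 elements the serial loop may well be just as fast, so it is worth measuring both.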


Array operations in loop parallelization with OpenMP

I am trying to parallelize for loops which are based on array operations. However, I cannot get the expected speedup. I guess the way I parallelized them is wrong.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
}
#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0;i<nx;i++){
    next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i,j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1;i<nx-1;i++){
    for(j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
Is there a mistake in my approach? How can I improve this?
In the first example, the work done by each thread is very small, and the overhead from the OpenMP runtime negates any speedup from the parallel execution. You may try combining both parallel regions to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
    next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottlenecks are the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48 uses a global state that is shared between all threads. In single-threaded applications, the state is usually kept in the L1 data cache, and there drand48 is really fast. In your case, when one thread updates the state, the change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from memory (or from the shared L3 cache). This introduces huge delays and makes drand48 much slower than in a single-threaded program. The same applies to the summation in sum, which also computes the wrong value due to data races.
The solution to the first problem is to have a separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You also have to seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution to the data race is to use an OpenMP reduction:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1;i<nx-1;i++){
    for(int j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
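The snippet above still calls real_rand(), so it only fixes the data race on sum. A rough sketch of how both fixes could be combined is shown here, with erand48() replacing real_rand(). The function name fill_random, the base_seed parameter and the way the per-thread state is seeded are my assumptions, not something from the question. Seeding by thread number is the simplest way to get distinct streams, but it gives no statistical guarantee that the streams are uncorrelated.

```c
#define _XOPEN_SOURCE 600   /* makes erand48() visible in <stdlib.h> */
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Fills the interior cells of an nx-by-ny grid with 0 or 1 and returns
   the number of cells set to 1. Hypothetical helper for illustration. */
int fill_random(char **grid, int nx, int ny, double probability,
                unsigned short base_seed)
{
    assert(grid != NULL && probability >= 0.0 && probability <= 1.0);

    int sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        /* Private 48-bit generator state for this thread. The thread
           number goes into the high word so every thread starts from
           a different state (an assumed, very simple seeding scheme). */
        unsigned short xsubi[3] = {
            0x330E, base_seed, (unsigned short)omp_get_thread_num()
        };

        #pragma omp for schedule(static)
        for (int i = 1; i < nx - 1; i++) {
            for (int j = 1; j < ny - 1; j++) {
                grid[i][j] = (erand48(xsubi) < probability);
                sum += grid[i][j];   /* private copy, combined at the end */
            }
        }
    }
    return sum;
}
```

Because each thread owns its generator state, no cache line bounces between cores on every random number, and the reduction gives each thread a private sum that is combined once at the end of the region. The generated pattern depends on the number of threads, so do not expect identical grids across runs with different thread counts.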

How to avoid overhead of OpenMP in nested loops

I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
    sum = 0;
    #pragma omp parallel for private(j) reduction(+:sum)
    for(j=0; j<1000; j++){
        sum += 1;
    }
    final += sum;
}
With this output and runtime:
real 0m5.847s
user 0m5.628s
sys 0m0.212s
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
    sum = 0;
    #pragma omp barrier
    #pragma omp for reduction(+:sum)
    for(j=0; j<1000; j++){
        sum += 1;
    }
    #pragma omp single
    final += sum;
}
With this output and runtime:
real 0m5.476s
user 0m4.964s
sys 0m0.504s
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?
An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
You appear to be retracing the steps of Amdahl's Law, which weighs the parallel part of a program against its serial part and its overhead. Amdahl's observation was that no matter how much parallelism you put into a program, the speedup stays bounded by the portion that remains serial. Parallelism only starts to improve run time when each parallel section has enough work to outweigh the cost of coordinating the extra threads.
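To illustrate the "enough work" point: in both versions above the threads synchronize on every one of the 2000000 outer iterations while doing only 1000 additions in between. If the outer iterations are independent, as they are in this toy example (that is an assumption about the real code), one option is to parallelize the outer loop instead, so the threads meet only once at the end. The function name nested_sum is hypothetical.

```c
#include <assert.h>

/* Same arithmetic as the loops in the question, but with the
   work-sharing moved to the outer loop. Hypothetical helper. */
long nested_sum(long outer, long inner)
{
    assert(outer >= 0 && inner >= 0);

    long final = 0;
    #pragma omp parallel for reduction(+:final) schedule(static)
    for (long i = 0; i < outer; i++) {
        long sum = 0;                 /* private: declared inside the loop */
        for (long j = 0; j < inner; j++) {
            sum += 1;
        }
        final += sum;                 /* combined once per thread at the end */
    }
    return final;
}
```

Here each thread runs a large chunk of outer iterations without any barrier in between, so the fixed cost of the parallel region is paid once rather than 2000000 times.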

How to avoid fork-join when calling cblas_sgemm in MKL?

The code is like this:
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
When the matrix is not very large, the fork-join cost is very noticeable, especially when this is run on MIC. Besides, splitting the work by hand causes problems on MIC, as MKL Performance on Intel Phi shows.
//separate the left and result matrix by hand.
//not a wise solution on MIC
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
Is there a technique that would let me write code like this:
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
so that cblas_sgemm uses the threads already forked outside the for loop, given that MKL also uses OpenMP to create threads?
Sincerely, FatRabb1t.
You could do that by linking the sequential version of MKL, so that cblas_sgemm will not fork multiple threads to calculate the matrix.
On the other hand, you could use an OpenMP parallel for to speed up your code.
#pragma omp parallel for
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
This way, you fork-join the threads only once instead of loop_count times.
If you are using Intel compiler icc/icpc, you could link the sequential MKL with the compiler option -mkl=sequential instead of -mkl.
If you are using other compilers such as gcc, you could use MKL link line advisor to help you generate the desired link line options.

OpenMP average of an array

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
    double mean = 0;
    omp_set_num_threads( 4 );
    #pragma omp parallel for reduction(+:mean)
    for (int i=0; i<aSize; i++){
        mean = mean + mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
    mean = mean/aSize;
    return mean;
}
However if I run the code it runs slower than the sequential version. Also for the print statement I get:
hello 0
hello 0
Which doesn't make much sense to me, shouldn't there be 4 hellos?
Any help would be appreciated.
First, the reason you are not seeing 4 "hello"s is that the only part of the program executed in parallel is the so-called parallel region enclosed by #pragma omp parallel. In your code that is just the loop (since the omp parallel directive is attached to the for statement); the printf is in the sequential part of the program.
Rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp for reduction(+:mean)
    for (int i=0; i<aSize; i++) {
        mean += mean_array[i];
    }
    // mean is shared here, so only one thread may divide it
    #pragma omp single
    mean /= aSize;
    printf("hello %d\n", omp_get_thread_num());
}
Second, the fact that your program runs slower than the sequential version can depend on multiple factors. First of all, make sure the array is large enough that the overhead of creating the threads (which usually happens when the parallel region is created) is negligible. Also, for small arrays you may run into "cache false sharing" issues, in which threads compete for the same cache line and cause performance degradation.

Error while working with OpenMP: Parallel reduction calculation is invalid

While working with C and OpenMP to use parallel processing on a set of data I keep getting the following errors with my for loops.
Parallel reduction calculation is invalid!
Parallel atomic calculation is invalid!
The code is:
#pragma omp parallel for num_threads(numberOfThreads \
reduction(+:number_in_circle) shared(count)
for(count = 0; count < iterations; count++)
//calculate number in circle
# pragma omp parallel for num_threads(numberOfThreads) private(x, y,\
dist_sqrd) shared(count, number_in_circle, iterations)
for(count = 0; count < iterations; count++)
//calculate number_in_circle using atomic instruction to add to it.
Is there something wrong with my syntax or is it something wrong with the loop itself?
I'm not sure your copy of the OpenMP directives is 100% accurate, but there are definitely issues with the ones here:
#pragma omp parallel for num_threads(numberOfThreads \
reduction(+:number_in_circle) shared(count)
for(count = 0; count < iterations; count++)
num_threads(numberOfThreads is missing its closing parenthesis.
shared(count) is invalid since count is the index of the for loop you want to parallelise. Trying to declare such an index shared is both pointless and explicitly forbidden by the OpenMP standard.
This same remark goes for the second directive you cited.
Regarding the atomic and reduction clause errors, there isn't enough in your code snippet to give any advice.
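For comparison, here is a minimal sketch of how such a loop is commonly written with a reduction. I am assuming this is the usual Monte Carlo estimate of pi, which the variable names suggest but the question does not state. The function name estimate_pi, the use of erand48() and the seeding scheme are mine, not taken from your code.

```c
#define _XOPEN_SOURCE 600   /* makes erand48() visible in <stdlib.h> */
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Estimates pi by sampling points in the square [-1,1] x [-1,1] and
   counting how many land inside the unit circle. Hypothetical helper. */
double estimate_pi(long iterations, int number_of_threads)
{
    assert(iterations > 0 && number_of_threads > 0);

    long number_in_circle = 0;
    #pragma omp parallel num_threads(number_of_threads) \
                         reduction(+:number_in_circle)
    {
        /* Per-thread generator state (assumed, simple seeding scheme). */
        unsigned short state[3] = {
            0x330E, 0x1234, (unsigned short)omp_get_thread_num()
        };

        /* count is the loop index, so OpenMP makes it private on its
           own; it must not appear in a shared clause. */
        #pragma omp for
        for (long count = 0; count < iterations; count++) {
            double x = 2.0 * erand48(state) - 1.0;
            double y = 2.0 * erand48(state) - 1.0;
            double dist_sqrd = x * x + y * y;   /* private: declared in loop */
            if (dist_sqrd <= 1.0) {
                number_in_circle++;             /* private copy per thread */
            }
        }
    }
    return 4.0 * (double)number_in_circle / (double)iterations;
}
```

Declaring x, y and dist_sqrd inside the loop body makes them private without a private clause, and the reduction removes the need for an atomic update on number_in_circle.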
