I have a problem parallelizing a for loop with OpenMP: the result of the parallel loop differs from that of the sequential loop. How can I parallelize this code so that it gives the same result as the sequential version?
const long nx = 20;
const long ny = 20;
const long nz = 20;
int i, j, k, a, v;
#pragma omp parallel private(tid_2, i,j,k,a,v) shared(numt_2,nx,ny,nz)
{
numt_2 = omp_get_num_threads();
tid_2 = omp_get_thread_num();
printf("Thread %d Total thread%d\n", tid_2, numt_2);
#pragma omp parallel for collapse(4) //num_threads(3)
for (i = 0; i <= nx; i++)
{
for (j = 0; j <= ny; j++)
{
for (k = 0; k <= nz; k++)
{
for (a = 0; a < 19; a++)
{
ff[fineindex(i, j, k, a)] = 0.0;
//#pragma omp barrier
for (v = 0; v < 19; v++)
{
ff[fineindex(i, j, k, a)] += Minv2[a][v] * rf[v];
}
}
}
}
}
}
Your outer parallel region executes the inner parallel for region once per thread. Assuming there are 8 cores on your machine (and therefore 8 threads by default), the whole loop nest is computed 8 times, whereas the sequential version runs it only once.
ff is implicitly shared between the threads of the parallel region, so there can be a data race on ff[fineindex(i, j, k, a)]. All 8 threads work on ff at the same time: one thread may reset an element to 0.0 while another is still accumulating into it, and when two threads write to the same index of ff the result is unpredictable.
To resolve this issue, use omp for instead of omp parallel for on the loops. omp for is a worksharing construct: it distributes the loop iterations among the threads of the enclosing parallel region and does not start another parallel region. This way, each thread of the outer parallel region handles different loop iterations.
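A minimal sketch of that fix, assuming ff holds 19 values per grid point and that fineindex flattens (i, j, k, a) in row-major order. The names fine_index, NQ and compute_ff and the array layout are assumptions made for illustration, not the asker's actual definitions. The loop variables are declared inside the loops so they are private automatically, and the sum is accumulated in a local variable before a single store to ff:

```c
#include <stddef.h>

enum { NX = 20, NY = 20, NZ = 20, NQ = 19 };

/* Hypothetical stand-in for fineindex(): row-major flattening. */
static size_t fine_index(int i, int j, int k, int a)
{
    return (((size_t)i * (NY + 1) + j) * (NZ + 1) + k) * NQ + a;
}

/* ff must have (NX+1)*(NY+1)*(NZ+1)*NQ elements. */
void compute_ff(double *ff, double Minv2[NQ][NQ], const double rf[NQ])
{
    #pragma omp parallel
    {
        /* Worksharing only: splits the iterations among the existing team,
           it does not open a second parallel region. */
        #pragma omp for collapse(4)
        for (int i = 0; i <= NX; i++)
            for (int j = 0; j <= NY; j++)
                for (int k = 0; k <= NZ; k++)
                    for (int a = 0; a < NQ; a++) {
                        double sum = 0.0; /* private to the iteration */
                        for (int v = 0; v < NQ; v++)
                            sum += Minv2[a][v] * rf[v];
                        ff[fine_index(i, j, k, a)] = sum;
                    }
    }
}
```

Since every (i, j, k, a) combination is handled by exactly one thread, no two threads write to the same element of ff.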
I'm trying to parallelize the following Radix Sort algorithm C code using OpenMP but I have some doubts about using the OpenMP clauses. In particular, there are some loops where I doubt that they can be parallelized at all.
Here is the code I'm working on:
unsigned getMax(size_t n, unsigned arr[n]) {
unsigned mx = arr[0];
unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++)
if (arr[i] > mx)
mx = arr[i];
return mx;
}
void countSort(size_t n, unsigned arr[n], unsigned exp) {
unsigned output[n]; // output array
int i, count[10] = { 0 };
// Store count of occurrences in count[]
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
#pragma omp atomic
count[(arr[i] / exp) % 10]++; }
for (i = 1; i < 10; i++)
count[i] += count[i - 1];
// Build the output array
#pragma omp parallel for private(i)
for (i = (int) n - 1; i >= 0; i--) {
#pragma omp atomic write
output[count[(arr[i] / exp) % 10] - 1] = arr[i];
count[(arr[i] / exp) % 10]--;
}
#pragma omp parallel for private(i)
for (i = 0; i < n; i++)
arr[i] = output[i];
}
// The main function that sorts arr[] of size n using Radix Sort
void radixsort(size_t n, unsigned arr[n], int threads) {
omp_set_num_threads(threads);
unsigned m = getMax(n, arr);
unsigned exp;
for (exp = 1; m / exp > 0; exp *= 10)
countSort(n, arr, exp);
}
In particular, I'm not sure if for loops like the following can be parallelized or not:
for (i = 1; i < 10; i++)
count[i] += count[i - 1];
#pragma omp parallel for private(i)
for (i = (int) n - 1; i >= 0; i--) {
#pragma omp atomic write
output[count[(arr[i] / exp) % 10] - 1] = arr[i];
count[(arr[i] / exp) % 10]--;
}
I'm asking for help on the specific OMP clauses I should use; other comments on the code shown are also welcome.
First of all, parallelizing code only pays off when there is a reasonable amount of work; otherwise the parallel overhead is bigger than the gain from parallelization. This is definitely the case in your example, since you create the output array on the stack (so it cannot be very big). Comments on your code:
Both loops you mention in your question depend on the order of execution, so they cannot be parallelized easily/efficiently. Note also that there is a race condition when the count array is accessed.
If you select a base which is a power of 2 (2^k), you can get rid of the expensive integer division and use fast bitwise/shift operators instead.
Always define your variables in their minimal required scope. So instead of
unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++) ....
the following code is preferred:
#pragma omp parallel for reduction(max:mx)
for (unsigned i = 1; i < n; i++) ....
To copy your array, memcpy can be used: memcpy(arr,output,n*sizeof(output[0]))
In this loop
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
#pragma omp atomic
count[(arr[i] / exp) % 10]++; }
you can use an array reduction (available since OpenMP 4.5) instead of the atomic operation:
#pragma omp parallel for private(i) reduction(+:count[:10])
for (i = 0; i < n; i++) {
count[(arr[i] / exp) % 10]++; }
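Putting these comments together, one counting pass could look roughly like the sketch below. It assumes base 256 (so a digit is extracted with a shift and a mask), a heap-allocated scratch buffer instead of a stack array, and a compiler that supports OpenMP 4.5 array reductions. The names count_sort_pass and radix_sort_unsigned are made up for this example. Only the histogram is parallel; the prefix sum and the scatter stay sequential because they depend on the order of execution.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum { RADIX_BITS = 8, RADIX = 1 << RADIX_BITS };

/* One stable counting-sort pass on the digit that starts at bit `shift`. */
static void count_sort_pass(size_t n, unsigned *arr, unsigned *tmp, unsigned shift)
{
    size_t count[RADIX] = { 0 };

    /* Histogram: each thread counts into a private copy, merged at the end. */
    #pragma omp parallel for reduction(+ : count[:RADIX])
    for (size_t i = 0; i < n; i++)
        count[(arr[i] >> shift) & (RADIX - 1)]++;

    /* Exclusive prefix sum: count[d] becomes the first output slot of digit d. */
    size_t start = 0;
    for (int d = 0; d < RADIX; d++) {
        size_t c = count[d];
        count[d] = start;
        start += c;
    }

    /* Stable scatter; order-dependent, so it is left sequential. */
    for (size_t i = 0; i < n; i++)
        tmp[count[(arr[i] >> shift) & (RADIX - 1)]++] = arr[i];

    memcpy(arr, tmp, n * sizeof arr[0]);
}

/* Sorts arr in ascending order; returns 0 on success, -1 if out of memory. */
int radix_sort_unsigned(size_t n, unsigned *arr)
{
    if (n < 2)
        return 0;
    unsigned *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (unsigned shift = 0; shift < sizeof(unsigned) * CHAR_BIT; shift += RADIX_BITS)
        count_sort_pass(n, arr, tmp, shift);
    free(tmp);
    return 0;
}
```

Unlike the original, this version always runs a fixed number of passes instead of stopping at the largest value, which keeps the sketch short.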
Radix sort can be parallelized if you split up the data. One way to do this is to use a most significant digit radix sort for the first pass, to create multiple logical bins. For example, if using base 256 (2^8), you end up with 256 bins, which radix sort can then sort in parallel, based on the number of logical cores on your system. With 4 cores, you can sort 4 bins at a time. This relies on having somewhat uniform distribution of the most significant digit, so that the bins are somewhat equal in size.
Trying to optimize the first pass may not help much, since you'll need atomic read/write operations to update a bin index, and the random-access writes to anywhere in the destination array will create cache conflicts.
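A rough sketch of this idea, assuming 32-bit unsigned keys and base 256. The first pass on the most significant byte is sequential; the resulting bins are disjoint ranges, so each one is then sorted independently with three least-significant-digit passes. The function names byte_pass and msd_parallel_radix_sort are invented for this example, and it makes no attempt to balance bins of very different sizes beyond using a dynamic schedule.

```c
#include <stdlib.h>

enum { BINS = 256 };

/* Stable counting pass from src[0..n) into dst[0..n) on the byte at `shift`. */
static void byte_pass(size_t n, const unsigned *src, unsigned *dst, unsigned shift)
{
    size_t pos[BINS] = { 0 };
    for (size_t i = 0; i < n; i++)
        pos[(src[i] >> shift) & 0xFF]++;
    size_t start = 0;
    for (int b = 0; b < BINS; b++) {
        size_t c = pos[b];
        pos[b] = start;
        start += c;
    }
    for (size_t i = 0; i < n; i++)
        dst[pos[(src[i] >> shift) & 0xFF]++] = src[i];
}

/* Assumes 32-bit unsigned keys. Returns 0 on success, -1 if out of memory. */
int msd_parallel_radix_sort(size_t n, unsigned *arr)
{
    unsigned *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* begin[b] = index of the first element whose top byte is b. */
    size_t begin[BINS + 1] = { 0 };
    for (size_t i = 0; i < n; i++)
        begin[((arr[i] >> 24) & 0xFF) + 1]++;
    for (int b = 0; b < BINS; b++)
        begin[b + 1] += begin[b];

    /* Sequential first pass on the most significant byte: arr -> tmp. */
    byte_pass(n, arr, tmp, 24);

    /* Bins do not overlap, so threads can sort them independently.
       Three passes on the lower bytes leave the data back in arr. */
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < BINS; b++) {
        size_t lo = begin[b], len = begin[b + 1] - begin[b];
        byte_pass(len, tmp + lo, arr + lo, 0);
        byte_pass(len, arr + lo, tmp + lo, 8);
        byte_pass(len, tmp + lo, arr + lo, 16);
    }

    free(tmp);
    return 0;
}
```

If the most significant byte is far from uniformly distributed (for example, all keys are small), nearly everything lands in one bin and the parallel loop has little to share out.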
I am very new to OpenMP, but am trying to write a simple program that generates the entries of a matrix in parallel, namely for the N by M matrix A, let A(i,j) = i*j. A minimal example is included below:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc,
char **argv)
{
int i, j, N, M;
N = 20;
M = 20;
int* A;
A = (int*) calloc(N*M, sizeof(int));
// compute entries of A in parallel
#pragma omp parallel for shared(A)
for (i = 0; i < N; ++i){
for (j = 0; j < M; ++j){
A[i*M + j] = i*j;
}
}
// print parallel results
for (i = 0; i < N; ++i){
for (j = 0; j < M; ++j){
printf("%d ", A[i*M + j]);
}
printf("\n");
}
free(A);
return 0;
}
The results are not always correct. In theory, I am only parallelizing the outer loop, and each iteration of the for loop does not modify the entries that the other iterations will modify. But I am not sure how to translate this to OpenMP. When doing a similar procedure for a vector array (i.e. just one for loop), there seems to be no issue, e.g.
#pragma omp parallel for
for (i = 0; i < N; ++i)
{
v[i] = i*i;
}
Can someone explain to me how to fix this?
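The issue in this case is that j is shared between threads, which messes with the control flow of the inner loop. By default, variables declared outside of a parallel region are shared, whereas variables declared inside of a parallel region are private. The loop variable i of the parallelized loop itself is made private automatically, which is why the version with a single loop works.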
Follow the general rule to declare variables as locally as possible. In the for loop this means:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
for (int j = 0; j < M; ++j) {
This makes reasoning about your code much easier - and OpenMP code mostly correct by default. (Note A is shared by default because it is defined outside).
Alternatively you can manually specify private(i,j) shared(A) - this is more explicit and can help beginners. However it creates redundancy and can also be dangerous: private variables are uninitialized even if they had a valid value outside of the parallel region. Therefore I strongly recommend the implicit default approach unless necessary for advanced usage.
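As a complete function, the locally declared variant could look like the sketch below; make_product_matrix is a name made up for this example.

```c
#include <stdlib.h>

/* Returns a newly allocated N*M row-major matrix with A[i*M + j] = i*j,
   or NULL on allocation failure. The caller frees it. */
int *make_product_matrix(int N, int M)
{
    int *A = malloc((size_t)N * M * sizeof *A);
    if (A == NULL)
        return NULL;

    /* i and j are declared in the loop headers, so every thread has its own
       copies; A is shared because it is declared outside the region. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            A[i * M + j] = i * j;
        }
    }
    return A;
}
```

Each row is written by exactly one thread, so no synchronization is needed.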
According to e.g. this
http://supercomputingblog.com/openmp/tutorial-parallel-for-loops-with-openmp/
The declaration of variables outside of a parallelized part is dangerous.
It can be defused by explicitly making the loop variable of the inner loop private.
For that, change this
#pragma omp parallel for shared(A)
to
#pragma omp parallel for private(j) shared(A)
I'm trying to optimize a program as an experiment.
When I parallelized the two outer loops (over "it" and "i"), I saw a significant difference in execution time. But when I tried to parallelize the innermost loop, the program became much slower than the sequential one. I also tried using a reduction, but the result was the same.
Is this something I should expect, or did I make a mistake in the parallelization?
When I use the "nowait" clause, it runs faster than the other two parallelizations.
#pragma omp parallel private(it,i,j) firstprivate(u,sigma,dt,mu)
{
for (it = 0; it < itime; it++) {
for (i = 0; i < n; i++) {
sum = 0.0;
#pragma omp for schedule(static)
for (j = 0; j < n; j+=1) {
sum += sigma[i * n + j] * (u[j] - u[i]);
}
#pragma omp atomic write
uplus[i]= (u[i] + dt * (mu - u[i])) + dt * sum / divide;
if (u[i] > uth) {
#pragma omp atomic write
uplus[i] = 0.0;
if (it >= ttransient) {
#pragma omp atomic
omega1[i] += 1.0;
}
}
}
}//omp end
Why is the parallel application taking more time to execute than the one with the single thread? I am using an 8 CPU computer with Ubuntu 14.04. The code is just my simple way to test omp parallel sections, the aim later is to run two different functions in two different threads at the same time, so I do not want to use #pragma omp parallel for.
parallel:
int main()
{
int k = 0;
int m = 0;
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
for( k = 0; k < 1e9; k++ ){};
}
#pragma omp section
{
for( m = 0; m < 1e9; m++ ){};
}
}
}
return 0;
}
and the single thread:
int main()
{
int m = 0;
int k = 0;
for( k = 0; k < 1e9; k++ ){};
for( m = 0; m < 1e9; m++ ){};
return 0;
}
If the compiler did not optimise the loops away, the parallel code would suffer from false sharing, because m and k are very likely to end up in the same cache line. Make the variables private:
#pragma omp parallel private(k,m)
{
#pragma omp sections
{
#pragma omp section
{
for( k = 0; k < 1e9; k++ ){};
}
#pragma omp section
{
for( m = 0; m < 1e9; m++ ){};
}
}
}
At high optimisation levels, the compiler could drop the loops altogether. But then the parallel version will still have the added overhead of spawning the OpenMP worker threads and joining them afterwards, which will make it slower than the sequential version.
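For the stated goal of running two different functions on two threads, a sketch along these lines keeps all loop state in local variables, so each section touches shared memory only once, when it stores its result. The two workload functions and run_both are hypothetical placeholders, not part of the original code.

```c
/* Hypothetical workloads; each keeps its loop state in local variables. */
static long sum_of_squares(long n)
{
    long s = 0;
    for (long k = 1; k <= n; k++)
        s += k * k;
    return s;
}

static long sum_of_odds(long n)
{
    long s = 0;
    for (long m = 1; m <= n; m++)
        s += 2 * m - 1;
    return s;
}

/* Runs both workloads, one per section, and stores each result once. */
void run_both(long n, long *squares, long *odds)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        *squares = sum_of_squares(n);

        #pragma omp section
        *odds = sum_of_odds(n);
    }
}
```

Because the results are returned to the caller, the work cannot simply be discarded as dead code, although an optimising compiler may still simplify loops this trivial.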
In the test code above, the compiler itself optimizes the loops away, so you need to change your test code. Creating threads also adds overhead, which depends on the number of threads you create.
Also refer to Amdahl's Law.
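For reference, Amdahl's law bounds the speedup when only part of the run time can be parallelized. In the notation chosen here, f is the parallelizable fraction of the run time, p is the number of threads, and S(p) is the speedup:

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}
```

For example, with f = 0.5 and p = 2 the best possible speedup is 1 / (0.5 + 0.25), roughly 1.33, before any thread creation overhead is counted.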
Disclaimer: the following example is just a dummy example to quickly convey the problem. For a real-world problem, think of anything in dynamic programming.
The problem:
We have an n*m matrix, and we want to copy elements from the previous row as in the following code:
for (i = 1; i < n; i++)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
Approach:
Outer loop iterations have to be executed in order, so they are executed sequentially.
The inner loop can be parallelized. We want to minimize the overhead of creating and destroying threads, so we want to create the team of threads just once; however, this seems like an impossible task in OpenMP.
#pragma omp parallel private(j)
{
for (i = 1; i < n; i++)
{
#pragma omp for schedule(dynamic)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
}
When we apply the ordered option to the outer loop, the code is executed sequentially, so there is no performance gain.
I am looking for a solution to the scenario above, even if I have to use some workaround.
I am adding my actual code. It is actually slower than the sequential version. Please review:
/* load input */
for (i = 1; i <= n; i++)
scanf ("%d %d", &in[i][W], &in[i][V]);
/* init */
for (i = 0; i <= wc; i++)
a[0][i] = 0;
/* compute */
#pragma omp parallel private(i,w)
{
for(i = 1; i <= n; ++i) // 1 000 000
{
j=i%2;
jn = j == 1 ? 0 : 1;
#pragma omp for
for(w = 0; w <= in[i][W]; w++) // 1000
a[j][w] = a[jn][w];
#pragma omp for
for(w = in[i][W]+1; w <= wc; w++) // 350 000
a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
}
}
As for measuring, I am using something like this:
double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;
To sum up the parallelization in OpenMP for this particular case: It is not worth it.
Why?
The operations in the inner loops are simple. The code was compiled with -O3, so the max() call was probably inlined.
The overhead of the implicit barrier at the end of each omp for is probably high enough to cancel out the performance gain, and the overall overhead is high enough to make the parallel code even slower than the sequential code.
I also found out that there is no real performance gain from a construct like this:
#pragma omp parallel private(i,j)
{
for (i = 1; i < n; i++)
{
#pragma omp for
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
}
because its performance is similar to that of this one
for (i = 1; i < n; i++)
{
#pragma omp parallel for private(j)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
thanks to built-in thread reusing in GCC libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
Since the outer loop cannot be parallelized (without the ordered option), it looks like there is no way to significantly improve the performance of the program in question using OpenMP. If someone feels I did something wrong and that it is possible, I'll be glad to see and test the solution.