I am a beginner at using OpenMP in C. I am trying to parallelize four nested loops. I've read that it's advisable to only parallelize the outer loop, but doing so is taking a very long time.
What is the best way to parallelize the following?
int nt=2500, nx=400, nz=200, nh=50;
#pragma omp parallel for
for(it=0; it<nt; it++)
    for(ix=0; ix<nx; ix++)
        for(iz=0; iz<nz; iz++)
            for(ih=-nh; ih<=nh; ih++) {
                if (ix+ih<nx && ix+ih>=0 && ix-ih<nx && ix-ih>=0) {
                    dR[it][ix+ih][iz] += ii[ih+nh][ix][iz]*us[it][ix-ih][iz];
                    dS[it][ix-ih][iz] += ii[ih+nh][ix][iz]*ur[it][ix+ih][iz];
                }
            }
As far as data races go, it is unsafe to parallelize loops in a way that lets the same memory location be accessed by two different threads when at least one of the accesses is a write.
You never read and write to the same variable, so it should be safe to parallelize every loop (though not necessarily more efficient).
Your actual loops can also be rewritten.
Your if condition can be written logically as 0 <= ix+ih < nx && 0 <= ix-ih < nx; in other words, you only ever write where the index lies between 0 and nx.
If we can show that the ranges of ix+ih and ix-ih cover 0 to nx, we can eliminate the check and manually loop over that range.
Examining the loops, we see that 0 <= ix < nx and -nh <= ih <= nh, which lets us find the ranges of ix+ih and ix-ih.
ix+ih ranges from -nh to nx+nh, and ix-ih also ranges from -nh to nx+nh. Both of these ranges contain 0 to nx as long as nh is positive, so we don't need to do a check at all; we can just loop from 0 to nx.
omp_set_nested(1);
#pragma omp parallel for
for(it=0; it<nt; it++) {
    #pragma omp parallel for
    for (iy = 0; iy < nx; iy++) {
        #pragma omp parallel for
        for(iz=0; iz<nz; iz++) {
            dR[it][iy][iz] += ii[ih+nh][ix][iz] * us[it][ix-ih][iz];
            dS[it][iy][iz] += ii[ih+nh][ix][iz] * ur[it][ix+ih][iz];
        }
    }
}
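As a side note, here is a hedged sketch of an alternative that avoids nested parallel regions altogether: reorder the loops so that iz follows it and collapse the two with collapse(2). Each (it, iz) pair then updates a disjoint slice dR[it][*][iz] / dS[it][*][iz], the original bounds check is kept, and nt*nz iterations are distributed over the threads (at the cost of a less cache-friendly innermost access pattern).
#pragma omp parallel for collapse(2)
for(it=0; it<nt; it++)
    for(iz=0; iz<nz; iz++)
        for(ix=0; ix<nx; ix++)
            for(ih=-nh; ih<=nh; ih++) {
                if (ix+ih<nx && ix+ih>=0 && ix-ih<nx && ix-ih>=0) {
                    dR[it][ix+ih][iz] += ii[ih+nh][ix][iz]*us[it][ix-ih][iz];
                    dS[it][ix-ih][iz] += ii[ih+nh][ix][iz]*ur[it][ix+ih][iz];
                }
            }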
I'm trying to parallelize the inner loop of a program that has data dependencies (min) outside the scope of the loops. I'm having an issue where the residual calculations occur outside the scope of the inner j loop. The code gives errors if the "#pragma omp parallel" part is included on the j loop, even if that loop doesn't run at all because the k value is too low, say k = 1, 2 or 3 for example.
for (i = 0; i < 10; i++)
{
    #pragma omp parallel for shared(min) private (j, a, b, storer, arr)
    for (j = 0; j < k-4; j += 4)
    {
        mm_a = _mm_load_ps(&x[j]);
        mm_b = _mm_load_ps(&y[j]);
        mm_a = _mm_add_ps(mm_a, mm_b);
        _mm_store_ps(storer, mm_a);
        #pragma omp critical
        {
            if (storer[0] < min)
            {
                min = storer[0];
            }
            if (storer[1] < min)
            {
                min = storer[1];
            }
            // etc.
        }
    }
    do
    {
        #pragma omp critical
        {
            if (x[j]+y[j] < min)
            {
                min = x[j]+y[j];
            }
        }
    } while (j++ < (k - 1));
    round_min = min;
}
The j-based loop is a parallel loop, so you cannot use j after the loop. This is especially true since you explicitly made j private, so it is only visible locally within each thread and not after the parallel region. You can explicitly compute the final value of j using (k-4+3)/4*4 just after the parallel loop.
Furthermore, here are a few important points:
You may not really need to vectorize the code yourself: you can use an omp simd reduction. OpenMP can do all the tedious work of handling the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may well be faster than yours too. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case, they will simply annihilate any possible improvement from the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
    // Assume min is already initialized correctly here
    #pragma omp parallel for simd reduction(min:min) private(j)
    for (j = 0; j < k; ++j)
    {
        const float tmp = x[j] + y[j];
        if(tmp < min)
            min = tmp;
    }
    // Use min here
}
The above code is vectorized correctly for the x86 architecture by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).
I am trying to execute some code in C using OpenMP. The following is the code:
#pragma omp parallel \
    reduction(+:array[length])
{
    int start = 1, distance, nthreads;
    nthreads = omp_get_num_threads();
    printf("%d\n", nthreads);
    #pragma omp for
    for (distance = 1; distance < length; distance = distance + distance)
    {
        for (i = length - 1; i >= start; i--)
        {
            array[i] = array[i] + array[i - distance];
        }
        start *= 2;
    }
}
The compiler is throwing the following error:
error: increment expression refers to iteration variable ‘distance’
#pragma omp for
I tried to search for this error online, but didn't find much. Any help in decoding the error would be useful.
Also, should the reduction clause be placed up top next to #pragma omp parallel \ or after #pragma omp for?
The OpenMP loop work-sharing construct requires a so-called canonical loop form: you can only increment the loop variable by a loop-invariant value. You have to restructure your loop, e.g. by iterating over the number of doubling steps (log2) and deriving distance with <<. Also note that your use of start is not correct; compute start from the loop iteration.
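A minimal sketch of such a restructuring (assumptions: array holds ints, and a scratch buffer tmp of the same size is available so that a step never updates array in place while other threads may still be reading it):
#pragma omp parallel
{
    for (int step = 0; (1 << step) < length; step++)
    {
        int distance = 1 << step;   /* was: distance = distance + distance */
        int start = distance;       /* derived from the step, not accumulated */

        #pragma omp for
        for (int i = length - 1; i >= start; i--)
            tmp[i] = array[i] + array[i - distance];

        #pragma omp for
        for (int i = length - 1; i >= start; i--)
            array[i] = tmp[i];
        /* the implicit barriers of 'omp for' keep the doubling steps in order */
    }
}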
I am implementing an algorithm to compute a graph layout using a force-directed method. I would like to add OpenMP directives to accelerate some loops. After reading some courses, I added some OpenMP directives according to my understanding. The code compiles, but doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops, which consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
    if(i == j) continue;
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    // power used for repulsive force model (standard is 1/r, 1/r^2 works well)
    // double coef = 0.0; -C*K*K/dist; // power 1/r
    double coef = -C*K*K*K/(dist*dist); // power 1/r^2
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
    int j = graph->adjncy[k]; /* edge (i,j) */
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    double coef = dist*dist/K;
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write into a variable shared by threads unless you use an atomic access (#pragma omp atomic read/write/update) or you synchronize access to such a variable explicitly (critical sections, locks). Read a tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use a so-called compare-and-exchange atomic operation to solve this). A much better solution here is to let each thread calculate its local maximum and then reduce them into a global maximum.
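A minimal sketch of that reduction approach, using OpenMP's built-in max and + reductions (requires OpenMP 3.1 or later; compute_move() and compute_force2() are hypothetical placeholders standing in for whatever per-vertex values the real loop body computes):
/* hypothetical placeholders for the per-vertex computations */
extern double compute_move(int i);
extern double compute_force2(int i);

double maxmove = 0.0, energy = 0.0;

#pragma omp parallel for reduction(max:maxmove) reduction(+:energy)
for (int i = 0; i < graph->nvtxs; i++) {
    double nmove   = compute_move(i);
    double nforce2 = compute_force2(i);
    if (nmove > maxmove) maxmove = nmove;   /* thread-local max, merged on exit */
    energy += nforce2;                      /* thread-local sum, merged on exit */
}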
I am trying to parallelize the following nested "for loops" (in C) using OpenMP.
for (dt = 0; dt <= maxdt; dt++) {
    for (t0 = 0; t0 <= nframes-dt; t0++) {
        for (i = 0; i < natoms; i++) {
            VAC[dt] = VAC[dt] + dot_product(vect[t0][i], vect[t0+dt][i]);
        }
    }
}
Basically, this calculates an auto-correlation function of a time-dependent vector (vect). I need the VAC array as the final output, computed using OpenMP.
I have tried using the reduction sum approach of OpenMP to perform this, by adding the following line above the innermost loop (for (i=0; i<natoms; i++)).
#pragma omp parallel for default(shared) private(i,axis) schedule(guided) reduction(+: VAC[dt])
But this does not work, since this reduction-sum syntax does not work for arrays. What would be the best and most efficient way to parallelize such code? Thanks.
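For illustration, one hedged sketch of an approach (assuming the variables declared above and that VAC holds doubles; the local variable sum is introduced here): each iteration of the dt loop writes only VAC[dt], so parallelizing the outer loop needs no reduction at all.
/* Parallelize the outer loop: every thread owns distinct values of dt, so each
 * writes a distinct VAC[dt] and no reduction clause is needed.  t0 and i must
 * be private because they are declared outside the loop.  schedule(dynamic)
 * helps because later dt values have fewer t0 iterations. */
#pragma omp parallel for private(t0, i) schedule(dynamic)
for (dt = 0; dt <= maxdt; dt++) {
    double sum = 0.0;                       /* assumes VAC holds doubles */
    for (t0 = 0; t0 <= nframes-dt; t0++)
        for (i = 0; i < natoms; i++)
            sum += dot_product(vect[t0][i], vect[t0+dt][i]);
    VAC[dt] += sum;
}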
I've recently been reading up on OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1, 2, 3, 4, etc. written to my file; I have Points 1, 4, 7, 8, etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to help get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>

pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
        disparityThresholdValue = row[j];
        // Don't want to save certain points
        if ( disparityThresholdValue < threshHold)
        {
            // Get the data points
            x = (int)Image.x[k];
            y = (int)Image.y[k];
            z = (int)Image.z[k];
            grayValue = (int)Image.gray[k];
            cloudObject->points[k].x = x;
            cloudObject->points[k].y = y;
            cloudObject->points[k].z = z;
            cloudObject->points[k].grayValue = grayValue;
            fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
        }
    }
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> OpenMP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming: once you execute things side by side, you won't be able to guarantee order unless you synchronize the accesses (at which point you can just leave out the parallelization, as it becomes effectively sequential). If you need to rely on order, you can parallelize the calculations but need to write the results from a single thread.
If the points themselves are messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop, k is a shared variable and gets incremented in a non-atomic way. x, y and z are also shared among the inner loop's threads and overwritten "randomly". Remove the inner directive and see how it goes.
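A hedged sketch of what the restructured loop could look like (assumptions beyond the advice above: the element type of Image.data is guessed as short, and the question's per-row counter k is replaced by a per-pixel index idx = i*ncols + j, which is only a guess at the intended indexing but makes every write target unique; the fprintf is wrapped in a critical section so each output line stays intact, although line order still depends on scheduling):
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++)
{
    const short *row = Image.data + i * pixelIncrement;   /* assumed element type */
    for (int j = 0; j < Image.ncols; j++)                 /* inner loop left serial */
    {
        if (row[j] < threshHold)
        {
            int idx = i * Image.ncols + j;                /* hypothetical per-pixel index */
            float x = (float)Image.x[idx];
            float y = (float)Image.y[idx];
            float z = (float)Image.z[idx];
            int gray = (int)Image.gray[idx];

            cloudObject->points[idx].x = x;
            cloudObject->points[idx].y = y;
            cloudObject->points[idx].z = z;
            cloudObject->points[idx].grayValue = gray;

            #pragma omp critical
            fprintf(cloudPointsFile, "%f %f %f %d\n", x, y, z, gray);
        }
    }
}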
When you have a loop with a nested loop, there is no need for a second omp pragma.
It will already parallelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequential incrementation, so you cannot execute the second loop in a random order. OMP pragmas are a very easy and cool way to parallelize code, but do not use them too much!
More details here -> Parallel Loops with OpenMP