I am trying to parallelize the following nested "for loops" (in C) using OpenMP.
for (dt = 0; dt <= maxdt; dt++) {
    for (t0 = 0; t0 <= nframes-dt; t0++) {
        for (i = 0; i < natoms; i++) {
            VAC[dt] = VAC[dt] + dot_product(vect[t0][i], vect[t0+dt][i]);
        }
    }
}
Basically, this calculates the auto-correlation function of a time-dependent vector (vect); the VAC array is the final output I need.
I have tried OpenMP's reduction approach by adding the following line above the innermost loop (for (i=0; i<natoms; i++)).
#pragma omp parallel for default(shared) private(i,axis) schedule(guided) reduction(+: VAC[dt])
But this does not work, since reduction sum does not work for arrays. What would be the best and most efficient way to parallelize such codes? Thanks.
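One approach that avoids the array reduction entirely (a sketch, under the assumptions that vect is not modified during this computation and that dot_product has no side effects): since every iteration of the dt loop writes only its own element VAC[dt], the outermost loop can be distributed over threads with no reduction at all; dynamic scheduling helps because the inner trip count (nframes-dt) shrinks as dt grows.
#pragma omp parallel for private(t0, i) schedule(dynamic)
for (dt = 0; dt <= maxdt; dt++) {
    /* each thread owns the VAC[dt] elements of the dt values it is given */
    for (t0 = 0; t0 <= nframes-dt; t0++) {
        for (i = 0; i < natoms; i++) {
            VAC[dt] = VAC[dt] + dot_product(vect[t0][i], vect[t0+dt][i]);
        }
    }
}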
Is it worth parallelizing loops with a true dependency? What are the pros and cons, and how much speedup can we expect on average?
For example:
int sum = 0;
for (i = 0; i < 2000-1; i++) {
    for (j = 0; j < 2000; j++) {
        curr[i][j] = some_value_here;
        sum += curr[i][j];
    }
}
How should I approach this loop? There is an obvious RAW dependency; should I parallelize it, and if so, how?
sum acts as a simple accumulator and this whole operation is a parallel reduction. The proper solution is to have each thread accumulate its own private sum and then add all private sums together at the end. OpenMP provides the reduction clause that does exactly that:
int sum = 0;
#pragma omp parallel for collapse(2) reduction(+:sum)
for (i = 0; i < 2000-1; i++) {
    for (j = 0; j < 2000; j++) {
        curr[i][j] = some_value_here;
        sum += curr[i][j];
    }
}
reduction(+:sum) tells the compiler to create per-thread private copies of sum (initialised to 0 for +), reduce those private copies with the + operator to a single value, and then add that value to the value sum had before the region. The code is roughly equivalent to:
int sum = 0;
#pragma omp parallel
{
    int localsum = 0;
    #pragma omp for collapse(2)
    for (i = 0; i < 2000-1; i++) {
        for (j = 0; j < 2000; j++) {
            curr[i][j] = some_value_here;
            localsum += curr[i][j];
        }
    }
    #pragma omp atomic
    sum += localsum;
}
The potential speedup is equal to the number of execution units, provided there is one thread per execution unit and the number of threads is small enough that the summation at the end of the parallel region takes negligible time.
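If you want to check the scaling yourself, here is a minimal self-contained sketch (the stored value and the timing scaffolding are placeholders, not part of the question) that times the reduction with omp_get_wtime; compile with OpenMP enabled (e.g. -fopenmp) and vary OMP_NUM_THREADS:
#include <omp.h>
#include <stdio.h>

#define N 2000

static int curr[N-1][N];

int main(void)
{
    int i, j, sum = 0;

    double t0 = omp_get_wtime();                 /* wall-clock timer */

    #pragma omp parallel for collapse(2) reduction(+:sum)
    for (i = 0; i < N-1; i++) {
        for (j = 0; j < N; j++) {
            curr[i][j] = 1;                      /* stand-in for some_value_here */
            sum += curr[i][j];
        }
    }

    double t1 = omp_get_wtime();
    printf("threads=%d  sum=%d  elapsed=%.3f s\n",
           omp_get_max_threads(), sum, t1 - t0);
    return 0;
}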
I implemented Dijkstra's algorithm in C. I'm trying to compare the runtime with and without OpenMP, but for some reason the OpenMP version is always slower. I read that creating new threads is expensive, but enlarging the graph does not solve it. I would like to use OpenMP here:
#pragma omp parallel for
for (index = 0; index < nodes[n].size; index++) {
    int ct = nodes[n].paths[index].connectsTo;
    if (notVisited[ct]) {
        int dist = dis[n] + nodes[n].paths[index].weight;
        if (dist < dis[ct]) {
            prev[ct] = n;
            dis[ct] = dist;
        }
    }
}
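One thing worth trying (a sketch under assumptions, not a drop-in fix): the loop body is tiny, so the fork/join and scheduling overhead of the parallel region can easily exceed the useful work, which matches the observed slowdown. OpenMP's if clause only creates the thread team when there is enough work; the threshold below is an arbitrary placeholder, and the sketch also assumes each neighbour ct appears at most once in nodes[n].paths, so no two iterations write the same dis[ct] or prev[ct].
/* only go parallel when the neighbour list is long enough to amortise
   the fork/join cost; 10000 is a placeholder threshold */
#pragma omp parallel for if(nodes[n].size > 10000) schedule(static)
for (index = 0; index < nodes[n].size; index++) {
    int ct = nodes[n].paths[index].connectsTo;
    if (notVisited[ct]) {
        int dist = dis[n] + nodes[n].paths[index].weight;
        if (dist < dis[ct]) {
            prev[ct] = n;
            dis[ct] = dist;
        }
    }
}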
I am a beginner with OpenMP in C. I am trying to parallelize four nested loops. I've read that it's advisable to only parallelize the outer loop, but it's taking a very long time.
What is the best way to parallelize the following loops?
int nt=2500, nx=400, nz=200, nh=50;
#pragma omp parallel for
for (it = 0; it < nt; it++)
    for (ix = 0; ix < nx; ix++)
        for (iz = 0; iz < nz; iz++)
            for (ih = -nh; ih <= nh; ih++) {
                if (ix+ih < nx && ix+ih >= 0 && ix-ih < nx && ix-ih >= 0) {
                    dR[it][ix+ih][iz] += ii[ih+nh][ix][iz]*us[it][ix-ih][iz];
                    dS[it][ix-ih][iz] += ii[ih+nh][ix][iz]*ur[it][ix+ih][iz];
                }
            }
As far as data races go, it is unsafe to parallelize loops in a way that lets the same memory location be accessed by two different threads when at least one of the accesses is a write.
In your loops dR and dS are only written and ii, us and ur are only read, but note that different (ix, ih) pairs can hit the same element of dR or dS, so parallelizing the ix or ih loops as written would create exactly that kind of race; only the it and iz loops are safe to parallelize directly. (And parallelizing every level is not necessarily more efficient anyway.)
The loops can, however, be restructured so that every level is safe.
Your if condition can be written logically as 0 <= ix+ih < nx && 0 <= ix-ih < nx; in other words, a contribution is only stored when both output indices fall inside [0, nx).
Examining the loop bounds, 0 <= ix < nx and -nh <= ih <= nh, so both ix+ih and ix-ih range from -nh to nx-1+nh, which covers the whole output range [0, nx) as long as nh is positive.
We can therefore loop directly over an output index iy from 0 to nx and gather the contributions for that element; each iteration then only writes elements it owns, and the bounds check moves onto the gathered source indices (ix = iy-ih for dR and ix = iy+ih for dS).
omp_set_nested(1);
#pragma omp parallel for
for (it = 0; it < nt; it++) {
    #pragma omp parallel for
    for (iy = 0; iy < nx; iy++) {
        #pragma omp parallel for
        for (iz = 0; iz < nz; iz++) {
            for (int ih = -nh; ih <= nh; ih++) {
                /* contributions the original loop scattered into dR[it][iy][iz] (iy = ix+ih) */
                if (iy-ih >= 0 && iy-ih < nx && iy-2*ih >= 0 && iy-2*ih < nx)
                    dR[it][iy][iz] += ii[ih+nh][iy-ih][iz] * us[it][iy-2*ih][iz];
                /* contributions the original loop scattered into dS[it][iy][iz] (iy = ix-ih) */
                if (iy+ih >= 0 && iy+ih < nx && iy+2*ih >= 0 && iy+2*ih < nx)
                    dS[it][iy][iz] += ii[ih+nh][iy+ih][iz] * ur[it][iy+2*ih][iz];
            }
        }
    }
}
I am trying to execute some C code using OpenMP. The following is the code:
#pragma omp parallel \
        reduction(+:array[length])
{
    int start = 1, distance, nthreads;
    nthreads = omp_get_num_threads();
    printf("%d\n", nthreads);
    #pragma omp for
    for (distance = 1; distance < length; distance = distance + distance)
    {
        for (i = length - 1; i >= start; i--)
        {
            array[i] = array[i] + array[i - distance];
        }
        start *= 2;
    }
}
The compiler is throwing the following error
error: increment expression refers to iteration variable ‘distance’
#pragma omp for
I tried searching for this error online but didn't find much. Any help decoding it would be useful.
Also, should the reduction clause go next to #pragma omp parallel or after #pragma omp for?
The OpenMP loop work-sharing construct requires a so-called canonical loop form: you may only increment the loop variable by a loop-invariant amount. You have to restructure your loop, e.g. by iterating over the exponent and deriving distance with a shift (1 << k). Also note that your use of start is not correct inside a work-shared loop; compute start from the loop iteration instead (in your code it always equals distance).
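For illustration, a minimal sketch of such a restructuring (placeholder code, not the asker's full program): count the doubling steps first, then iterate over the step index k with a unit increment and derive distance and start from it. This only addresses the canonical-form error; whether the distance passes may actually run concurrently is a separate question, because each pass reads the results of the previous one.
/* number of doubling steps: distance takes the values 1, 2, 4, ... < length */
int nsteps = 0;
for (int d = 1; d < length; d <<= 1)
    nsteps++;

#pragma omp parallel
{
    #pragma omp for
    for (int k = 0; k < nsteps; k++)   /* canonical: unit increment, invariant bound */
    {
        int distance = 1 << k;         /* the old loop variable, derived from k */
        int start = distance;          /* start always equalled distance in the original */
        for (int i = length - 1; i >= start; i--)
            array[i] = array[i] + array[i - distance];
    }
}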
I'm trying to learn how to use OpenMP by parallelizing a Monte Carlo code that calculates the value of pi with a given number of iterations. The meat of the code is this:
int chunk = CHUNKSIZE;
count = 0;
#pragma omp parallel shared(chunk,count) private(i)
{
    #pragma omp for schedule(dynamic,chunk)
    for (i = 0; i < niter; i++) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        z = x*x + y*y;
        if (z <= 1) count++;
    }
}
pi = (double)count/niter*4;
printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
However, this does not yield the proper value for pi with 10,000 iterations. If all the OpenMP stuff is taken out, it works fine. I should mention that I used the Monte Carlo code from here: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html
I'm just using it to try to learn OpenMP. Any ideas why it's converging on 1.4ish? Can I not increment a variable with multiple threads? I'm guessing the problem is with the variable count.
Thanks!
Okay, I found the answer. I needed to use the REDUCTION clause. So all I had to modify was:
#pragma omp parallel shared(chunk,count) private(i)
to:
#pragma omp parallel shared(chunk) private(i,x,y,z) reduction(+:count)
Now it's converging at 3.14...yay
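One extra caveat worth knowing about (not part of the original fix): rand() keeps hidden shared state, so calling it from several threads is not guaranteed to be thread-safe and can serialize or correlate the random streams. A common workaround, sketched below with an arbitrary per-thread seeding scheme, is POSIX rand_r, which keeps its state in a variable you own:
#pragma omp parallel shared(chunk) reduction(+:count)
{
    /* one independent seed per thread; rand_r stores its state here
       instead of in the hidden global state used by rand() */
    unsigned int seed = 1234u + 17u * (unsigned int)omp_get_thread_num();

    #pragma omp for schedule(dynamic,chunk)
    for (int i = 0; i < niter; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x*x + y*y <= 1.0)
            count++;
    }
}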