No speedup with OpenMP and MPI over MPI only - C

I have read all the relevant questions I could find, but I still cannot solve my issue: a function with a double for loop is the bottleneck of my program.
The code is designed around MPI:
A big matrix is scattered among p processes.
Every process now holds a submatrix.
Every process calls update() in a loop.
When the loop terminates, the master process gathers the results.
Now I would like to augment the MPI code with OpenMP to get faster execution, by parallelizing the double for loop of update().
void update (int pSqrt, int id, int subN, float** gridPtr, float** gridNextPtr)
{
    int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0) {
        i = 2;
        end_i = subN - 1;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        i = 1;
        end_i = subN - 2;
    }
    #pragma omp parallel for
    for ( ; i < end_i; ++i) {
        if (id % pSqrt == 0) {
            j = 2;
            end_j = subN - 1;
        } else if ((id + 1) % pSqrt == 0) {
            j = 1;
            end_j = subN - 2;
        }
        #pragma omp parallel for
        for ( ; j < end_j; ++j) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                parms.cx * (gridPtr[i+1][j] +
                            gridPtr[i-1][j] -
                            2.0 * gridPtr[i][j]) +
                parms.cy * (gridPtr[i][j+1] +
                            gridPtr[i][j-1] -
                            2.0 * gridPtr[i][j]);
        }
    }
}
I am running this on 2 computers, each with 2 CPUs, using 4 MPI processes. However, I see no speedup with OpenMP compared to running without it. Any ideas, please? I am compiling with the -O1 optimization flag.

High-level analysis
It is a common fallacy that hybrid programming (e.g. MPI+OpenMP) is a good idea. This fallacy is widely espoused by so-called HPC experts, many of whom are paper pushers at supercomputing centers and do not write much code. An expert take-down of the MPI+Threads fallacy is Exascale Computing without Threads.
This is not to say that flat MPI is the best model. For example, MPI experts espouse a two-level MPI-only approach in Bi-modal MPI and MPI+threads Computing on Scalable Multicore Systems and MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory (free version). In the so-called MPI+MPI model, the programmer exploits shared-memory coherence domains using MPI shared memory instead of OpenMP, but with a private-by-default data model, which reduces the incidence of race conditions. Furthermore, MPI+MPI uses only one runtime system, which makes resource management and process topology/affinity easier. In contrast, MPI+OpenMP requires you either to use the fundamentally non-scalable fork-join execution model with threads (i.e. make MPI calls only between OpenMP parallel regions) or to enable MPI_THREAD_MULTIPLE in order to make MPI calls within threaded regions - and MPI_THREAD_MULTIPLE entails noticeable overhead on today's platforms.
This topic could cover many pages, which I do not have time to write at the moment, so please see the cited links.
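To make the MPI+MPI idea a bit more concrete, here is a minimal sketch (not your code; the slice size and variable names are assumptions) of how ranks on one node can allocate a shared window with MPI-3 and load/store each other's slices directly:
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split COMM_WORLD into one communicator per shared-memory node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* Every rank contributes a slice to a window that is directly
       addressable by all ranks on the node. */
    const MPI_Aint local_n = 1024;               /* assumed slice size */
    float *my_slice;
    MPI_Win win;
    MPI_Win_allocate_shared(local_n * sizeof(float), sizeof(float),
                            MPI_INFO_NULL, node_comm, &my_slice, &win);

    /* Obtain a direct pointer to rank 0's slice on this node. */
    MPI_Aint size;
    int disp_unit;
    float *rank0_slice;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_slice);

    /* ... compute on my_slice, read neighbours through queried pointers,
       separating phases with MPI_Win_fence or a node-level barrier ... */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}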
Reducing OpenMP runtime overheads
One reason that MPI+OpenMP does not perform as well as pure MPI is that OpenMP runtime overheads have a tendency to appear in too many places. One type of unnecessary runtime overhead comes from nested parallelism, which occurs when one omp parallel construct is nested inside another. Most programmers do not know that a parallel region is a relatively expensive construct and that one should try to minimize them. Furthermore, omp parallel for is the fusion of two constructs - parallel and for - and one should really try to think of these independently. Ideally, you create one parallel region that contains many worksharing constructs, such as for, sections, etc.
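As a generic illustration of that advice (a sketch, not your kernel): a single parallel region can host several worksharing loops, so the thread team is forked and joined only once.
void two_phases(int n, float *a, const float *b)
{
    #pragma omp parallel
    {
        #pragma omp for            // phase 1; implicit barrier at the end of this loop
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        #pragma omp for nowait     // phase 2; the region's closing barrier is enough
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }                              // threads are joined here, once
}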
Below is your code modified to use only one parallel region and parallelism across both for loops. Because collapse requires perfect nesting (nothing between the two for loops), I had to move some code inside. However, nothing stops the compiler from hoisting this loop invariant back out after the OpenMP lowering (this is a compiler concept, which you can ignore), so that code might still only execute end_i times, rather than end_i*end_j times.
Update: I've adapted the code from the other answer to demonstrate collapse instead.
There are a variety of ways to parallelize these two loops with OpenMP. Below you can see four versions, all of which are compliant with OpenMP 4. Version 1 is likely to be the best, at least in current compilers. Version 2 uses collapse but not simd (it is compliant with OpenMP 3). Version 3 could be the best, but is harder to implement ideally and does not lead to SIMD code generation with some compilers. Version 4 parallelizes only the outer loop.
You should experiment to see which one of these options is the fastest for your application.
#if VERSION==1
#define OUTER _Pragma("omp parallel for")
#define INNER _Pragma("omp simd")
#elif VERSION==2
#define OUTER _Pragma("omp parallel for collapse(2)")
#define INNER
#elif VERSION==3
#define OUTER _Pragma("omp parallel for simd collapse(2)")
#define INNER
#elif VERSION==4
#define OUTER _Pragma("omp parallel for simd")
#define INNER
#else
#error Define VERSION
#define OUTER
#define INNER
#endif
struct {
    float cx;
    float cy;
} parms;

void update (int pSqrt, int id, int subN,
             const float * restrict gridPtr[restrict],
             float * restrict gridNextPtr[restrict])
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    OUTER
    for ( int i = beg_i; i < end_i; ++i ) {
        INNER
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j]
                + parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2 * gridPtr[i][j])
                + parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2 * gridPtr[i][j]);
        }
    }
}
The example code above is correct with the following compilers:
GCC 5.3.0
Clang-OpenMP 3.5.0
Cray C 8.4.2
Intel 16.0.1.
It will not compile with PGI 11.7, both because of [restrict] (replacing it with [] is sufficient) and because of the OpenMP simd construct. This compiler lacks full support for C99, contrary to this presentation. It's not too surprising that it is not compliant with OpenMP 4, given that it was released in 2011. Unfortunately, I do not have access to a newer version.

What about this version (not tested)?
Please compile it and test it. If it works better, I'll explain it more.
BTW, using some more aggressive compiler options might help too.
void update (int pSqrt, int id, int subN, float** gridPtr, float** gridNextPtr)
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    #pragma omp parallel for schedule(static)
    for ( int i = beg_i; i < end_i; ++i ) {
        #pragma omp simd
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                parms.cx * (gridPtr[i+1][j] +
                            gridPtr[i-1][j] -
                            2.0 * gridPtr[i][j]) +
                parms.cy * (gridPtr[i][j+1] +
                            gridPtr[i][j-1] -
                            2.0 * gridPtr[i][j]);
        }
    }
}
EDIT: Some explanations on what I did to the code...
The initial version was using nested parallelism for no reason at all (a parallel region nested within another parallel region). This was likely to be very counterproductive, so I simply removed it.
The loop indexes i and j were declared and initialised outside of the for statements. This is error-prone on two levels: 1/ it may force you to declare their parallel scope (private), whereas declaring them within the for statement automatically gives them the right one; and 2/ you can get mix-ups by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy.
You were changing the boundaries of the j loops inside the parallel region for no good reason. You would have had to declare end_j private. Moreover, it was a potential limitation for further developments (like potential use of the collapse(2) directive), as it broke the rules of the canonical loop form defined in the OpenMP standard. So defining beg_i and beg_j outside of the parallel region made sense, sparing computations and simplifying the form of the loops, keeping them canonical.
Now from there, the code was suitable for vectorisation, and the addition of a simple simd directive on the j loop would enforce it, should the compiler fail to see the possible vectorisation by itself.

Related

Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelize the inner loop of a program that has a data dependency (min) outside the scope of the loops. I'm having an issue with the residual calculations that occur outside the scope of the inner j loop. The code gives errors if the "#pragma omp parallel" part is included on the j loop, even if that loop doesn't run at all because k is too low, say (1, 2, 3) for example.
for (i = 0; i < 10; i++)
{
    #pragma omp parallel for shared(min) private (j, a, b, storer, arr)
    for (j = 0; j < k-4; j += 4)
    {
        mm_a = _mm_load_ps(&x[j]);
        mm_b = _mm_load_ps(&y[j]);
        mm_a = _mm_add_ps(mm_a, mm_b);
        _mm_store_ps(storer, mm_a);
        #pragma omp critical
        {
            if (storer[0] < min)
            {
                min = storer[0];
            }
            if (storer[1] < min)
            {
                min = storer[1];
            }
            //etc
        }
    }
    do
    {
        #pragma omp critical
        {
            if (x[j]+y[j] < min)
            {
                min = x[j]+y[j];
            }
        }
    } while (j++ < (k - 1));
    round_min = min;
}
The j-based loop is a parallel loop, so you cannot use j after the loop. This is especially true since you explicitly made j private, so it is only visible locally within each thread and not after the parallel region. You can explicitly compute the position of the remaining j value using (k-4+3)/4*4 just after the parallel loop.
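A sketch of that suggestion, reusing the question's x, y and min and assuming k >= 4:
/* First index not covered by the vectorized loop: k-4 rounded up to a multiple of 4. */
int j_rem = (k - 4 + 3) / 4 * 4;
for (int jj = j_rem; jj < k; jj++) {   /* scalar tail, run sequentially after the parallel loop */
    if (x[jj] + y[jj] < min)
        min = x[jj] + y[jj];
}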
Furthermore, here are a few important points:
You may not really need to vectorize the code yourself: you can use omp simd reduction. OpenMP can do all the boring job of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may also be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case they will just annihilate any possible improvement from the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
    // Assume min is already initialized correctly here
    #pragma omp parallel for simd reduction(min:min) private(j)
    for (j = 0; j < k; ++j)
    {
        const float tmp = x[j] + y[j];
        if (tmp < min)
            min = tmp;
    }
    // Use min here
}
The above code is vectorized correctly for the x86 architecture by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).

What is the best way to parallelize using OpenMP in C

I am a beginner with OpenMP in C. I am trying to parallelize four nested loops. I've read that it's advisable to only parallelize the outer loop, but it's taking a very long time.
What is the best way to parallelize the following?
int nt=2500, nx=400, nz=200, nh=50;
#pragma omp parallel for
for(it=0; it<nt; it++)
    for(ix=0; ix<nx; ix++)
        for(iz=0; iz<nz; iz++)
            for(ih=-nh; ih<=nh; ih++) {
                if (ix+ih<nx && ix+ih>=0 && ix-ih<nx && ix-ih>=0 ) {
                    dR[it][ix+ih][iz] += ii[ih+nh][ix][iz]*us[it][ix-ih][iz];
                    dS[it][ix-ih][iz] += ii[ih+nh][ix][iz]*ur[it][ix+ih][iz];
                }
            }
As far as data races go, it is unsafe to parallelize loops in a way that causes the same memory location to be accessed by two different threads where at least one access is a write.
You never read and write to the same variable, so it should be safe to parallelize every loop. (Not necessarily more efficient though)
Your actual loops can also be rewritten.
Your if condition can be written logically as 0 <= ix+ih < nx && 0 <= ix-ih < nx; in other words, you only ever want to write to indices between 0 and nx.
If we can show that the ranges of ix+ih and ix-ih cover more than 0 to nx, we can eliminate the check and loop over those indices directly.
Examining the loops, we see that 0 <= ix < nx and -nh <= ih <= nh, which allows us to find the ranges of ix+ih and ix-ih.
ix+ih ranges from -nh to nx+nh, and ix-ih also ranges from -nh to nx+nh. Both of these ranges contain 0 to nx as long as nh is positive, so we don't need to do a check at all. We can just loop from 0 to nx.
omp_set_nested(1);
#pragma omp parallel for
for(it=0; it<nt; it++) {
    #pragma omp parallel for
    for (iy = 0; iy < nx; iy++) {
        #pragma omp parallel for
        for(iz=0; iz<nz; iz++) {
            dR[it][iy][iz] += ii[ih+nh][ix][iz] * us[it][ix-ih][iz];
            dS[it][iy][iz] += ii[ih+nh][ix][iz] * ur[it][ix+ih][iz];
        }
    }
}

OpenMP: Parallelize for loops inside a while loop

I am implementing an algorithm to compute a force-directed graph layout. I would like to add OpenMP directives to accelerate some loops. After reading some courses, I added OpenMP directives according to my understanding. The code compiles, but doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
    if(i == j) continue;
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    // power used for repulsive force model (standard is 1/r, 1/r^2 works well)
    // double coef = 0.0; -C*K*K/dist; // power 1/r
    double coef = -C*K*K*K/(dist*dist); // power 1/r^2
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
    int j = graph->adjncy[k]; /* edge (i,j) */
    double * _xj = _position+j*DIM;
    double dist = DISTANCE(_xi,_xj);
    double coef = dist*dist/K;
    for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write into a variable shared by threads unless you use an atomic access (#pragma omp atomic read/write/update) or synchronize access to such a variable explicitly (critical sections, locks). Read some tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use a so-called compare-and-exchange atomic operation to solve this). A much better solution here is to let each thread calculate its local maximum and then reduce it into a global maximum.
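A minimal sketch of that approach (the loop bounds and the nmove/nforce2 computations are placeholders, since the real code is only in the linked archive; reduction(max:...) needs OpenMP 3.1 or later):
double maxmove = 0.0, energy = 0.0;
#pragma omp parallel for reduction(max:maxmove) reduction(+:energy)
for (int i = 0; i < graph->nvtxs; i++) {
    double nmove   = compute_move(i);      /* hypothetical per-vertex displacement */
    double nforce2 = compute_force2(i);    /* hypothetical squared force norm */
    if (nmove > maxmove) maxmove = nmove;  /* each thread keeps its own maximum */
    energy += nforce2;                     /* each thread keeps its own sum */
}
/* OpenMP combines the per-thread maxmove and energy values after the loop. */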

optimize MSE algorithm using openmp

I wanted to optimize the code below using OpenMP:
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;
#define _MSE(m, t) \
    val = refData[t] - calData[t]; \
    m += val*val;
#pragma omp parallel
{
    #pragma omp for
    for( i=0; i<(width*height)/2; i++ ) { //yuv422: 2 pixels at a time
        _MSE(m_u, 0);
        _MSE(m_y, 1);
        _MSE(m_v, 2);
        _MSE(m_y, 3);
        #pragma omp reduction(+:refData) reduction(+:calData)
        refData += 4;
        calData += 4;
        // int id = omp_get_thread_num();
        // printf("Thread %d performed %d iterations of the loop\n", id, i);
    }
}
Any suggestions are welcome for optimizing the above code; currently I get wrong output.
I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:
double sqError[4] = {0};
const int numBytes = width * height * 2;
#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);
    }
}
This way, each thread operates exclusively on one thing and there is no contention.
Maybe it's not the most advanced use of OMP, but you should see a speedup.
After your comment about performance hit, I did some experiments and found that indeed the performance was worse. I suspect this may be due to cache misses.
You said:
performance hit this time with openMP : Time :0.040637 with serial
Time :0.018670
So I reworked it using the reduction on each variable and using a single loop:
double e0 = 0, e1 = 0, e2 = 0, e3 = 0; // partial sums for the four byte positions
#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
for( int i = 0; i < numBytes; i += 4 ) {
    int val = refData[i] - calData[i];
    e0 += (double)(val*val);
    val = refData[i+1] - calData[i+1];
    e1 += (double)(val*val);
    val = refData[i+2] - calData[i+2];
    e2 += (double)(val*val);
    val = refData[i+3] - calData[i+3];
    e3 += (double)(val*val);
}
With my test case on a 4-core machine, I observed more than a 4-fold improvement over the serial version:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:
Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this: struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
The code looks like it's calculating the MSE but adding to the same sum, m. For parallelism to work properly, you need to eliminate the sharing of m; one approach would be preallocating an array (of size width*height/2, I imagine) just to store the different sums, or ms. Finally, add up all the sums at the end.
Also, test that this is actually faster!
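One way to sketch this idea, computing a single total sum for simplicity and using one padded slot per thread rather than one per pixel, to dodge the false sharing described in the quoted comment above (the 8-bit sample type and MAX_THREADS are assumptions):
#include <omp.h>

double mse_sum(const unsigned char *refData, const unsigned char *calData,
               int width, int height)
{
    enum { MAX_THREADS = 64 };
    /* One accumulator per thread, padded out to a full cache line. */
    struct { double val; double pad[7]; } partial[MAX_THREADS] = {{0}};

    const int numBytes = width * height * 2;   /* yuv422: 2 bytes per pixel */

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < numBytes; i++) {
            double d = (double)refData[i] - (double)calData[i];
            partial[tid].val += d * d;          /* each thread writes only its own cache line */
        }
    }

    double m = 0.0;
    for (int t = 0; t < MAX_THREADS; t++)       /* add up all the sums at the end */
        m += partial[t].val;
    return m;
}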

unbalanced nested for loops in openmp

I've been trying to parallelize an algorithm with unbalanced nested for loops using OpenMP. I can't post the original code as it's a secret project of an unheard-of government, but here's a toy example:
for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(j, k)
    for (j = 0; j < 1000000; j++) {
        for (k = 0; k < 2; k++) {
            temp = i * j * k; /* dummy operation (don't mind the race) */
        }
        if (i % 2 == 0) temp = 0; /* so I can't use openmp collapse */
    }
}
Currently this example runs slower with multiple threads (~1 sec with a single thread, ~2.4 sec with 2 threads, etc.).
Things to note:
The outer for loop needs to be done in order (each step depends on the previous one). (As far as I know, OpenMP handles inner loops well, so threads don't get created/destroyed at each step, right?)
Typical index numbers are given in the example (100, 1000000, 2).
The dummy operation consists of just a few operations.
There are some conditional operations outside the innermost loop, so collapse is not an option (and it doesn't seem like it would increase the performance anyway).
It looks like an embarrassingly parallel algorithm, but I haven't been able to get any speedup for the last two days. What would be the best strategy here?
Unfortunately this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that besides i, temp is also a shared automatic variable, I will assume so for the rest of this text. It also tells me that you have a pre-Nehalem CPU...
There are two sources of slowdown here - code transformation and cache coherency.
The way parallel regions are implemented is that their code is extracted into separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformation, your code sample would become something similar to this:
typedef struct {
    int i;
    int temp;
} main_omp_fn_0_shared_vars;

void main_omp_fn_0 (void *data)
{
    main_omp_fn_0_shared_vars *vars = data;
    int j, k;

    // compute values of j_min and j_max for this thread
    for (j = j_min; j < j_max; j++) {
        for (k = 0; k < 2; k++) {
            vars->temp = vars->i * j * k;
            if (vars->i % 2 == 0) vars->temp = 0;
        }
    }
}

int main (void)
{
    int i, temp;
    main_omp_fn_0_shared_vars vars;

    for (i = 0; i < 100; i++)
    {
        vars.i = i;
        vars.temp = temp;

        // This is how GCC implements parallel regions with libgomp

        // Start main_omp_fn_0 in the other threads
        GOMP_parallel_start(main_omp_fn_0, &vars, 0);

        // Start main_omp_fn_0 in the main thread
        main_omp_fn_0(&vars);

        // Wait for other threads to finish (implicit barrier)
        GOMP_parallel_end();

        i = vars.i;
        temp = vars.temp;
    }
}
You pay a small penalty for accessing temp and i this way as their intermediate values cannot be stored in registers but are loaded and stored each time.
The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line and although vars.i is only read from and vars.temp is only written to, full cache invalidation is likely to occur at each iteration of the inner loop.
Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections and performance degradation is well expected in that case.
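A sketch of the minimal change implied by this analysis: keep temp local to the parallel loop so that nothing shared is written in the hot path (the dummy computation from the question is unchanged):
for (int i = 0; i < 100; i++) {
    #pragma omp parallel for
    for (int j = 0; j < 1000000; j++) {
        int temp = 0;                   /* declared inside: private to each thread */
        for (int k = 0; k < 2; k++)
            temp = i * j * k;
        if (i % 2 == 0) temp = 0;
        (void)temp;                     /* keep the compiler quiet about the unused dummy */
    }
}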
Think of the overheads:
Since your outer loop needs to be done in order, you're creating x threads to perform the work in the inner loop, destroying them, then creating them again... and so on, 100 times.
You have to wait until the longest task within the inner loop completes its work before performing the next step of the outer loop, so essentially this is a synchronisation overhead. The tasks don't look irregular, but if the work to perform is small, there's only so much speedup you can get out of this.
You have the cost of thread creation here, and of allocating the private variables.
If the work inside the inner loop is small, the benefits of parallelising this loop might not outweigh the cost of the parallelisation overheads above, hence you end up with a slowdown.
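One pattern that avoids paying the fork/join price 100 times is to open the parallel region once, outside the ordered loop, and keep only the worksharing for inside it (a sketch against the toy example above, not tested on the real code):
#pragma omp parallel
{
    for (int i = 0; i < 100; i++) {      /* every thread walks the outer loop */
        int temp = 0;
        #pragma omp for
        for (int j = 0; j < 1000000; j++) {
            for (int k = 0; k < 2; k++)
                temp = i * j * k;        /* dummy work, as in the question */
            if (i % 2 == 0) temp = 0;
        }                                /* implicit barrier keeps the outer steps ordered */
        (void)temp;
    }
}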
