I'm trying to optimize my C/Fortran code for calculating finite-difference gradients using OpenMP. The code produces correct values in both the serial and multithreaded cases. However, when I store the calculated values in an array, the code slows down dramatically compared to just performing the computations.
I wrote a simplified example for this question:
I allocate my arrays in C:
/* phi - field over which derivatives are calculated */
float *phi = (float *) calloc(SIZE, sizeof(float));
/* rhs - derivatives in each direction are summed and
stored in this variable
*/
float *rhs = (float *) calloc(SIZE, sizeof(float));
My array size is 256^3, and I have 3 cells of "padding" on each end, which makes SIZE equal to 262^3.
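For concreteness, here is how the sizes work out; the macro names below are just illustrative, not from my actual code:
/* 256 interior points plus 3 ghost cells on each side gives 262 per axis */
#define N       256
#define NGHOST  3
#define NTOT    (N + 2*NGHOST)                  /* 262 points per direction */
#define SIZE    ((size_t)NTOT * NTOT * NTOT)    /* 262^3 elements in total */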
Then I call my Fortran routine inside an OpenMP parallel region, dividing the work equally among threads in the k-direction:
#pragma omp parallel default(none) \
        shared(phi, rhs, klo_fb, khi_fb, ilo_gb, ihi_gb, jlo_gb, jhi_gb, klo_gb, khi_gb, \
               ilo_fb, ihi_fb, jlo_fb, jhi_fb, dx) \
        private(cur_thread, num_threads, nslices, cur_klo_fb, cur_khi_fb, i, time0, time1)
{
    /* Divide up slices in the z-direction over the available threads */
    cur_thread = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    nslices = khi_fb - klo_fb + 1;

    /* Current lo and hi indices in the z-direction */
    cur_klo_fb = klo_fb + nslices * cur_thread/num_threads;
    cur_khi_fb = klo_fb + nslices * (cur_thread + 1)/num_threads - 1;
    if (cur_khi_fb > khi_fb)
        cur_khi_fb = khi_fb;

    /* Start timing the program */
    time0 = omp_get_wtime();

    /* Run 100 times for better timing */
    for (i = 0; i < 100; i++)
    {
        /* Calling Fortran routine for calculating derivatives */
        CALC_DERIV(phi, rhs,
                   &(ilo_gb), &(ihi_gb), &(jlo_gb), &(jhi_gb), &(klo_gb), &(khi_gb),
                   &(ilo_fb), &(ihi_fb), &(jlo_fb), &(jhi_fb),
                   &(cur_klo_fb), &(cur_khi_fb),
                   &dx);
    }

    time1 = omp_get_wtime();
In my Fortran routine, I iterate over the array, calculating the central difference in each direction at each point:
do k=klo_fb,khi_fb
  do j=jlo_fb,jhi_fb
    do i=ilo_fb,ihi_fb
      phi_x = (phi(i+1,j,k) - phi(i-1,j,k))*dx_factor
      phi_y = (phi(i,j+1,k) - phi(i,j-1,k))*dy_factor
      phi_z = (phi(i,j,k+1) - phi(i,j,k-1))*dz_factor
      temp = rhs(i,j,k) + phi_x + phi_y + phi_z
      rhs(i,j,k) = temp
    enddo
  enddo
enddo
Here, the suffix "_fb" refers to the fill box, so the iterations run from 3:258 in the i and j directions, and over cur_klo_fb:cur_khi_fb in the k-direction.
Here's my problem: when I run the code as shown, the timing for 1 thread (serial behavior) is ~2.17 s. When I comment out the line rhs(i,j,k) = temp, the timing is 3e-4 s. Why is there such a big difference? I am doing the same number of computations; the only difference is that I store the temporary variable temp in a given location of rhs. Moreover, I read the array phi plenty of times, and that doesn't seem to affect the speed. It seems that writing temp to the array rhs is what slows things down.
When I run with OpenMP, I get a speedup, but it is not optimal. I guess I am missing something.
I hope I explained my problem clearly. I would be happy to provide the complete code I am testing.
EDIT 2:
Based on the comments, I experimented to see whether the compiler is simply optimizing the loops away altogether.
I have further modified the post to include the suggestions by @jabirali and @VladimirF, and have added timings.
So I modified the Fortran loops:
integer count
count = 1
c { begin loop over grid
      do k=klo_fb,khi_fb
        do j=jlo_fb,jhi_fb
          do i=ilo_fb,ihi_fb
            phi_x = (phi(i+1,j,k) - phi(i-1,j,k))*dx_factor
            phi_y = (phi(i,j+1,k) - phi(i,j-1,k))*dy_factor
            phi_z = (phi(i,j,k+1) - phi(i,j,k-1))*dz_factor
            temp = rhs(i,j,k) + phi_x + phi_y + phi_z
            temp2 = temp2 + temp
c           rhs(i,j,k) = temp
          enddo
        enddo
      enddo
c } end loop over grid
Here are the timings for various cases (for 1 and 2 threads):
Case 1: With temp2, no rhs(i,j,k), 1 thread: 3.24s
Case 2: With temp2, no rhs(i,j,k), 2 threads: 1.65s
Case 3: Without temp2, with rhs(i,j,k), 1 thread: 1.23s
Case 4: Without temp2, with rhs(i,j,k), 2 threads: 0.74s
Case 5: Without temp2, without rhs, 1 thread: 1e-6s
Still confused :(.
Related
I'm trying to get some experience with OpenCL; the environment is set up and I can create and execute kernels. I am currently trying to compute pi in parallel using the Leibniz formula, but have been receiving some strange results.
The kernel is as follows:
__kernel void leibniz_cl(__global float *space, __global float *result, int chunk_size)
{
__local float pi[THREADS_PER_WORKGROUP];
pi[get_local_id(0)] = 0.;
for (int i = 0; i < chunk_size; i += THREADS_PER_WORKGROUP) {
// `idx` is the work item's `i` in the grander scheme
int idx = (get_group_id(0) * chunk_size) + get_local_id(0) + i;
float idx_f = 1 / ((2 * (float) idx) + 1);
// Make the fraction negative if needed
if(idx & 1)
idx_f = -idx_f;
pi[get_local_id(0)] += idx_f;
}
// Reduction within workgroups (in `pi[]`)
for(int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
if (get_local_id(0) < groupsize)
pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
If I end the function here and set *result to pi[get_local_id(0)] for !get_global_id(0) (i.e. the reduced value of the first work group), printing result gives -nan.
Remainder of kernel:
// Reduction amongst workgroups (into `space[]`)
if(!get_local_id(0)) {
space[get_group_id(0)] = pi[get_local_id(0)];
for(int groupsize = get_num_groups(0) / 2; groupsize > 0; groupsize >>= 1) {
if(get_group_id(0) < groupsize)
space[get_group_id(0)] += space[get_group_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(get_global_id(0) == 0)
*result = space[get_group_id(0)] * 4;
}
Returning space[get_group_id(0)] * 4 returns either -nan or a very large number which clearly is not an approximation of pi.
I can't decide if it is an OpenCL concept I'm missing or a parallel execution one in general. Any help is appreciated.
Links
Reduction template: OpenCL float sum reduction
Leibniz Formula: https://www.wikiwand.com/en/Leibniz_formula_for_%CF%80
Maybe these are not the most critical issues with the code, but they can be the source of the problem:
You definitely should use barrier(CLK_LOCAL_MEM_FENCE); before the local reduction. This can be avoided only if you know that the work group size is equal to or smaller than the number of threads in a wavefront running the same instruction in parallel - 64 for AMD GPUs, 32 for NVIDIA GPUs.
The global reduction must be done in multiple kernel launches, because barrier() only works for work items of the same work group. A clear and 100% working way to insert a global barrier into a kernel is to split it in two at the place where the global barrier is needed.
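For illustration, a minimal sketch of what the corrected local reduction could look like, reusing pi[] and THREADS_PER_WORKGROUP from the question (this only shows the barrier placement, not a drop-in replacement):
// Every work item reaches the barrier on every iteration, and the barrier
// comes before the partial sums written in the previous round are read.
for (int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
    barrier(CLK_LOCAL_MEM_FENCE);   // make earlier writes to pi[] visible to the whole group
    if (get_local_id(0) < groupsize)
        pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
}
barrier(CLK_LOCAL_MEM_FENCE);       // pi[0] now holds the work group's partial sum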
I am implementing an algorithm to compute a graph layout using a force-directed method. I would like to add OpenMP directives to accelerate some loops. After reading some course material, I added some OpenMP directives according to my understanding. The code compiles, but doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
if(i == j) continue;
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
// power used for repulsive force model (standard is 1/r, 1/r^2 works well)
// double coef = 0.0; -C*K*K/dist; // power 1/r
double coef = -C*K*K*K/(dist*dist); // power 1/r^2
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
int j = graph->adjncy[k]; /* edge (i,j) */
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
double coef = dist*dist/K;
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write to a variable shared by threads unless you use atomic access (#pragma omp atomic read/write/update) or synchronize access to that variable explicitly (critical sections, locks). Read some tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use a so-called compare-and-exchange atomic operation to solve this). A much better solution here is to let each thread calculate its local maximum and then reduce those local maxima into a global maximum.
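For illustration, a minimal sketch of that pattern in OpenMP; compute_move() and compute_force2() are hypothetical placeholders for the per-vertex work, not functions from your code:
extern double compute_move(int i);      /* hypothetical per-vertex work */
extern double compute_force2(int i);    /* hypothetical per-vertex work */

void reduce_example(int nvtxs, double *maxmove_out, double *energy_out)
{
    double maxmove = 0.0, energy = 0.0;

    #pragma omp parallel
    {
        double local_max = 0.0;                 /* thread-private maximum, no race */

        #pragma omp for reduction(+:energy)
        for (int i = 0; i < nvtxs; i++) {
            double nmove   = compute_move(i);
            double nforce2 = compute_force2(i);
            energy += nforce2;                  /* safe: energy is a reduction variable */
            if (nmove > local_max)
                local_max = nmove;
        }

        #pragma omp critical
        if (local_max > maxmove)                /* one synchronized update per thread */
            maxmove = local_max;
    }

    *maxmove_out = maxmove;
    *energy_out  = energy;
}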
I have read all the relevant questions I could find, but still could not find a solution to my issue: I have a function with a double for loop that is the bottleneck of my program.
The code is designed around MPI:
Having a big matrix which I scatter among p processes.
Every process now has a submatrix.
Every process is calling update() in a loop.
When the loop terminates, the master process gathers the results.
Now I would like to augment my MPI code with OpenMP to get faster execution, by taking advantage of the double for loop in update().
void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
if ( id / pSqrt == 0) {
i = 2;
end_i = subN - 1;
} else if ( id / pSqrt == (pSqrt - 1) ) {
i = 1;
end_i = subN - 2;
}
#pragma omp parallel for
for ( ; i < end_i; ++i) {
if (id % pSqrt == 0) {
j = 2;
end_j = subN - 1;
} else if ((id + 1) % pSqrt == 0) {
j = 1;
end_j = subN - 2;
}
#pragma omp parallel for
for ( ; j < end_j; ++j) {
gridNextPtr[i][j] = gridPtr[i][j] +
parms.cx * (gridPtr[i+1][j] +
gridPtr[i-1][j] -
2.0 * gridPtr[i][j]) +
parms.cy * (gridPtr[i][j+1] +
gridPtr[i][j-1] -
2.0 * gridPtr[i][j]);
}
}
}
I am running this on 2 computers, each with 2 CPUs, and I am using 4 processes. However, I see no speedup with OpenMP compared to without it. Any ideas, please? I am compiling with the -O1 optimization flag.
High-level analysis
It is a common fallacy that hybrid programming (e.g. MPI+OpenMP) is a good idea. This fallacy is widely espoused by so-called HPC experts, many of whom are paper pushers at supercomputing centers and do not write much code. An expert take-down of the MPI+Threads fallacy is Exascale Computing without Threads.
This is not to say that flat MPI is the best model. For example, MPI experts espouse a two-level MPI-only approach in Bi-modal MPI and MPI+threads Computing on Scalable Multicore Systems and MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory (free version). In the so-called MPI+MPI model, the programmer exploits shared-memory coherence domains using MPI shared memory instead of OpenMP, but with a private-by-default data model, which reduces the incidence of race conditions. Furthermore, MPI+MPI uses only one runtime system, which makes resource management and process topology/affinity easier. In contrast, MPI+OpenMP requires one to use either the fundamentally non-scalable fork-join execution model with threads (i.e. make MPI calls between OpenMP parallel regions) or enable MPI_THREAD_MULTIPLE in order to make MPI calls within threaded regions - and MPI_THREAD_MULTIPLE entails noticeable overhead on today's platforms.
This topic could cover many pages, which I do not have time to write at the moment, so please see the cited links.
Reducing OpenMP runtime overheads
One reason that MPI+OpenMP does not perform as well as pure MPI is that OpenMP runtime overheads have a tendency to appear in too many places. One type of unnecessary runtime overhead comes from nested parallelism. Nested parallelism occurs when one nests one omp parallel construct inside another. Most programmers do not know that a parallel region is a relatively expensive construct, and one should try to minimize them. Furthermore, omp parallel for is the fusion of two constructs - parallel and for - and one should really try to think of these independently. Ideally, you create one parallel region that contains many worksharing constructs, such as for, sections, etc.
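As a minimal sketch of that idea (the arrays a and b, the bound n, and the functions f and g are placeholders, not from the code in question):
extern double f(int i);             /* placeholder work functions */
extern double g(int i);

void two_loops(int n, double *a, double *b)
{
    #pragma omp parallel            /* the thread team is created once... */
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = f(i);            /* first worksharing loop */

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = g(i);            /* second worksharing loop, same team */
    }                               /* ...and joined once, after both loops */
}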
Below is your code modified to use only one parallel region and parallelism across both for loops. Because collapse requires perfect nesting (nothing between the two for loops), I had to move some code inside. However, nothing stops the compiler from hoisting this loop invariant back out after the OpenMP lowering (this is a compiler concept, which you can ignore), so that code might still only execute end_i times, rather than end_i*end_j times.
Update: I've adapted the code from the other answer to demonstrate collapse instead.
There are a variety of ways to parallelize these two loops with OpenMP. Below you can see four versions, all of which are compliant with OpenMP 4. Version 1 is likely to be the best, at least in current compilers. Version 2 uses collapse but not simd (it is compliant with OpenMP 3). Version 3 could be the best, but is harder to implement ideally and does not lead to SIMD code generation with some compilers. Version 4 parallelizes only the outer loop.
You should experiment to see which one of these options is the fastest for your application.
#if VERSION==1
#define OUTER _Pragma("omp parallel for")
#define INNER _Pragma("omp simd")
#elif VERSION==2
#define OUTER _Pragma("omp parallel for collapse(2)")
#define INNER
#elif VERSION==3
#define OUTER _Pragma("omp parallel for simd collapse(2)")
#define INNER
#elif VERSION==4
#define OUTER _Pragma("omp parallel for simd")
#define INNER
#else
#error Define VERSION
#define OUTER
#define INNER
#endif
struct {
float cx;
float cy;
} parms;
void update (int pSqrt, int id, int subN, const float * restrict gridPtr[restrict], float * restrict gridNextPtr[restrict])
{
int beg_i = 1, beg_j = 1;
int end_i = subN - 1, end_j = subN - 1;
if ( id / pSqrt == 0 ) {
beg_i = 2;
} else if ( id / pSqrt == (pSqrt - 1) ) {
end_i = subN - 2;
}
if (id % pSqrt == 0) {
beg_j = 2;
} else if ((id + 1) % pSqrt == 0) {
end_j = subN - 2;
}
OUTER
for ( int i = beg_i; i < end_i; ++i ) {
INNER
for ( int j = beg_j; j < end_j; ++j ) {
gridNextPtr[i][j] = gridPtr[i][j] + parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2 * gridPtr[i][j])
+ parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2 * gridPtr[i][j]);
}
}
}
The example code above is correct with the following compilers:
GCC 5.3.0
Clang-OpenMP 3.5.0
Cray C 8.4.2
Intel 16.0.1.
It will not compile with PGI 11.7, both because of [restrict] (replacing it with [] is sufficient) and because of the OpenMP simd clause. This compiler lacks full support for C99, contrary to this presentation. It's not too surprising that it is not compliant with OpenMP 4, given that it was released in 2011. Unfortunately, I do not have access to a newer version.
What about this version (not tested)?
Please compile it and test it. If it works better, I'll explain it more.
BTW, using some more aggressive compiler options might help too.
void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
int beg_i = 1, beg_j = 1;
int end_i = subN - 1, end_j = subN - 1;
if ( id / pSqrt == 0 ) {
beg_i = 2;
} else if ( id / pSqrt == (pSqrt - 1) ) {
end_i = subN - 2;
}
if (id % pSqrt == 0) {
beg_j = 2;
} else if ((id + 1) % pSqrt == 0) {
end_j = subN - 2;
}
#pragma omp parallel for schedule(static)
for ( int i = beg_i; i < end_i; ++i ) {
#pragma omp simd
for ( int j = beg_j; j < end_j; ++j ) {
gridNextPtr[i][j] = gridPtr[i][j] +
parms.cx * (gridPtr[i+1][j] +
gridPtr[i-1][j] -
2.0 * gridPtr[i][j]) +
parms.cy * (gridPtr[i][j+1] +
gridPtr[i][j-1] -
2.0 * gridPtr[i][j]);
}
}
}
EDIT: Some explanations on what I did to the code...
The initial version was using nested parallelism for no reason at all (a parallel region nested within another parallel region). This was likely to be very counterproductive, so I simply removed it.
The loop indexes i and j were declared and initialised outside of the for loop statements. This is error-prone on two levels: 1/ it may force you to declare their parallel scope (private), whereas having them within the for statement automatically gives them the right one; and 2/ you can get mix-ups by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy.
You were changing the boundaries of the j loop inside the parallel region for no good reason. You would have had to declare end_j private. Moreover, it was a potential limitation for further developments (such as the potential use of the collapse(2) directive), as it broke the rules for the canonical loop form defined in the OpenMP standard. Defining beg_i and beg_j outside of the parallel region therefore made sense, sparing computations and simplifying the form of the loops, keeping them canonical.
From there, the code was suitable for vectorisation, and the addition of a simple simd directive on the j loop enforces it, should the compiler fail to see the possible vectorisation by itself.
I have a for loop which will run many times, and will cost a lot of time:
for (int z=0; z<temp; z++)
{
float findex= a + b * A[z];
int iindex = findex ;
outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
a++;
}
I have tried to optimize this code with SSE, but see no performance improvement. Maybe my SSE code is bad; can anyone help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could alias outArray, in which case no parallelization would be possible.
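For example, a possible signature using restrict; the function name and parameter list here are assumptions for illustration, not taken from the original code:
/* Promises the compiler that the three arrays never overlap */
void interpolate(float * restrict outArray,
                 const float * restrict inArray,
                 const float * restrict A,
                 float a, float b, int temp);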
Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but your current loop only allows one running sum into outArray[z]. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
float findex_v1 = a + b * A[z];
int iindex_v1 = findex_v1;
outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
float findex_v2 = (a+1) + b * A[z+1];
int iindex_v2 = findex_v2;
outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
a+=2;
}
In terms of SIMD, the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations indexed by z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
float4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
    //use SSE for contiguous memory
    float4 findex4 = a4 + b * float4.load(&A[z]);
    int4 iindex4 = truncate_to_int(findex4);
    //don't use SSE for non-contiguous memory
    iindex4.store(indexa);
    for(int i=0; i<4; i++) {
        inArraya[i] = inArray[indexa[i]];
        dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
    }
    //loading from an array right after writing to it causes a CPU stall
    float4 inArraya4 = float4.load(inArraya);
    float4 dinArraya4 = float4.load(dinArraya);
    //back to SSE
    float4 outArray4 = float4.load(&outArray[z]);
    outArray4 += inArraya4 + (findex4 - iindex4)*dinArraya4;
    outArray4.store(&outArray[z]);
    a4 += 4;
}
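If it helps, here is one way the pseudo-code above might translate into actual SSE intrinsics. This is only a sketch, assuming temp is a multiple of 4 and that the arrays do not alias; it has not been benchmarked:
#include <emmintrin.h>  /* SSE2 intrinsics */

void interp_sse(const float *A, const float *inArray, float *outArray,
                float a, float b, int temp)
{
    __m128 a4 = _mm_add_ps(_mm_set1_ps(a), _mm_setr_ps(0.f, 1.f, 2.f, 3.f));
    const __m128 b4 = _mm_set1_ps(b);

    for (int z = 0; z < temp; z += 4) {
        /* findex = a + b*A[z], four lanes at a time */
        __m128  findex4 = _mm_add_ps(a4, _mm_mul_ps(b4, _mm_loadu_ps(&A[z])));
        __m128i iindex4 = _mm_cvttps_epi32(findex4);        /* truncates like (int)findex */

        /* scalar gather of the non-contiguous inArray accesses */
        int idx[4];
        float lo[4], d[4];
        _mm_storeu_si128((__m128i *)idx, iindex4);
        for (int i = 0; i < 4; i++) {
            lo[i] = inArray[idx[i]];
            d[i]  = inArray[idx[i] + 1] - inArray[idx[i]];
        }

        /* back to SSE for the contiguous part */
        __m128 frac = _mm_sub_ps(findex4, _mm_cvtepi32_ps(iindex4));
        __m128 out  = _mm_add_ps(_mm_loadu_ps(&outArray[z]),
                      _mm_add_ps(_mm_loadu_ps(lo),
                                 _mm_mul_ps(frac, _mm_loadu_ps(d))));
        _mm_storeu_ps(&outArray[z], out);

        a4 = _mm_add_ps(a4, _mm_set1_ps(4.f));              /* a advances by 4 per step */
    }
}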
I wanted to optimize the code below using OpenMP:
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;
#define _MSE(m, t) \
val = refData[t] - calData[t]; \
m += val*val;
#pragma omp parallel
{
#pragma omp for
for( i=0; i<(width*height)/2; i++ ) { //yuv422: 2 pixels at a time
_MSE(m_u, 0);
_MSE(m_y, 1);
_MSE(m_v, 2);
_MSE(m_y, 3);
#pragma omp reduction(+:refData) reduction(+:calData)
refData += 4;
calData += 4;
// int id = omp_get_thread_num();
//printf("Thread %d performed %d iterations of the loop\n",id ,i);
}
}
Any suggestion is welcome for optimizing the above code; currently I get wrong output.
I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:
double sqError[4] = {0};
const int numBytes = width * height * 2;
#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
for( int i = elem; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
sqError[elem] += (double)(val*val);
}
}
This way, each thread operates exclusively on one thing and there is no contention.
Maybe it's not the most advanced use of OMP, but you should see a speedup.
After your comment about the performance hit, I did some experiments and found that the performance was indeed worse. I suspect this may be due to cache misses.
You said:
performance hit this time with openMP : Time :0.040637 with serial Time :0.018670
So I reworked it using the reduction on each variable and using a single loop:
#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
for( int i = 0; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
e0 += (double)(val*val);
val = refData[i+1] - calData[i+1];
e1 += (double)(val*val);
val = refData[i+2] - calData[i+2];
e2 += (double)(val*val);
val = refData[i+3] - calData[i+3];
e3 += (double)(val*val);
}
With my test case on a 4-core machine, I observed more than a 4-fold improvement:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:
Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this: struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
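A short sketch of the padding fix described in that quote (assuming a 64-byte cache line):
struct _error {
    double val;         /* the accumulator a thread actually updates */
    double pad[7];      /* 8 doubles = 64 bytes, one cache line per element */
} sqError[4] = {{0}};

/* each thread then accumulates into sqError[elem].val, on its own cache line */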
The code looks like it's calculating the MSE but adding into the same sums (the m arguments). For the parallelism to work properly, you need to eliminate the sharing of those sums; one approach would be preallocating an array (of width*height/2 elements, I imagine) just to store the per-iteration sums. Finally, add up all the sums at the end.
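A rough sketch of that idea, assuming the UYVY layout from the question and, for brevity, only accumulating the Y error:
int n = (width * height) / 2;                  /* number of pixel pairs */
double *partial = calloc(n, sizeof *partial);  /* one slot per iteration */

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    double vy0 = refData[4*i + 1] - calData[4*i + 1];
    double vy1 = refData[4*i + 3] - calData[4*i + 3];
    partial[i] = vy0*vy0 + vy1*vy1;            /* no sharing between iterations */
}

double m_y = 0.0;
for (int i = 0; i < n; i++)                    /* serial combine at the end */
    m_y += partial[i];
free(partial);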
Also, test that this is actually faster!