Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization - c

I'm trying to parallelizing the inner loop of a program that has data dependencies (min) outside the scope of the loops. I'm having an issue where the residual calculations occuring outside the scope of the inner j loop. The code gets errors if the "#pragma omp parallel" part is included on the j loop even if the loop doesn't run at all due to a k value being too low. say (1,2,3) for example.
for (i = 0; i < 10; i++)
{
#pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
for (j = 0; j < k-4; j += 4)
{
mm_a = _mm_load_ps(&x[j]);
mm_b = _mm_load_ps(&y[j]);
mm_a = _mm_add_ps(mm_a, mm_b);
_mm_store_ps(storer, mm_a);
#pragma omp critical
{
if (storer[0] < min)
{
min = storer[0];
}
if (storer[1] < min)
{
min = storer[1];
}
//etc
}
}
do
{
#pragma omp critical
{
if (x[j]+y[j] < min)
{
min = x[j]+y[j];
}
}
}
} while (j++ < (k - 1));
round_min = min
}

The j-based loop is a parallel loop so you cannot use j after the loop. This is especially true since you explicitly put j as private, so only visible locally in the thread but not outside the parallel region. You can explicitly compute the position of the remaining j value using (k-4+3)/4*4 just after the parallel loop.
Furthermore, here is few important points:
You may not really need to vectorize the code yourself: you can use omp simd reduction. OpenMP can do all the boring job of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may also likely be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC does, while Clang and MSVC often need some help).
Critical section (omp critical) are very costly. In your case this will just annihilate any possible improvement related to the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here although some compiler (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions inefficient. Use vertical ones that are much faster and that can be easily used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
// Assume min is already initialized correctly here
#pragma omp parallel for simd reduction(min:min) private(j)
for (j = 0; j < k; ++j)
{
const float tmp = x[j] + y[j];
if(tmp < min)
min = tmp;
}
// Use min here
}
The above code is vectorized correctly on x86 architecture on GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffastmath) and MSVC (with /O2 /fp:precise -openmp:experimental).

Related

Array operations in a loop parallelization with openMP

I am trying to parallelize for loops which are based on array operations. However, I cannot get expected speedup. I guess the way of parallelization is wrong in my implementation.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0;i<nx;i++){
curr[i] = (char*)(curr+nx) + i*ny;
}
#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0;i<nx;i++){
next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i,j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1;i<nx-1;i++){
for(j=1;j<ny-1;j++) {
curr[i][j] = (real_rand() < probability);
sum += curr[i][j];
}
}
Is there any problematic mistake in my way? How can I improve this?
In the first example, the work done by each thread is very little and the overhead from the OpenMP runtime is negating and speedup from the parallel execution. You may try combining both parallel regions together to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0;i<nx;i++){
curr[i] = (char*)(curr+nx) + i*ny;
next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48 uses a global state that is shared between all threads. In single-threaded applications, the state is usually kept in the L1 data cache and there drand48 is really fast. In your case, when one thread updates the state, this change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from the memory (or shared L3 cache). This introduces huge delays and makes dran48 much slower than when used in a single-threaded program. The same applies to the summation in sum, which also computes the wrong value due to data races.
The solution to the first problem is to have separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You have to also seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution of the data race is to use OpenMP reductions:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1;i<nx-1;i++){
for(int j=1;j<ny-1;j++) {
curr[i][j] = (real_rand() < probability);
sum += curr[i][j];
}
}

What is the best way to parellize using openMP in c

I am beginner in using OpenMP in C. I am trying to parallelize four nested loops. I've read that it's advisable to only parallelize the outer loop but it's taking a very long time.
What is the best way to parallelize the following
int nt=2500, nx=400; nz=200; nh=50;
#pragma omp parallel for
for(it=0; it<nt; it++)
for(ix=0; ix<nx; ix++)
for(iz=0; iz<nz; iz++)
for(ih=-nh; ih<=nh; ih++) {
if (ix+ih<nx && ix+ih>=0 && ix-ih<nx && ix-ih>=0 ) {
dR[it][ix+ih][iz] += ii[ih+nh][ix][iz]*us[it][ix-ih][iz];
dS[it][ix-ih][iz] += ii[ih+nh][ix][iz]*ur[it][ix+ih][iz];
}
}
As far as data races go, it is unsafe to parallelize loops in a way that causes the same memory location to be accessed by two different threads and at least one access is a write.
You never read and write to the same variable, so it should be safe to parallelize every loop. (Not necessarily more efficient though)
Your actual loops can also be rewritten.
Your if condition can be written logically as 0 <= ix+ih < nx && 0 <= ix-ih < nx, or in other words you only want to write between 0 and nx.
If we can show that the ranges of ix+ih and ix-ih is larger than 0 to nx we can eliminate the check and manually loop over those ranges.
Examining the loops we see that 0 < ix < nx and -nh < ih < nh which allows us to find the ranges of ix+ih and ix-ih.
ix+ih ranges from -nh to nx + nh and ix-ih ranges from -nh to nx+nh. Both of these ranges are larger than 0,nx as long as nh is positive and so we don't need to do a check at all. We can just loop from 0 to nx.
omp_set_nested(1);
#pragma omp parallel for
for(it=0; it<nt; it++) {
#pragma omp parallel for
for (iy = 0; iy < nx; iy++) {
#pragma omp parallel for
for(iz=0; iz<nz; iz++) {
dR[it][iy][iz] += ii[ih+nh][ix][iz] * us[it][ix-ih][iz] ;
dS[it][iy][iz] += ii[ih+nh][ix][iz] * ur[it][ix+ih][iz] ;
}
}
}

Reductions in parallel in logarithmic time

Given n partial sums it's possible to sum all the partial sums in log2 parallel steps. For example assume there are eight threads with eight partial sums: s0, s1, s2, s3, s4, s5, s6, s7. This could be reduced in log2(8) = 3 sequential steps like this;
thread0 thread1 thread2 thread4
s0 += s1 s2 += s3 s4 += s5 s6 +=s7
s0 += s2 s4 += s6
s0 += s4
I would like to do this with OpenMP but I don't want to use OpenMP's reduction clause. I have come up with a solution but I think a better solution can be found maybe using OpenMP's task clause.
This is more general than scalar addition. Let me choose a more useful case: an array reduction (see here, here, and here for more about array reductions).
Let's say I want to do an array reduction on an array a. Here is some code which fills private arrays in parallel for each thread.
int bins = 20;
int a[bins];
int **at; // array of pointers to arrays
for(int i = 0; i<bins; i++) a[i] = 0;
#pragma omp parallel
{
#pragma omp single
at = (int**)malloc(sizeof *at * omp_get_num_threads());
at[omp_get_thread_num()] = (int*)malloc(sizeof **at * bins);
int a_private[bins];
//arbitrary function to fill the arrays for each thread
for(int i = 0; i<bins; i++) at[omp_get_thread_num()][i] = i + omp_get_thread_num();
}
At this point I have have an array of pointers to arrays for each thread. Now I want to add all these arrays together and write the final sum to a. Here is the solution I came up with.
#pragma omp parallel
{
int n = omp_get_num_threads();
for(int m=1; n>1; m*=2) {
int c = n%2;
n/=2;
#pragma omp for
for(int i = 0; i<n; i++) {
int *p1 = at[2*i*m], *p2 = at[2*i*m+m];
for(int j = 0; j<bins; j++) p1[j] += p2[j];
}
n+=c;
}
#pragma omp single
memcpy(a, at[0], sizeof *a*bins);
free(at[omp_get_thread_num()]);
#pragma omp single
free(at);
}
Let me try and explain what this code does. Let's assume there are eight threads. Let's define the += operator to mean to sum over the array. e.g. s0 += s1 is
for(int i=0; i<bins; i++) s0[i] += s1[i]
then this code would do
n thread0 thread1 thread2 thread4
4 s0 += s1 s2 += s3 s4 += s5 s6 +=s7
2 s0 += s2 s4 += s6
1 s0 += s4
But this code is not ideal as I would like it.
One problem is that there are a few implicit barriers which require all the threads to sync. These barriers should not be necessary. The first barrier is between filling the arrays and doing the reduction. The second barrier is in the #pragma omp for declaration in the reduction. But I can't use the nowait clause with this method to remove the barrier.
Another problem is that there are several threads that don't need to be used. For example with eight threads. The first step in the reduction only needs four threads, the second step two threads, and the last step only one thread. However, this method would involve all eight threads in the reduction. Although, the other threads don't do much anyway and should go right to the barrier and wait so it's probably not much of an issue.
My instinct is that a better method can be found using the omp task clause. Unfortunately I have little experience with the task clause and all my efforts so far with it do a reduction better than what I have now have failed.
Can someone suggest a better solution to do the reduction in logarithmic time using e.g. OpenMP's task clause?
I found a method which solves the barrier problem. This reduces asynchronously. The only remaining problem is that it still puts threads which don't participate in the reduction into a busy loop. This method uses something like a stack to push pointers to the stack (but never pops them) in critical sections (this was one of the keys as critical sections don't have implicit barriers. The stack is operated on serially but the reduction in parallel.
Here is a working example.
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>
void foo6() {
int nthreads = 13;
omp_set_num_threads(nthreads);
int bins= 21;
int a[bins];
int **at;
int m = 0;
int nsums = 0;
for(int i = 0; i<bins; i++) a[i] = 0;
#pragma omp parallel
{
int n = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
at = (int**)malloc(sizeof *at * n * 2);
int* a_private = (int*)malloc(sizeof *a_private * bins);
//arbitrary fill function
for(int i = 0; i<bins; i++) a_private[i] = i + omp_get_thread_num();
#pragma omp critical (stack_section)
at[nsums++] = a_private;
while(nsums<2*n-2) {
int *p1, *p2;
char pop = 0;
#pragma omp critical (stack_section)
if((nsums-m)>1) p1 = at[m], p2 = at[m+1], m +=2, pop = 1;
if(pop) {
for(int i = 0; i<bins; i++) p1[i] += p2[i];
#pragma omp critical (stack_section)
at[nsums++] = p1;
}
}
#pragma omp barrier
#pragma omp single
memcpy(a, at[2*n-2], sizeof **at *bins);
free(a_private);
#pragma omp single
free(at);
}
for(int i = 0; i<bins; i++) printf("%d ", a[i]); puts("");
for(int i = 0; i<bins; i++) printf("%d ", (nthreads-1)*nthreads/2 +nthreads*i); puts("");
}
int main(void) {
foo6();
}
I sill feel a better method may be found using tasks which does not put the threads not being used in a busy loop.
Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.
void operation(int* p1, int* p2, size_t bins)
{
for (int i = 0; i < bins; i++)
p1[i] += p2[i];
}
void reduce(int** arrs, size_t bins, int begin, int end)
{
assert(begin < end);
if (end - begin == 1) {
return;
}
int pivot = (begin + end) / 2;
/* Moving the termination condition here will avoid very short tasks,
* but make the code less nice. */
#pragma omp task
reduce(arrs, bins, begin, pivot);
#pragma omp task
reduce(arrs, bins, pivot, end);
#pragma omp taskwait
/* now begin and pivot contain the partial sums. */
operation(arrs[begin], arrs[pivot], bins);
}
/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);
As far as i can tell, there are no unnecessary synchronizations and there is no weird polling on critical sections. It also works naturally with a data size different than your number of ranks. I find it very clean and easy to understand. So I do indeed think this is better than both of your solutions.
But let's look at how it performs in practice*. For that we can use Score-p and Vampir:
*bins=10000 so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo. gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing
The picture reveals what is happening at any thread within the application on a horizontal time-axis. The tree implementations from top to bottom:
omp for loop
omp critical kind of tasking.
omp task
This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: Does the core does have this data in it's own cache / memory of it's own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in the simple comparison.
If the reduction operation becomes variable in runtime, then the task solution will become better than thy synchronized for loop.
The critical solution has some interesting aspect, the passive threads are not continuously waiting, so they will more likely consume CPU resources. This can be bad for performance e.g. in case of turbo mode.
Remember that the task solution has more optimization potential by avoiding spawning tasks that immediately return. How these solutions perform also highly depends on the specific OpenMP runtime. Intel's runtime seems to do much worse for tasks.
My recommendation is:
Implement the most maintainable solution with optimal algorithmic
complexity
Measure which parts of the code actually matter for run-time
Analyze based on actual measurements what is the bottleneck. In my experience it is more about NUMA and scheduling rather than some unnecessary barrier.
Perform the micro-optimization based on your actual measurements
Linear solution
Here is the timeline for the linear proccess_data_v1 from this question.
OpenMP 4 Reduction
So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I do initialize the worker array with NULL and simply move the pointer the first time:
void meta_op(int** pp1, int* p2, size_t bins)
{
if (*pp1 == NULL) {
*pp1 = p2;
return;
}
operation(*pp1, p2, bins);
}
// ...
// declare before parallel region as global
int* awork = NULL;
#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)
#pragma omp for reduction(merge : awork)
for (int t = 0; t < n; t++) {
meta_op(&awork, at[t], bins);
}
Surprisingly, this doesn't look too good:
top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.
Both seem to implement the reduction serialized. I tried to look into gcc / libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to be wrapping the final merge in a GOMP_atomic_start/end - and that seems to be a global mutex. Similarly icc wraps the call to the operation in a kmpc_critical. I suppose there wasn't much optimization going into costly custom reduction operations. A traditional reduction can be done with a hardware-supported atomic operation.
Notice how each operation is faster because the input is cached locally, but due to the serialization it is overall slower. Again this is not a perfect comparison due to high variances, and earlier screenshots were with different gcc version. But the trend is clear, and I also have data on the cache effects.

OpenMP - make array counter with thread different start number

I am strugling with quite tricky (I guess) problem with parallel file reading.
Right now I have maped file using mmap and I want it to read values and put into three arrays. Well, maybe explanations isn't so clear, so this is my current code:
int counter = header;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) firstprivate(counter)
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
green[i][j] = map[counter+1];
blue[i][j] = map[counter+2];
counter+=3;
}
}
the header is beginning of a file - just image related info like size and some comments.
The problem here is, that counter has to be private. What I am finding difficult is to come up with dividing the counter across threads with different start numbers.
Can someone give me a hint how it can be achieved?
You could, if I've understood your code correctly, leave counter as a shared variable and enclose the update to it in a single section, something like
blue[i][j] = map[counter+2];
#pragma omp single
{
counter+=3;
}
}
I write something like because I haven't carefully checked the syntax nor considered the usefulness of some of the optional clauses for the single directive. I suspect that this might prove a real drag on performance.
An alternative would be to radically re-order your loop nest, perhaps (and again this is not carefully checked):
#pragma omp parallel shared (red,green,blue,map) private (i,j)
#pragma for collapse(2) schedule(hmmm, this needs some thought)
for(counter = 0; counter < counter_max; counter += 3)
{
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
}
}
}
... loop for blue now ...
... loop for green now ...
Note that in this version OpenMP will collapse the first two loops into one iteration space, I expect this will provide better performance than uncollapsed looping but it's worth experimenting to figure that out. It's probably also worth experimenting with the schedule clause.
If I understood well your problem is that of initializing counter to different values for different threads.
If this is the case, then it can be solved like the linearization of matrix indexes:
size_t getLineStart(size_t ii, size_t start, size_t length) {
return start + ii*length*3;
}
int counter;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) private(counter)
for(size_t ii = 0; ii < x; ii++) {
////////
// Get the starting point for the current iteration
counter = getLineStart(ii,header,y);
////////
for(size_t jj = 0; jj < y; jj++) {
red [i][j] = map[counter ];
green[i][j] = map[counter+1];
blue [i][j] = map[counter+2];
counter+=3;
}
}
These kind of counters are useful when it's hard to know how many values will be counted but then they don't work easily with OpenMP. But in your case it's trivial to figure out what the index is: 3*(y*i+j). I suggest fusing the loop and doing something like this:
#pragma omp parallel for
for(n=0; n<x*y; n++) {
(&red [0][0])[n] = map[3*n+header+0];
(&green[0][0])[n] = map[3*n+header+1];
(&blue [0][0])[n] = map[3*n+header+2];
}

OpenMP optimizations?

I can't figure out why the performance of this function is so bad. I have a core 2 Duo machine and I know its only creating 2 trheads so its not an issue of too many threads. I expected the results to be closer to my pthread results.
these are my compilation flags (purposely not doing any optimization flags)
gcc -fopenmp -lpthread -std=c99 matrixMul.c -o matrixMul
These are my results
Sequential matrix multiply: 2.344972
Pthread matrix multiply: 1.390983
OpenMP matrix multiply: 2.655910
CUDA matrix multiply: 0.055871
Pthread Test PASSED
OpenMP Test PASSED
CUDA Test PASSED
void openMPMultiply(Matrix* a, Matrix* b, Matrix* p)
{
//int i,j,k;
memset(*p, 0, sizeof(Matrix));
int tid, nthreads, i, j, k, chunk;
#pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
}
chunk = 20;
// #pragma omp parallel for private(i, j, k)
#pragma omp for schedule (static, chunk)
for(i = 0; i < HEIGHT; i++)
{
//printf("Thread=%d did row=%d\n",tid,i);
for(j = 0; j < WIDTH; j++)
{
//#pragma omp parallel for
for(k = 0; k < KHEIGHT ; k++)
(*p)[i][j] += (*a)[i][k] * (*b)[k][j];
}
}
}
}
Thanks for any help.
As matrix multiplication is an embarrassingly parallel, its speedup should be near 2 on a dual core. Matrix multiplication even typically shows a superlinear speedup (greater than 2 on a dual core) due to reduced cache misses. I don't see obvious mistakes by looking your code, but something's wrong. Here is my suggestions:
Just double-check the number of worker threads. In your case, 2 threads should be created. Or, try to set by calling omp_set_num_threads. Also, see whether 2 cores are fully utilized (i.e., 100% CPU utilization on Windows, 200% on Linux).
Clean up your code by removing unnecessary nthreads and chunk. These can be prepared outside of the parallel section. But, even if so, it shouldn't hurt speedup.
Are matrices square (i.e., HEIGHT == WIDTH == KHEIGHT)? If it's not a square matrix, then there could be workload imbalance that can hurt speedup. But, given the speedup of pthread (around 1.6, which is also odd to me), I don't think there's too much workload imbalance.
Try to use a default static scheduling: don't specify chunk, just write #pragma omp for.
My best guess is that the structure of Matrix could be problematic. What exactly Matrix looks like? In worst case, false sharing could significantly hurt performance. But, in such simple matrix multiplication, false sharing shouldn't be a big problem. (If you don't know the detail, I may explain more details).
Although you commented out, never put #pragma omp parallel for at for-k, which causes nested parallel loop. In matrix multiplication, it's absolutely wasteful as the outer most loop is parallelizable.
Finally, try to run the following very simple OpenMP matrix multiplication code, and see the speedup:
double A[N][N], B[N][N], C[N][N];
#pragma omp parallel for
for (int row = 0; row < N; ++row)
for (int col = 0; col < N; ++col)
for (int k = 0; k < N; ++k)
C[row][col] += A[row][k]*B[k][col];

Resources