How to avoid fork-join when calling cblas_sgemm in MKL?

The code is like this:
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group A>);
When the matrices are not very large, the fork-join cost is very noticeable, especially when this runs on MIC. Besides, splitting the work by hand causes problems on MIC, as MKL Performance on Intel Phi shows.
// split the left-hand and result matrices by hand
// (not a wise solution on MIC)
#pragma omp parallel
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group B>);
Is there a technique that would let me write:
#pragma omp parallel
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group A>);
where cblas_sgemm reuses the threads forked outside the for loop, since MKL also uses OpenMP to create its threads?
Sincerely, FatRabb1t.

You could do that by linking the sequential version of MKL, so that cblas_sgemm will not fork multiple threads to calculate the matrix.
You can then use an OpenMP parallel for to speed up your code:
#pragma omp parallel for
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group B>);
This way, you fork and join the threads only once instead of loop_count times.
If you are using the Intel compiler (icc/icpc), you can link the sequential MKL with the compiler option -mkl=sequential instead of -mkl.
If you are using another compiler such as gcc, the MKL Link Line Advisor can generate the desired link options:
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
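For concreteness, here is a minimal sketch of that pattern; the square n-by-n sizes, the array-of-pointers layout, and the function name are illustrative assumptions, not from the question:
/* Sequential MKL inside an OpenMP-parallel loop: each iteration runs its
 * own independent small GEMM, so the only fork-join is the one around the
 * loop. Link against sequential MKL (-mkl=sequential with icc, or the
 * equivalent line from the link line advisor with gcc). */
#include <mkl.h>

void batched_sgemm(int loop_count, int n, float **A, float **B, float **C)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < loop_count; i++) {
        /* C[i] = 1.0 * A[i] * B[i] + 0.0 * C[i], all n x n, row-major */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0f, A[i], n, B[i], n,
                    0.0f, C[i], n);
    }
}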

Related

Array operations in a loop parallelization with openMP

I am trying to parallelize for loops that are based on array operations. However, I am not getting the expected speedup, and I suspect the way I parallelized them is wrong.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);

int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for (i = 0; i < nx; i++) {
    curr[i] = (char*)(curr + nx) + i*ny;
}

#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for (i = 0; i < nx; i++) {
    next[i] = (char*)(next + nx) + i*ny;
}
And here is another:
int i, j, sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for (i = 1; i < nx-1; i++) {
    for (j = 1; j < ny-1; j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
Is there a mistake in my approach? How can I improve it?
In the first example, the work done by each thread is very little and the overhead of the OpenMP runtime negates any speedup from the parallel execution. You may try combining both parallel regions to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for (int i = 0; i < nx; i++) {
    curr[i] = (char*)(curr + nx) + i*ny;
    next[i] = (char*)(next + nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48 uses a global state that is shared between all threads. In single-threaded applications, the state is usually kept in the L1 data cache and there drand48 is really fast. In your case, when one thread updates the state, the change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from memory (or the shared L3 cache). This introduces huge delays and makes drand48 much slower than in a single-threaded program. The same applies to the summation in sum, which also computes the wrong value due to data races.
The solution to the first problem is to have a separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You also have to seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution to the data race is to use OpenMP reductions:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for (int i = 1; i < nx-1; i++) {
    for (int j = 1; j < ny-1; j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
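A minimal sketch of the per-thread erand48() approach described above; the seeding scheme and the function wrapper are illustrative assumptions:
#define _XOPEN_SOURCE 600   /* for erand48 on glibc */
#include <stdlib.h>
#include <omp.h>

int fill_random(char **curr, int nx, int ny, double probability)
{
    int sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        /* private 48-bit PRNG state, seeded differently on each thread to
         * avoid both cache-line ping-pong and correlated streams */
        int rank = omp_get_thread_num();
        unsigned short xsubi[3] = { 0x330E,
                                    (unsigned short)rank,
                                    (unsigned short)(rank * 7919) };
        #pragma omp for collapse(2) schedule(static)
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++) {
                curr[i][j] = (erand48(xsubi) < probability);
                sum += curr[i][j];
            }
    }
    return sum;
}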

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be run serially and which calls the new_value() function:
for (int i = 0; i < MAX_VAL; i++)
    new_value();
This function opens a parallel region on each call:
void new_value()
{
    #pragma omp parallel default(shared)
    {
        int thread_rank = omp_get_thread_num();
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            arr[i] = update(thread_rank);
    }
}
This works, but there is a significant amount of overhead associated with spawning and terminating the threads. I was wondering if anyone knew a way to spawn the threads (and obtain thread_rank) before entering the loop, without parallelising the loop itself?
There are several questions asking the same thing, but they are either wrong or unanswered. Examples include:
This question asks a similar thing, and the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop; but as 'Joe C' said in the answer comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing, but the (unaccepted) answer just parallelises the outer-most loop, running the loop 4000 * num_threads times, which is neither what the asker wanted nor what I want.
The answer to your second question is actually correct.
#pragma omp parallel
for (int i = 0; i < MAX_VAL; i++)
    new_value();

void new_value()
{
    int thread_rank = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}
This is correct and exactly what you want. It has the same semantics as the code in your question. The difference is that there is only one parallel region and that the loop variable i is now computed by the whole team. Note that the outer loop is not parallelized in a worksharing manner (no omp parallel for).
So when this code is run, num_threads threads will execute the loop header, call new_value, and reach the orphaned omp for, all with their private i == 0. They will share the work of the inner loop, then wait at the implicit barrier until everyone has completed the loop, then increment their private i and repeat. I hope it is clear now that this has the same behavior with respect to the inner loop as before, with less thread management overhead.
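To make the mechanics concrete, here is a self-contained sketch of that pattern; N, MAX_VAL, and the trivial update() body are placeholders:
#include <stdio.h>
#include <omp.h>

#define N        1000
#define MAX_VAL  100

static double arr[N];

static double update(int thread_rank)
{
    return (double)thread_rank;   /* stand-in for the real computation */
}

static void new_value(void)
{
    int thread_rank = omp_get_thread_num();
    /* orphaned worksharing loop: binds to the enclosing parallel region */
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}

int main(void)
{
    /* one fork for the whole run; every thread executes the serial loop,
     * and the implicit barrier of the omp for keeps them in lockstep */
    #pragma omp parallel
    for (int i = 0; i < MAX_VAL; i++)
        new_value();
    printf("arr[0] = %f\n", arr[0]);
    return 0;
}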

how to avoid overhead of openMP in nested loops

I have two versions of code that produce equivalent results, in which I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup, though I didn't expect 1-to-1 scaling since I'm only parallelizing the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for (i = 0; i < 2000000; i++) {
    sum = 0;
    #pragma omp parallel for private(j) reduction(+:sum)
    for (j = 0; j < 1000; j++) {
        sum += 1;
    }
    final += sum;
}
printf("final=%d\n", final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for (i = 0; i < 2000000; i++) {
    sum = 0;
    #pragma omp barrier
    #pragma omp for reduction(+:sum)
    for (j = 0; j < 1000; j++) {
        sum += 1;
    }
    #pragma omp single
    final += sum;
}
printf("final=%d\n", final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?
An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
You appear to be retracing the steps of Amdahl's law, which relates the benefit of parallelism to its own overhead: no matter how much parallelism you put into a program, its serial portion and the parallel overhead put a ceiling on the speedup. Parallelism only starts to improve run time/performance when the program has enough work to amortize the extra cost.
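For reference, Amdahl's law says that a program with parallel fraction p run on n threads can speed up by at most

S(n) = 1 / ((1 - p) + p / n)

and any per-region fork-join or scheduling overhead only pushes the observed speedup further below that bound. With just 1000 additions per inner loop here, that overhead dominates, which is why neither version gets anywhere near a 4x speedup on 4 threads.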

Do all lines inside an OpenACC kernels construct run on the GPU?

I am wondering about the structure of the kernels construct: does every line inside a kernels region run on the GPU?
For example, I have this code:
#pragma acc kernels copy(a[0:n], b[0:n])
{
    #pragma acc loop
    for (i = 0; i < n; i++)
        a[i] = i + 10;

    a[1] = 10;
    a[3] = 5;

    #pragma acc loop
    for (i = 0; i < n; i++)
        b[i] = i + 20;
}
Also, is the situation the same for the acc parallel construct?
Quoting the spec about the kernels construct:
The compiler will break the code in the kernels region into a sequence
of accelerator kernels. Typically, each loop nest will be a distinct
kernel. When the program encounters a kernels construct, it will
launch the sequence of kernels in order on the device.
So the sequence
a[1] = 10;
a[3] = 5;
that you have put between the two loops could be executed on the device. The problem is that, since this code is not in a loop, the OpenACC compiler has to create a "fake" loop with just one iteration to execute it on the GPU.
Since it's often slower to do this, some OpenACC compilers prefer to execute such sequential lines on the host, after having downloaded the data.
For the parallel construct, the answer is simpler: all code inside the region is always executed on the device.
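If you prefer to make that placement explicit rather than leave it to the compiler, one option (a sketch, reusing the a, b, and n from the question) is to keep the scalar stores on the host inside a data region and push just those elements to the device with update:
#pragma acc data copy(a[0:n], b[0:n])
{
    #pragma acc parallel loop
    for (i = 0; i < n; i++)
        a[i] = i + 10;

    /* run the two scalar stores on the host copy of a, then
     * synchronize only those elements to the device */
    a[1] = 10;
    a[3] = 5;
    #pragma acc update device(a[1:1])
    #pragma acc update device(a[3:1])

    #pragma acc parallel loop
    for (i = 0; i < n; i++)
        b[i] = i + 20;
}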

OpenMP optimizations?

I can't figure out why the performance of this function is so bad. I have a Core 2 Duo machine and I know it's only creating 2 threads, so it's not an issue of too many threads. I expected the results to be closer to my pthread results.
These are my compilation flags (purposely not using any optimization flags):
gcc -fopenmp -lpthread -std=c99 matrixMul.c -o matrixMul
These are my results:
Sequential matrix multiply: 2.344972
Pthread matrix multiply: 1.390983
OpenMP matrix multiply: 2.655910
CUDA matrix multiply: 0.055871
Pthread Test PASSED
OpenMP Test PASSED
CUDA Test PASSED
void openMPMultiply(Matrix* a, Matrix* b, Matrix* p)
{
    //int i,j,k;
    memset(*p, 0, sizeof(Matrix));
    int tid, nthreads, i, j, k, chunk;

    #pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
        }
        chunk = 20;

        // #pragma omp parallel for private(i, j, k)
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < HEIGHT; i++)
        {
            //printf("Thread=%d did row=%d\n", tid, i);
            for (j = 0; j < WIDTH; j++)
            {
                //#pragma omp parallel for
                for (k = 0; k < KHEIGHT; k++)
                    (*p)[i][j] += (*a)[i][k] * (*b)[k][j];
            }
        }
    }
}
Thanks for any help.
As matrix multiplication is embarrassingly parallel, its speedup should be near 2 on a dual core. Matrix multiplication even typically shows superlinear speedup (greater than 2 on a dual core) due to reduced cache misses. I don't see an obvious mistake from looking at your code, but something's wrong. Here are my suggestions:
Just double-check the number of worker threads. In your case, 2 threads should be created; or try setting the number explicitly by calling omp_set_num_threads. Also check whether both cores are fully utilized (i.e., 100% CPU utilization on Windows, 200% on Linux).
Clean up your code by removing the unnecessary nthreads and chunk. These can be prepared outside of the parallel section, but even so, they shouldn't hurt speedup.
Are the matrices square (i.e., HEIGHT == WIDTH == KHEIGHT)? If not, there could be a workload imbalance that hurts speedup. But given the pthread speedup (around 1.6, which is also odd to me), I don't think there is too much workload imbalance.
Try using the default static scheduling: don't specify chunk, just write #pragma omp for.
My best guess is that the structure of Matrix could be problematic. What exactly does Matrix look like? In the worst case, false sharing could significantly hurt performance, though in such a simple matrix multiplication it shouldn't be a big problem. (If you're not familiar with false sharing, I can explain in more detail.)
Although you commented it out, never put #pragma omp parallel for on the k loop: that creates a nested parallel region, which in matrix multiplication is absolutely wasteful since the outermost loop is already parallelizable.
Finally, try running the following very simple OpenMP matrix multiplication code and see the speedup:
double A[N][N], B[N][N], C[N][N];   /* C must be zero-initialized first */

#pragma omp parallel for
for (int row = 0; row < N; ++row)
    for (int col = 0; col < N; ++col)
        for (int k = 0; k < N; ++k)
            C[row][col] += A[row][k] * B[k][col];
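If this simple version scales close to 2x on your two cores, the problem is in your Matrix type or the scheduling rather than in OpenMP itself; if it doesn't, go back and check the thread count and CPU utilization first.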
