Do all lines inside an OpenACC kernels region always run on the GPU? - c

I have a question about the kernels construct. Is it possible that not every line inside a kernels region runs on the GPU?
For example, I have this code:
#pragma acc kernels copy(a[0:n],b[0:n])
{
#pragma acc loop
for (i = 0; i < n; i++)
a[i] = i+10;
a[1] = 10;
a[3] = 5;
#pragma acc loop
for (i = 0; i < n; i++)
b[i] = i+20;
}
Also, is the situation the same for the acc parallel construct?

Quoting the spec, about kernels construct:
The compiler will break the code in the kernels region into a sequence
of accelerator kernels. Typically, each loop nest will be a distinct
kernel. When the program encounters a kernels construct, it will
launch the sequence of kernels in order on the device.
So the sequence
a[1] = 10;
a[3] = 5;
that you have put between the two loops could be executed on the device. The problem is that, since this code is not in a loop, the OpenACC compiler has to create a "fake" loop with just one iteration to execute it on the GPU.
Since this is often slower, some OpenACC compilers prefer to execute such sequential lines on the host, after copying the data back from the device.
For parallel regions, the answer is simpler: all code is always executed on the device.
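To make the placement explicit rather than leave it to the compiler, one option is to split the work into separate compute regions inside a data region. The sketch below is only an illustration under assumptions: `fill_arrays` is a hypothetical helper name, `n` is assumed to be at least 4, and how efficiently a given compiler runs the single-gang region is implementation-dependent. Code outside a loop in a parallel region is executed redundantly by every gang, which is why the scalar writes are limited to one gang.

```c
/* Hypothetical helper mirroring the question's code; assumes n >= 4. */
void fill_arrays(int *a, int *b, int n)
{
    /* Keep both arrays resident on the device for the whole sequence. */
    #pragma acc data copy(a[0:n], b[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] = i + 10;

        /* One gang and no loop: the scalar writes run once, on the device. */
        #pragma acc parallel num_gangs(1) present(a[0:n])
        {
            a[1] = 10;
            a[3] = 5;
        }

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; i++)
            b[i] = i + 20;
    }
}
```

Built without OpenACC support, the pragmas are ignored and the function runs serially on the host with the same results.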

Related

Make out-of-order CPU run instructions in-order

Consider the loop:
for (int i = 0; i < n; i++) {
sum += a[i];
}
An out-of-order CPU can execute many instructions in advance; for example, it can have 20 pending loads of a[i] in flight from 20 different iterations of the loop.
But for me this is a hindrance. I want the CPU to behave like an in-order CPU: it should not start the load for the next iteration until the load in the current iteration has finished.
The reason is simple: I want to save memory bandwidth for other processes running on other CPU cores. This process is low priority, and I want to limit it as much as possible even though it will get slower.
Two techniques come to mind: fake loop-carried dependencies and memory barriers.
For fake dependencies, something like this can be used:
double* a_current = a;
for (int i = 0; i < n; i++) {
volatile int a_val = *a_current;
sum += a_val;
a_current += 1 + (a_val - a_val);
}
This is horrible code and I wonder if there is something better.
About memory barriers, I know almost nothing. What could be useful there?

How to avoid fork-join when calling cblas_sgemm in MKL?

The code is like this:
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
When the matrix is not very large, the fork-join cost is very noticeable, especially when this runs on MIC. Besides, splitting the work by hand causes problems on MIC, as MKL Performance on Intel Phi shows.
//separate the left and result matrix by hand.
//not a wise solution on MIC
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
Is there a technique that lets me write code like this:
#pragma omp parallel
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group A>);
where cblas_sgemm uses the threads forked outside the for loop, since MKL also uses OpenMP to create threads?
Sincerely, FatRabb1t.
You could do that by linking the sequential version of MKL, so that cblas_sgemm does not fork multiple threads to compute the matrix product.
On the other hand, you could use an OpenMP parallel for to speed up your code.
#pragma omp parallel for
for(int i = 0; i < loop_count; i++)
cblas_sgemm(<paras group B>);
This way, you fork-join the threads only once instead of loop_count times.
If you are using the Intel compiler (icc/icpc), you can link the sequential MKL with the compiler option -mkl=sequential instead of -mkl.
If you are using another compiler such as gcc, the MKL link line advisor can help you generate the right link line options.
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
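As a rough illustration of the single fork-join pattern, the sketch below runs a batch of independent small multiplications inside one parallel for. It is a sketch under assumptions: `small_gemm` is a hypothetical stand-in for a sequential cblas_sgemm call (so the example builds without MKL), `batched_gemm` and the packed buffer layout are made up for the example, and every iteration is assumed to write its own output matrix.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sequential cblas_sgemm: C = A * B for
   row-major n x n matrices. In real code this would be the MKL call,
   linked against the sequential MKL library. */
static void small_gemm(int n, const float *A, const float *B, float *C)
{
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[r * n + k] * B[k * n + c];
            C[r * n + c] = acc;
        }
}

/* Run `count` independent multiplications with a single fork-join.
   Matrix i occupies elements [i*n*n, (i+1)*n*n) of each buffer. */
void batched_gemm(int count, int n, const float *A, const float *B, float *C)
{
    #pragma omp parallel for
    for (int i = 0; i < count; i++) {
        size_t off = (size_t)i * n * n;
        small_gemm(n, A + off, B + off, C + off);  /* no threads forked in here */
    }
}
```

Because each call is sequential, the only threads in play are the ones created by the outer parallel for, so the team is forked and joined once per batch rather than once per multiplication.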

Specify which positions in an array a thread access

I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions each thread is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19 and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80
int main (int argc, char *argv[])
{
int nthreads = 8, tid, i, base, a[N];
#pragma omp parallel
{
tid = omp_get_thread_num();
base = ((float)tid/(float)nthreads) * N;
for (i = 0; i < N/nthreads; i++) {
a[base + i] = 0;
printf("%d %d\n", tid, base+i);
}
}
return 0;
}
This program, however, doesn't access all the positions I expected. The output is different every time I run it, and it might be, for example:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.
The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for(i = 0; i < 8; i++) {
for(j = 0; j < 10; j++) {
a[10*i+j] = 0;
printf("thread %d updated element %d\n", omp_get_thread_num(), 10*i+j);
}
}
I was unable to test this right now but I'm 90% sure this does exactly what you want (and you have "complete control" over how things work when you do it like this). However, it may not be the most efficient thing to do. For one thing, when you just want to set a bunch of elements to zero, you want to use a built-in function like memset, not a loop...
You're missing a fair bit. The directive
#pragma omp parallel
only tells the run time that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work your code will need another directive, something like this
#pragma omp parallel
{
#pragma omp for
...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular of getting wrong matters that the compiler and run-time will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads investigate the schedule clause. I suggest that you start your parallel region something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a,N)
for (i = 0; i < N; i++) {
a[i] = 0;
}
Note also that I have specified the accessibility of variables. This is a good practice, especially when learning OpenMP. The compiler will make i private automatically. (If N is a preprocessor macro, as in your program, leave it out of the shared clause; only variables can be listed there.)
With the default schedule used by most implementations (static), the run-time will divide the iterations over i into chunks, one for each thread. The first thread will get i = 0..(N/num_threads)-1, the second i = N/num_threads..(2N/num_threads)-1, and so on.
Later you can add a schedule clause explicitly to the directive. With most implementations, what I have written above is equivalent to
#pragma omp parallel for default(none) shared(a,N) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a,N) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.
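If the goal really is "thread t handles elements 10t..10t+9", a static schedule with an explicit chunk size gets close without manual index arithmetic. The following is a minimal sketch, assuming the runtime actually grants the 8 requested threads; `fill_blocks`, `owner` and `NTHREADS` are names invented for the example, and the serial fallback for omp_get_thread_num is only there so the code also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define N 80
#define NTHREADS 8

/* Zero a[] and record in owner[] which thread wrote each element. */
void fill_blocks(int a[N], int owner[N])
{
    int i;
    /* Chunks of N/NTHREADS iterations are dealt out round-robin in
       thread order, so with 8 threads thread t gets t*10 .. t*10+9. */
    #pragma omp parallel for num_threads(NTHREADS) \
            schedule(static, N / NTHREADS) default(none) shared(a, owner)
    for (i = 0; i < N; i++) {
        a[i] = 0;
        owner[i] = omp_get_thread_num();
    }
}
```

If the runtime provides fewer threads than requested, the chunks simply wrap around, so every element is still written exactly once but a thread may own more than one block.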
#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you are actually trying to distribute the work by hand. The reason it does not work is most probably race conditions on the variables that control the for loop.
If I recall correctly, any variables declared outside of the parallel region are shared among threads. So ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better way is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
#pragma omp for
for (i = 0; i < N; i++) {
a[i] = 0;
printf("%d %d\n", tid, i);
}
}
Note that private(tid) makes a local copy of tid for each thread, so they do not overwrite each other's result from omp_get_thread_num(). It is also possible to declare shared(a), because we want every thread to work on the same copy of the array; this is implicit now. I believe loop iterators should be private, and I think the for pragma takes care of that, though I'm not 100% sure how it works in this specific case, where i is declared outside the parallel region. But I'm sure you can actually set it to shared by hand and mess it up.
EDIT: I noticed original underlying problem so I took out irrelevant parts.
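For completeness, the hand-rolled decomposition from the question can be made race-free by declaring the per-thread variables inside the parallel region, which makes them private automatically. This is only a sketch: `fill_by_hand` and `owner` are hypothetical names, the slice bounds are one possible way to split the range, and the serial fallbacks exist only so the code builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Each thread zeroes its own contiguous slice of a[0..n-1] and
   records its id in owner[]. */
void fill_by_hand(int *a, int *owner, int n)
{
    #pragma omp parallel
    {
        /* Declared inside the region, so every thread has its own copy. */
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();  /* actual team size */
        int begin = (int)((long long)tid * n / nthreads);
        int end = (int)((long long)(tid + 1) * n / nthreads);

        for (int i = begin; i < end; i++) {
            a[i] = 0;
            owner[i] = tid;
        }
    }
}
```

Querying omp_get_num_threads() inside the region, instead of hard-coding 8, keeps the slices correct even when the runtime starts a different number of threads.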

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently been reading up about OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to help me get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> Open MP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a quad-core processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming: once you execute things side by side, you won't be able to guarantee order unless you synchronize the access (at which point you might as well leave out the parallelization, as the code becomes effectively serial). If you need to rely on order, you can parallelize the calculations but need to do the writing in a single thread.
If the points themselves are messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
When you have a loop with a nested loop, there is no need for a second omp pragma.
The first one already parallelizes the outer loop. Remember that this is valid only if the inner loop has to be executed in sequence. You have a sequential increment, so you cannot execute the inner loop in arbitrary order. OMP pragmas are a very easy and cool way to parallelize code, but do not overuse them!
More details here -> Parallel Loops with OpenMP
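The advice in these answers can be sketched as follows: parallelize only the outer loop, declare the per-pixel variables inside it so they are private, derive the point index from i and j instead of a running counter, and do the ordered output in a separate serial pass. The code is a simplified illustration, not the asker's program: `mark_points`, `count_kept`, `keep` and the flat row-major layout are all assumptions made for the example.

```c
/* Mark which pixels pass the threshold.
   data and keep are row-major arrays of nrows * ncols ints. */
void mark_points(const int *data, int nrows, int ncols, int threshold, int *keep)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < nrows; i++) {
        /* Everything declared in here is private to the thread running row i. */
        const int *row = data + i * ncols;
        int j;
        for (j = 0; j < ncols; j++) {
            int k = i * ncols + j;  /* derived from i and j, no shared counter */
            keep[k] = row[j] < threshold;
        }
    }
}

/* Serial pass: visit the kept points in order. A real program would
   call fprintf here, from this single thread, to get ordered output. */
int count_kept(const int *keep, int n)
{
    int count = 0;
    for (int k = 0; k < n; k++)
        if (keep[k])
            count++;
    return count;
}
```

Splitting the work this way keeps the expensive per-pixel part parallel while the file is written by one thread, so the points come out in their original order.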

unbalanced nested for loops in openmp

I've been trying to parallelize an algorithm with unbalanced nested for loops using OpenMP. I can't post the original code as it's a secret project of an unheard government but here's a toy example:
for (i = 0; i < 100; i++) {
#pragma omp parallel for private(j, k)
for (j = 0; j < 1000000; j++) {
for (k = 0; k < 2; k++) {
temp = i * j * k; /* dummy operation (don't mind the race) */
}
if (i % 2 == 0) temp = 0; /* so I can't use openmp collapse */
}
}
Currently this example runs slower with multiple threads (~1 sec with a single thread, ~2.4 sec with 2 threads, etc.).
Things to note:
Outer for loop needs to be done in order (dependent on the previous step) (As far as I know, OpenMP handles inner loops well so threads don't get created/destroyed at each step, right?)
Typical index numbers are given in the example (100, 1000000, 2)
Dummy operation consists of just a few operations
There are some conditional operations outside the inner most loop so collapse is not an option (doesn't seem like it would increase the performance anyways)
Looks like an embarrassingly parallel algorithm but I can't seem to get any speedups for the last two days. What would be the best strategy here?
Unfortunately this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that, besides i, temp is also a shared automatic variable, I will assume that for the rest of this text. It also tells me that you have a pre-Nehalem CPU...
There are two sources of slowdown here: code transformation and cache coherency.
The way parallel regions are implemented is that their code is extracted into separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similar to this:
typedef struct {
int i;
int temp;
} main_omp_fn_0_shared_vars;
void main_omp_fn_0 (void *data) {
main_omp_fn_0_shared_vars *vars = data;
// compute values of j_min and j_max for this thread
for (j = j_min; j < j_max; j++) {
for (k = 0; k < 2; k++) {
vars->temp = vars->i * j * k;
}
if (vars->i % 2 == 0) vars->temp = 0;
}
}
int main (void) {
int i, temp;
main_omp_fn_0_shared_vars vars;
for (i = 0; i < 100; i++)
{
vars.i = i;
vars.temp = temp;
// This is how GCC implements parallel regions with libgomp
// Start main_omp_fn_0 in the other threads
GOMP_parallel_start(main_omp_fn_0, &vars, 0);
// Start main_omp_fn_0 in the main thread
main_omp_fn_0(&vars);
// Wait for other threads to finish (implicit barrier)
GOMP_parallel_end();
i = vars.i;
temp = vars.temp;
}
}
You pay a small penalty for accessing temp and i this way as their intermediate values cannot be stored in registers but are loaded and stored each time.
The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line and although vars.i is only read from and vars.temp is only written to, full cache invalidation is likely to occur at each iteration of the inner loop.
Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections and performance degradation is well expected in that case.
Think of the overheads:
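Since your outer loop needs to be in order, you're starting a team of x threads to perform the work in the inner loop, joining them, then starting them again... and so on 100 times. (Many runtimes reuse pooled threads rather than destroying them, but every fork and join still has a cost.)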
You have to wait until the longest task within the inner loop completes its work before performing the next step in the outer loop, so essentially this is a synchronization overhead. The tasks don't look irregular, but if the work to perform is small there's only so much speedup you can get out of this.
You have the cost of thread creation here, and allocating the private variables.
If the work inside the inner loop is small the benefits of parallelising this loop might not necessarily outweigh the cost of the parallelisation overheads above, hence you end up with a slowdown.
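One way to reduce both kinds of overhead described above is to open the parallel region once, outside the sequential outer loop, and keep temp private to each thread. The sketch below is an assumption-laden variant of the toy example, not a measured fix: `sweep` is a hypothetical name, and the reduction into `total` is added only so the result can be observed. The implicit barrier at the end of each omp for keeps the outer steps in order.

```c
/* Toy version of the question's loop nest with one parallel region.
   Returns the sum of the final temp value of every (i, j) pair. */
long long sweep(int outer, int inner)
{
    long long total = 0;

    #pragma omp parallel
    {
        /* Every thread runs the outer loop; i is private to each thread. */
        for (int i = 0; i < outer; i++) {
            #pragma omp for reduction(+:total)
            for (int j = 0; j < inner; j++) {
                long long temp = 0;  /* private: can live in a register */
                for (int k = 0; k < 2; k++)
                    temp = (long long)i * j * k;
                if (i % 2 == 0)
                    temp = 0;
                total += temp;
            }
            /* Implicit barrier here: step i finishes before step i + 1 starts. */
        }
    }
    return total;
}
```

Whether this is actually faster depends on how much work each inner iteration does; with only a few operations per iteration, the per-step barrier may still dominate.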

Resources