I've written a serial method that involves four nested for loops - I'd like to parallelize this method using OpenACC (this is the first time I've tried using it and I'm not very familiar with all the directives).
I tried the following but got this error: call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
I've pasted a simplified pseudocode version of my method below. I'd really appreciate help figuring out the best way to parallelize this structure of four nested loops.
// a, b, and c are input arguments to this method
#pragma acc parallel
for(int j = 0; j < a; j++){
for(int i = 0; i < b; i++){
// computing mins and maxs based on formulas with i, j, a, b, and c
int minX = ...
int maxX = ...
int minY = ...
int maxY = ...
double count = (maxX - minX + 1)*(maxY - minY + 1);
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
#pragma acc loop
for (int y = minY; y < maxY; y++) {
for (int x = minX; x < maxX; x++) {
#pragma acc routine(function_call_name) seq
sum1 += // some function call;
sum2 += // some function call;
sum3 += // some function call;
}
}
int result1 = (int)(sum1/count);
int result2 = (int)(sum2/count);
int result3 = (int)(sum3/count);
#pragma acc routine(function_call_name) seq
// calling some function call to store result1, result2, result3 in the output
}
}
An "illegal address" means that your program is accessing a bad address on the GPU. Typically this is caused by an out-of-bounds access, by accessing a host address on the device, or by using an aggregate data structure with dynamic data members without "attaching" the members (i.e. setting the device pointers in the parent structure). Less common causes are heap or stack overflows.
How are you managing your data? Data regions elsewhere in your code?
If using PGI, try first targeting a multicore CPU (-ta=multicore) so you don't need to worry about data movement. Once you have the parallel regions working, you can then go back to using the GPU and work on the data movement. I'd recommend you start by using CUDA Unified Memory (-ta=tesla:managed) so the CUDA driver handles the data movement for you (dynamic data only). Then once this is working, try adding data regions to manually manage the data.
Other things I see:
The parallel construct needs a loop directive on the outer loops.
#pragma acc parallel loop
for(int j = 0; j < a; j++){
for(int i = 0; i < b; i++){
You may consider collapsing the loops depending on the loop trip count:
#pragma acc parallel loop collapse(2)
for(int j = 0; j < a; j++){
for(int i = 0; i < b; i++){
Also, the "routine" directive should decorate the routine's prototype or definition, but shouldn't be used inside a compute region.
If you are using any global variables in your device routines, be sure to put them into "declare" directives so a global copy of the data is created on the device.
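As a rough illustration of these two points, here is a minimal sketch. The names coeff, weight and smooth are made up for this example and are not taken from your code. The routine directive sits on the function definition, and the global array that the routine reads is listed in a declare directive so that a device copy exists:

```c
#include <assert.h>

/* Hypothetical global table read by the device routine below. */
static float coeff[3] = {0.25f, 0.5f, 0.25f};
#pragma acc declare copyin(coeff)

/* "routine" decorates the definition, not the call site. */
#pragma acc routine seq
static float weight(int k)
{
    return coeff[k];
}

/* 3-point weighted average with indices clamped at the borders. */
void smooth(const float *in, float *out, int n)
{
    assert(n > 0);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int k = -1; k <= 1; k++) {
            int idx = i + k;
            if (idx < 0) idx = 0;          /* clamp on the left  */
            if (idx > n - 1) idx = n - 1;  /* clamp on the right */
            acc += weight(k + 1) * in[idx];
        }
        out[i] = acc;
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the function simply runs serially, which is a convenient way to check the results first.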
If you are using PGI, add the "-Minfo=accel" option to the compiler. This will give you the compiler feedback messages on how the compiler is parallelizing your code.
If you aren't using data directives, the compiler will need to implicitly copy the data. The messages will tell you what arrays are being copied along with the size being copied.
If you have trouble understanding the feedback messages, post the output from your compilation and I'll help you walk through them.
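Putting the suggestions above together, the loop nest could look roughly like the sketch below. Everything specific in it is an assumption on my part: sample() stands in for your "some function call", the window is a clipped square of radius r, the bounds are inclusive so that they match the count formula, and the explicit copyin/copyout clauses are just one way to manage the data.

```c
#include <assert.h>

/* Hypothetical stand-in for "some function call", declared as a device routine. */
#pragma acc routine seq
static int sample(const int *img, int width, int x, int y)
{
    return img[y * width + x];
}

/* Mean of a (2r+1)x(2r+1) window, clipped at the image borders. */
void window_mean(const int *img, int *out, int width, int height, int r)
{
    assert(width > 0 && height > 0 && r >= 0);
    #pragma acc parallel loop collapse(2) copyin(img[0:width*height]) copyout(out[0:width*height])
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            int minX = (i - r < 0) ? 0 : i - r;
            int maxX = (i + r > width - 1) ? width - 1 : i + r;
            int minY = (j - r < 0) ? 0 : j - r;
            int maxY = (j + r > height - 1) ? height - 1 : j + r;
            int count = (maxX - minX + 1) * (maxY - minY + 1);
            int sum = 0;
            /* Each (j, i) pair is one unit of parallel work;
               the window loops run sequentially inside it. */
            #pragma acc loop seq
            for (int y = minY; y <= maxY; y++) {
                #pragma acc loop seq
                for (int x = minX; x <= maxX; x++) {
                    sum += sample(img, width, x, y);
                }
            }
            out[j * width + i] = sum / count;
        }
    }
}
```

If the window loops are long and the outer trip counts are small, a reduction on the inner loops may be worth trying instead of seq. The -Minfo=accel feedback should show what the compiler actually did.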
Related
I am trying to do a reduction on multiple variables (an array) using OMP, but wasn't sure how to implement it with OMP. See the code below.
#pragma omp parallel for reduction( ??? )
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
[ compute value ... ]
y[j] += value
}
}
I thought I could do something like this, with the atomic keyword, but realised this would prevent two threads from updating y at the same time even if they are updating different values.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
[ compute value ... ]
#pragma omp atomic
y[j] += value
}
}
Does OMP have any functionality for something like this or otherwise how would I achieve this optimally without OMP's reduction keyword?
There is an array reduction available in OpenMP since version 4.5:
#pragma omp parallel for reduction(+:y[:m])
where m is the size of the array. One limitation is that the private copies of the array used in the reduction are typically allocated on each thread's stack, so this may not be usable for large arrays.
The atomic operation you mentioned should work fine, but it may be less efficient than reduction. Of course, it depends on the actual circumstances (e.g. actual value of n and m, time to compute value, false sharing, etc.).
#pragma omp atomic
y[j] += value
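For reference, here is a minimal self-contained sketch of the array-section reduction, assuming a compiler that supports OpenMP 4.5 or later. The function name column_sums and the row-major layout of a are illustrative choices, not something from the question:

```c
#include <assert.h>

/* Sums each column of an n x m row-major matrix a into y[0..m-1]. */
void column_sums(const double *a, int n, int m, double *y)
{
    assert(n >= 0 && m > 0);
    for (int j = 0; j < m; j++)
        y[j] = 0.0;

    /* Each thread accumulates into its own private copy of y[0..m-1];
       the copies are added into the shared y at the end of the loop. */
    #pragma omp parallel for reduction(+:y[:m])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            y[j] += a[i * m + j];
        }
    }
}
```

Compiled without OpenMP, the pragma is ignored and the loop runs serially with the same result, which makes it easy to compare against the atomic version.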
I have a problem with my code; it should print the number of appearances of a certain number.
I want to parallelize this code with OpenMP, and I tried to use a reduction for arrays, but it obviously didn't work as I wanted.
The error is: "segmentation fault". Should some variables be private? Or is the problem the way I'm trying to use the reduction?
I think each thread should count some part of array, and then merge it somehow.
#pragma omp parallel for reduction (+: reasult[:i])
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
if ( numbers[j] == i){
result[i]++;
}
}
}
Here N is a big number telling how many numbers I have, numbers is the array of all numbers, and result is the array with the count of each number.
First, you have a typo in the name:
#pragma omp parallel for reduction (+: reasult[:i])
It should actually be "result", not "reasult".
Nonetheless, why are you sectioning the array with result[:i]? Based on your code, it seems that you wanted to reduce the entire array, namely:
#pragma omp parallel for reduction (+: result)
for (i = 0; i < M; i++)
for(j = 0; j < N; j++)
if ( numbers[j] == i)
result[i]++;
When one's compiler does not support the OpenMP 4.5 array reduction feature one can alternatively explicitly implement the reduction (check this SO thread to see how).
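As a hedged sketch of what such an explicit reduction might look like (this is my own illustration, not the code from the linked thread): each thread counts into a private heap buffer and the buffers are merged in a critical section. Note that it parallelises a single pass over numbers instead of the M x N double loop, and it assumes the values of interest lie in the range 0..M-1. The function name histogram is made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how often each value 0..m-1 occurs in numbers[0..n-1]. */
void histogram(const int *numbers, int n, int *result, int m)
{
    assert(n >= 0 && m > 0);
    for (int i = 0; i < m; i++)
        result[i] = 0;

    #pragma omp parallel
    {
        /* Private per-thread counters, on the heap rather than the stack. */
        int *local = calloc((size_t)m, sizeof *local);
        if (local == NULL)
            abort();  /* sketch only: no graceful recovery */

        #pragma omp for nowait
        for (int j = 0; j < n; j++) {
            int v = numbers[j];
            if (v >= 0 && v < m)
                local[v]++;
        }

        /* Merge step: one thread at a time adds its counters to result. */
        #pragma omp critical
        for (int i = 0; i < m; i++)
            result[i] += local[i];

        free(local);
    }
}
```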
As pointed out by @Hristo Iliev in the comments:
Provided that M * sizeof(result[0]) / #threads is a multiple of the
cache line size, and even if it isn't when the value of M is large
enough, there is absolutely no need to involve reduction in the
process. Unless the program is running on a NUMA system, that is.
Assuming that the aforementioned conditions are met, look carefully at how the iterations of the outermost loop (i.e., over variable i) are assigned to the threads: since i is also the index used to access the result array, each thread updates a different position of result. Therefore, you can simplify your code to:
#pragma omp parallel for
for (i = 0; i < M; i++)
for(j = 0; j < N; j++)
if ( numbers[j] == i)
result[i]++;
I have a simple for-loop that calculates an array[n] depending on the
corresponding row at an array X[n][d].
array *function(X, n, d){
double *array = calloc(n,sizeof(double));
//#pragma omp parallel
{
//#pragma omp parallel for if(n>15000)
for( i=0 ; i<n-1; i++)
{
//#pragma omp parallel for shared(X,j, i) reduction(+: sum)
//#pragma omp parallel for if(d>100) reduction(+:distances[:n]) private(j)
for ( j=0; j< d; j++)
{
array[i] += (pow(X[(j+1)*n-1]-X[j*n+i], 2));
}
array[i] = sqrt(array[i]);
}
}
return array;
}
Consider n to be as high as n=100000, while d can have a predefined value from d=2 to d=100. The function() is called multiple times (2^k) at each kth iteration. So the pattern is: at the first iteration it is called once, at the second iteration it is called twice, at the third iteration it is called four times, etc. Also, n diminishes by one in every iteration (n-=1).
I have tried different combinations of the openmp directives that I have put as comments in the sample code but no matter what I have tried, the code performs equally or better without the openmp directives.
What are some good ways/techniques to improve the time performance of the above loop using openmp?
It is hard to tell without something to test it, but I would try something like this:
double* function( double* X, int n, int d ) {
double *array = malloc( n * sizeof( double ) );
#pragma omp parallel for schedule( static )
for( int i = 0 ; i < n - 1; i++ ) {
double sum = 0;
for( int j = 0; j < d; j++ ) {
double dist = X[( j + 1 ) * n - 1] - X[j * n + i];
sum += dist * dist;
}
array[i] = sqrt( sum );
}
return array;
}
I'm not sure it will be any more effective than your code, but it has a few improvements which should have an impact on performance:
To avoid initializing the array to 0 in the sequential part and also allow for hopefully better optimization from the compiler, I replaced the calloc() by a plain malloc() and used a local variable sum for accumulating the partial sums.
I've put the parallelization pragma on the outermost loop I could, to maximize parallelism.
I've used a local dist variable to store the temporary distance between the 2 current values of X, and just multiplied it by itself, avoiding a costly call to the much more complex pow() function.
Now, depending on the result you get from that, you could consider adding the same sort of if( n > NMIN ) statement as you had initially on the #pragma omp parallel for directive. The value for this NMIN would be for you to determine according to your measured performance.
Another possible direction for optimization would be to place the parallel directive outside of the function. That would spare you numerous thread starts/stops. However, you'd have to add a #pragma omp single before the call to malloc() and remove the parallel from the existing directive.
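A rough sketch of that last idea, under a few assumptions of mine: the function is called by every thread of a parallel region opened by the caller, copyprivate is added to the single so that all threads receive the allocated pointer (a plain single would leave the other threads' local pointer unset), and the square root is left out so the sketch stays short, so it returns squared distances. The name squared_distances is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Intended to be called by every thread of an enclosing parallel region. */
double *squared_distances(const double *X, int n, int d)
{
    assert(n > 0 && d > 0);
    double *array = NULL;

    /* One thread allocates; copyprivate broadcasts the pointer to the others. */
    #pragma omp single copyprivate(array)
    {
        array = malloc((size_t)n * sizeof *array);
    }
    if (array == NULL)
        return NULL;  /* same value in every thread, so all of them return */

    /* Orphaned worksharing loop: there is no "parallel" here,
       it binds to the caller's parallel region. */
    #pragma omp for schedule(static)
    for (int i = 0; i < n - 1; i++) {
        double sum = 0.0;
        for (int j = 0; j < d; j++) {
            double dist = X[(j + 1) * n - 1] - X[j * n + i];
            sum += dist * dist;
        }
        array[i] = sum;  /* entry n-1 is left unset, as in the code above */
    }
    /* Implicit barrier at the end of the loop: entries 0..n-2 are ready. */
    return array;
}
```

The caller would open one #pragma omp parallel around the whole sequence of calls. Every thread gets the same pointer back, so only one of them should free it. Called outside any parallel region, the function just runs serially.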
I'm trying to vectorize an old matrix multiplication program I made, specifically this function using a parallel for call in openmp. I keep getting this error:
matrix_multiply.c(26): error: invalid entity for this variable list in omp clause
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
Any help would be much appreciated as I've tried looking up the error and can't find any documentation that was helpful. I'm compiling using ICC if that makes a difference.
void matrix_mult(int * matrix_A, int * matrix_B, int n)
{
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
int sum = 0;
for (int k = 0; k<n; k++)
{
int index_a = i * n +k;
int index_b = j + k * n;
sum += matrix_A[index_a] * matrix_A[index_b];
}
matrix_B[i * n + j] = sum;
}
}
}
There are two things worth mentioning here:
What you're actually doing here isn't vectorizing (although your compiler might be doing it for you), it is parallelizing. Here, you're creating threads to split the work among. Each thread may or may not use the CPU's vector units to speed the computations up even more, but that has nothing to do with the parallelization directives you've put.
The error the compiler reports only says that it doesn't know the variables you've listed in your private clause. Indeed, if you look closer, none of i, j, k and sum has been declared before the directive line. So for the compiler, they don't exist (yet). As a matter of fact, since you only declare them when you need them (which is very good), which is inside the parallel region, you don't have to declare them private anyway, since they are already private to the thread where they are created. So just removing the private clause should fix your issue.
Finally, if performance matters to you, rather than trying to parallelize or vectorize this code, just consider replacing it with an efficient library call that will do it for you. Unfortunately, since you're dealing with integers, BLAS won't do. But I'm sure there are good options out there for that.
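For completeness, here is the function with the private clause simply removed. This is a sketch: I kept the indexing from the question, which reads matrix_A for both operands and therefore stores A*A in matrix_B, and I added const and an assert on my own.

```c
#include <assert.h>

/* Stores A*A in matrix_B, following the indexing used in the question. */
void matrix_mult(const int *matrix_A, int *matrix_B, int n)
{
    assert(n > 0);
    /* i, j, k and sum are declared inside the parallel region,
       so each thread automatically gets its own copies. */
    #pragma omp parallel for schedule(static) default(shared)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++) {
                sum += matrix_A[i * n + k] * matrix_A[k * n + j];
            }
            matrix_B[i * n + j] = sum;
        }
    }
}
```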
I'm modifying an existing library from single-threaded to multi-threaded. I have code like the one provided below. I can't understand how to declare the arrays x, y, array1 and array2. Which of them should I declare as shared or threadprivate? Do I need to use flush? If yes, in which case?
//global variables
static int array1[100000];
static int array2[100000];
//part of program code from one of function.
int i
int x[1000000];
int y[1000000];
#pragma omp parallel for
for(i=0, i<100; i++)
{
y[i] = i*i-3*i-10*random();
x[i] = myfunc(i, y[i])
}
//additional function
int myfunc(j, z)
int j,
int z[]
{
array1[array2[j]] += z[j]+j;
return array1[j];
}
The problem I see in your code is in this line
array1[array2[j]] += z[j]+j;
This means that array1 can potentially be modified by whichever j index. And j in the context of the function myfunc() corresponds to index i at the upper level. The trouble is that i is the index upon which the loop is parallelised, therefore, this means that array1 can be modified concurrently at any moment by any thread.
The crucial question now is to know if array2 can have the same value for different indexes:
If you are sure that for whatever j1 != j2 you have array2[j1] != array2[j2], then your code is trivially parallelisable.
If there are values j1 != j2 for which you have array2[j1] == array2[j2], then you have dependencies across iterations for array1 and the code is no longer (simply and/or effectively) parallelisable.
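If you are unsure which of the two cases applies, a quick serial check along these lines could settle it. This is only a sketch: indices_are_distinct and its range parameter are names I made up, with range standing for the size of array1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if idx[0..n-1] are pairwise distinct and all lie in [0, range). */
bool indices_are_distinct(const int *idx, int n, int range)
{
    assert(n >= 0 && range > 0);
    unsigned char *seen = calloc((size_t)range, 1);
    if (seen == NULL)
        return false;  /* conservative answer if the check cannot run */

    bool distinct = true;
    for (int i = 0; i < n; i++) {
        int v = idx[i];
        if (v < 0 || v >= range || seen[v]) {
            distinct = false;  /* out of range, or a value seen before */
            break;
        }
        seen[v] = 1;
    }
    free(seen);
    return distinct;
}
```

Calling it on array2 over the index range of the parallel loop, before entering that loop, tells you whether the simple version is safe.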
So let's assume we are in the former case, then the OpenMP directives you have already in the code are sufficient:
i needs to be private but is implicitly already so as it is the index of the parallelised loop;
x and y should be shared (which they are by default) since their access index is the one that is distributed in parallel (namely i) so their parallel updates do not overlap;
array2 is only accessed in read mode so it's a no-brainer: shared (which it is by default again);
array1 is read and written, but due to our initial assumption, there are no possible collisions between threads as their sets of indexes to access it are disjoint. Therefore, the default shared qualifier works just fine.
But now, if we are in the case where array2 allows for non-disjoint sets of indexes for accessing array1, we will have to preserve the ordering of these accesses / updates of array1. This can be done with the ordered clause / directive. And since we still want the parallelisation to be (somewhat) effective, we will have to add a schedule(static,1) clause to the parallel directive. For more details about this, please refer to this great answer. Your code would now look like this:
//global variables
static int array1[100000];
static int array2[100000];
//part of program code from one of function.
int i;
int x[1000000];
int y[1000000];
#pragma omp parallel for schedule(static,1) ordered
for(i=0; i<100; i++)
{
y[i] = i*i-3*i-10*random();
x[i] = myfunc(i, y[i]);
}
//additional function
int myfunc(j, z)
int j,
int z[]
{
int tmp = z[j]+j;
#pragma omp ordered
array1[array2[j]] += tmp;
return array1[j];
}
This would (I think) work and, in terms of parallelism, not be too bad (for a limited number of threads), but it has a big (enormous) flaw: it generates tons of false sharing while updating x and y. Therefore, it might be more advantageous to use some per-thread copies of these and to only update the global arrays at the end. The central part of the code snippet would then look something like this (not tested at all):
int nbth;
#pragma omp parallel
#pragma omp single
nbth = omp_get_num_threads();
// calloc() so that the slots a thread never writes contribute 0 to the sums below
int *xm = calloc((size_t)1000000*nbth, sizeof(int));
int *ym = calloc((size_t)1000000*nbth, sizeof(int));
#pragma omp parallel
{
int tid = omp_get_thread_num();
// per-thread slices of the big buffers
int *xx = xm+1000000*tid;
int *yy = ym+1000000*tid;
#pragma omp for schedule(static,1) ordered
for(i=0; i<100; i++)
{
yy[i] = i*i-3*i-10*random();
xx[i] = myfunc(i, yy[i]);
}
// merge the per-thread copies into the global arrays
#pragma omp for
for (i=0; i<100; i++)
{
int j;
x[i] = 0;
y[i] = 0;
for (j=0; j<nbth; j++)
{
x[i] += xm[j*1000000+i];
y[i] += ym[j*1000000+i];
}
}
}
free(xm);
free(ym);
This will avoid the false sharing, but will increase the number of memory accesses and the overhead of parallelisation. So it might not be very beneficial after all. You'll have to see it for yourself in your actual code.
BTW, the fact that i only loops until 100 looks suspicious to me when the corresponding arrays are declared to be 1000000 long. If 100 is truly the correct size for the loop, then probably the parallelisation isn't worth it anyway...
EDIT:
As Jim Cownie pointed out in a comment, I missed the call to random() as a source of dependency across iterations, which prevents proper parallelisation. I'm not sure how relevant this is in the context of your actual code (I doubt you truly fill your y array with random data), but in case you do, you'll have to change this part in order to do it in parallel (otherwise, the serialisation needed to generate the random number series will kill whatever gain you get from parallelisation). But generating non-correlated pseudo-random series in parallel is not as simple as it sounds. You can use rand_r() instead of random() as a thread-safe alternative for the RNG and initialise its seed per thread to different values. However, you cannot be sure that one thread's series won't collide with another thread's too soon (with a thread starting to generate the very same series as another one after a while, messing up your expected asymptotic behaviour).
As I'm pretty sure you're not truly interested in that, I won't develop any further (this is a whole question all by itself), but I will just use the (not so good) rand_r() trick. If you want more details on a possible alternative for generating good parallel random series, just ask another question.
In the case where array2 causes no problem (disjoint sets of indexes), the code would become:
// global variable
unsigned int seed;
#pragma omp threadprivate(seed)
// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread
// then the parallelised loop
#pragma omp parallel for
for(i=0; i<100; i++)
{
y[i] = i*i-3*i-10*rand_r(&seed);
x[i] = myfunc(i, y[i]);
}
Then the other case would have to use the same trick in addition to what has already been described. But again, keep in mind that this isn't good enough for serious RNG-based computation (like Monte-Carlo methods). It does the job if all you want is to generate some values for testing purposes, but it won't pass any serious statistical quality test.