I have a problem with my code, it should print number of appearances of a certain number.
I want parallelize this code with OpenMP, and I tried to use reduction for arrays but it's obviously didn't working as I wanted.
The error is: "segmentation fault". Should some variables be private? or it's the problem with the way I'm trying to use the reduction?
I think each thread should count some part of array, and then merge it somehow.
#pragma omp parallel for reduction (+: reasult[:i])
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
if ( numbers[j] == i){
result[i]++;
}
}
}
Where N is big number telling how many numbers I have. Numbers is array of all numbers and result array with sum of each number.
First you have a typo on the name
#pragma omp parallel for reduction (+: reasult[:i])
should actually be "result" not "reasult"
Nonetheless, why are you section the array with result[:i]? Based on your code, it seems that you wanted to reduce the entire array, namely:
#pragma omp parallel for reduction (+: result)
for (i = 0; i < M; i++)
for(j = 0; j < N; j++)
if ( numbers[j] == i)
result[i]++;
When one's compiler does not support the OpenMP 4.5 array reduction feature one can alternatively explicitly implement the reduction (check this SO thread to see how).
As pointed out by #Hristo Iliev in the comments:
Provided that M * sizeof(result[0]) / #threads is a multiple of the
cache line size, and even if it isn't when the value of M is large
enough, there is absolutely no need to involve reduction in the
process. Unless the program is running on a NUMA system, that is.
Assuming that the aforementioned conditions are met, and if you analyze carefully the outermost loop iterations (i.e., variable i) are assigned to the threads, and since the variable i is used to access the result array, each thread will be updating a different position of the result array. Therefore, you can simplified your code to:
#pragma omp parallel for
for (i = 0; i < M; i++)
for(j = 0; j < N; j++)
if ( numbers[j] == i)
result[i]++;
Related
I am trying to do a reduction on multiple variables (an array) using OMP, but wasn't sure how to implement it with OMP. See the code below.
#pramga omp parallel for reduction( ??? )
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
[ compute value ... ]
y[j] += value
}
}
I thought I could do something like this, with the atomic keyword, but realised this would prevent two threads from updating y at the same time even if they are updating different values.
#pramga omp parallel for
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
[ compute value ... ]
#pragma omp atomic
y[j] += value
}
}
Does OMP have any functionality for something like this or otherwise how would I achieve this optimally without OMP's reduction keyword?
There is an array reduction available in OpenMP since version 4.5:
#pramga omp parallel for reduction(+:y[:m])
where m is the size of the array. The only limitation here is that the local array used in reduction is always reserved on the stack, so it cannot be used in the case of large arrays.
The atomic operation you mentioned should work fine, but it may be less efficient than reduction. Of course, it depends on the actual circumstances (e.g. actual value of n and m, time to compute value, false sharing, etc.).
#pragma omp atomic
y[j] += value
I'm trying to vectorize an old matrix multiplication program I made, specifically this function using a parallel for call in openmp. I keep getting this error:
matrix_multiply.c(26): error: invalid entity for this variable list in omp clause
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
Any help would be much appreciated as I've tried looking up the error and can't find any documentation that was helpful. I'm compiling using ICC if that makes a difference.
void matrix_mult(int * matrix_A, int * matrix_B, int n)
{
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
int sum = 0;
for (int k = 0; k<n; k++)
{
int index_a = i * n +k;
int index_b = j + k * n;
sum += matrix_A[index_a] * matrix_A[index_b];
}
matrix_B[i * n + j] = sum;
}
}
}
There're two things worth mentioning here:
What you're actually doing here isn't vectorizing (although your compiler might be doing it for you), it is parallelizing. Here, you're creating threads to split the work among. Each thread may or may not use the CPU's vector units to spreed the computations up even more, but it has nothing to do with the parallelization directives you've put.
The error the compiler reports only says that it doesn't known the variables you've listed in your private directive. Indeed, if you look closer, neither of i, j, k, and sum have been declared before the directive line. So for the compiler, they don't exist (yet). As a matter of fact, since you only declare them when you need them (which is very good), which is inside the parallel region, you don't have to declare them privateanyway since they already are private to the the thread where they are created. So just removing the private clause should fix your issue.
Finally, if performance matters to you, rather than trying to parallelize or vectorize this code, just consider replacing it by an effective library call that will do it for you. Unfortunately, since you're dealing with integers, BLAS won't do. But I'm sure there are good options out there for that.
I am trying to use OpenMP to add the numbers in an array. The following is my code:
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i=0;i<snum;i++){
input[i] = i+1;
}
#pragma omp parallel for schedule(static)
for(i=0;i<snum;i++)
{
int* tmpsum = input+i;
sum += *tmpsum;
}
This does not produce the right result for sum. What's wrong?
Your code currently has a race condition, which is why the result is incorrect. To illustrate why this is, let's use a simple example:
You are running on 2 threads and the array is int input[4] = {1, 2, 3, 4};. You initialize sum to 0 correctly and are ready to start the loop. In the first iteration of your loop, thread 0 and thread 1 read sum from memory as 0, and then add their respective element to sum, and write it back to memory. However, this means that thread 0 is trying to write sum = 1 to memory (the first element is 1, and sum = 0 + 1 = 1), while thread 1 is trying to write sum = 2 to memory (the second element is 2, and sum = 0 + 2 = 2). The end result of this code depends on which one of the threads finishes last, and therefore writes to memory last, which is a race condition. Not only that, but in this particular case, neither of the answers that the code could produce are correct! There are several ways to get around this; I'll detail three basic ones below:
#pragma omp critical:
In OpenMP, there is what is called a critical directive. This restricts the code so that only one thread can do something at a time. For example, your for-loop can be written:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
#pragma omp critical
sum += *tmpsum;
}
This eliminates the race condition as only one thread accesses and writes to sum at a time. However, the critical directive is very very bad for performance, and will likely kill a large portion (if not all) of the gains you get from using OpenMP in the first place.
#pragma omp atomic:
The atomic directive is very similar to the critical directive. The major difference is that, while the critical directive applies to anything that you would like to do one thread at a time, the atomic directive only applies to memory read/write operations. As all we are doing in this code example is reading and writing to sum, this directive will work perfectly:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
#pragma omp atomic
sum += *tmpsum;
}
The performance of atomic is generally significantly better than that of critical. However, it is still not the best option in your particular case.
reduction:
The method you should use, and the method that has already been suggested by others, is reduction. You can do this by changing the for-loop to:
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
sum += *tmpsum;
}
The reduction command tells OpenMP that, while the loop is running, you want each thread to keep track of its own sum variable, and add them all up at the end of the loop. This is the most efficient method as your entire loop now runs in parallel, with the only overhead being right at the end of the loop, when the sum values of each of the threads need to be added up.
Use reduction clause (description at MSDN).
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i=0;i<snum;i++){
input[i] = i+1;
}
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i=0;i<snum;i++)
{
sum += input[i];
}
I am trying to use OpenMP to add the numbers in an array. The following is my code:
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i=0;i<snum;i++){
input[i] = i+1;
}
#pragma omp parallel for schedule(static)
for(i=0;i<snum;i++)
{
int* tmpsum = input+i;
sum += *tmpsum;
}
This does not produce the right result for sum. What's wrong?
Your code currently has a race condition, which is why the result is incorrect. To illustrate why this is, let's use a simple example:
You are running on 2 threads and the array is int input[4] = {1, 2, 3, 4};. You initialize sum to 0 correctly and are ready to start the loop. In the first iteration of your loop, thread 0 and thread 1 read sum from memory as 0, and then add their respective element to sum, and write it back to memory. However, this means that thread 0 is trying to write sum = 1 to memory (the first element is 1, and sum = 0 + 1 = 1), while thread 1 is trying to write sum = 2 to memory (the second element is 2, and sum = 0 + 2 = 2). The end result of this code depends on which one of the threads finishes last, and therefore writes to memory last, which is a race condition. Not only that, but in this particular case, neither of the answers that the code could produce are correct! There are several ways to get around this; I'll detail three basic ones below:
#pragma omp critical:
In OpenMP, there is what is called a critical directive. This restricts the code so that only one thread can do something at a time. For example, your for-loop can be written:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
#pragma omp critical
sum += *tmpsum;
}
This eliminates the race condition as only one thread accesses and writes to sum at a time. However, the critical directive is very very bad for performance, and will likely kill a large portion (if not all) of the gains you get from using OpenMP in the first place.
#pragma omp atomic:
The atomic directive is very similar to the critical directive. The major difference is that, while the critical directive applies to anything that you would like to do one thread at a time, the atomic directive only applies to memory read/write operations. As all we are doing in this code example is reading and writing to sum, this directive will work perfectly:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
#pragma omp atomic
sum += *tmpsum;
}
The performance of atomic is generally significantly better than that of critical. However, it is still not the best option in your particular case.
reduction:
The method you should use, and the method that has already been suggested by others, is reduction. You can do this by changing the for-loop to:
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i = 0; i < snum; i++) {
int *tmpsum = input + i;
sum += *tmpsum;
}
The reduction command tells OpenMP that, while the loop is running, you want each thread to keep track of its own sum variable, and add them all up at the end of the loop. This is the most efficient method as your entire loop now runs in parallel, with the only overhead being right at the end of the loop, when the sum values of each of the threads need to be added up.
Use reduction clause (description at MSDN).
int* input = (int*) malloc (sizeof(int)*snum);
int sum = 0;
int i;
for(i=0;i<snum;i++){
input[i] = i+1;
}
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i=0;i<snum;i++)
{
sum += input[i];
}
Actually I have two questions, the first one is, considering cache, which one of the following code is faster?
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[i][j]++;
}
}
or
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[j][i]++;
}
}
I am guessing the first one will be much faster since there are a lot less cache miss. And my question is if you are using OpenMP, what kind of technique will you use to optimise such a nested loop? My strategy is to divide the outer loop into 4 chunks and assign them among 4 cores, is there any better way (more cache friendly) to do it?
Thanks!
Bob
As maxihatop pointed out, the first one performs better because it has better cache locality.
Dividing the outer loop into chunks is a good strategy in the case like this, where the complexity of the task inside the loop is constant.
You might want to take a look at #pragma omp for schedule(static). This will evenly divide the iterations contiguously among threads. So your code should look like:
#pragma omp for schedule(static)
for (i = 0; i < 10000; i++) {
for(j = 0; j < 10000; j++){
a[i][j]++;
}
Lawrence Livermore National Laboratory provides a fantastic tutorial of OpenMP. You can find more information there.
https://computing.llnl.gov/tutorials/openMP/