I'm trying to implement matrix multiplication with OpenMP.
I found that the collapse clause is usually used for nested for loops, so I also used collapse as below (this code belongs to the feedforward pass of a neural network):
for (i = 0; i < net->num_layer-1; i++) {
    #pragma omp parallel for num_threads(thread) private(j) collapse(2)
    for (j = 0; j < net->mini_batch_size; j++) {
        for (k = 0; k < net->layer_size[i+1]; k++) {
            #pragma omp simd reduction(+:sum)
            for (l = 0; l < net->layer_size[i]; l++) {
                sum = sum + NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
            }
            ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
            NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
            sum = 0.0;
        }
    }
}
What I want to know is: is there any way to parallelize this without collapse? I also tried the version below, but its performance is far below my expectation.
omp_set_nested(1);
for (i = 0; i < net->num_layer-1; i++) {
    #pragma omp parallel for num_threads(net->mini_batch_size) private(j)
    for (j = 0; j < net->mini_batch_size; j++) {
        #pragma omp parallel for num_threads(net->layer_size[i+1]) private(j, k)
        for (k = 0; k < net->layer_size[i+1]; k++) {
            #pragma omp simd reduction(+:sum)
            for (l = 0; l < net->layer_size[i]; l++) {
                sum = sum + NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
            }
            ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
            NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
            sum = 0.0;
        }
    }
}
The thread variable is a constant 100 for now; it will be tuned later. As for performance, the collapse version takes 0.75 s and the version without collapse takes 25 s.
So, please let me know whether there is any way to parallelize the nested loop without collapse that performs better than the collapse version.
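For reference, a common alternative to both versions above is to parallelize only the outer mini-batch loop and leave the inner loops sequential within each thread, which avoids creating nested parallel regions altogether. The sketch below assumes the same NEURON, WEIGHT, ZS, BIAS, sigmoid and thread names as in the question and C99-style declarations; sum is assumed to be a double here, and the inner indices and the accumulator are declared inside the loop so each thread gets its own copies:
for (i = 0; i < net->num_layer-1; i++) {
    #pragma omp parallel for num_threads(thread)
    for (int j = 0; j < net->mini_batch_size; j++) {
        for (int k = 0; k < net->layer_size[i+1]; k++) {
            double sum = 0.0; /* per-iteration accumulator, private by construction */
            #pragma omp simd reduction(+:sum)
            for (int l = 0; l < net->layer_size[i]; l++) {
                sum += NEURON(net, i, j, l) * WEIGHT(net, i, l, k);
            }
            ZS(net, i+1, j, k) = sum + BIAS(net, i+1, k);
            NEURON(net, i+1, j, k) = sigmoid(ZS(net, i+1, j, k));
        }
    }
}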
I have a function that I want to parallelize. This is the serial version.
void parallelCSC_SpMV(float *x, float *b)
{
    int i, j;
    for (i = 0; i < numcols; i++)
    {
        for (j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
        {
            b[irem[j] - 1] += xrem[j]*x[i];
        }
    }
}
I figured a decent way to do this was to have each thread write to a private copy of the b array (which does not need to be protected as a critical section because it is a private copy); after a thread is done, it copies its results into the actual b array. Here is my code.
void parallelCSC_SpMV(float *x, float *b)
{
    int i, j, k;
    #pragma omp parallel private(i, j, k)
    {
        float* b_local = (float*)malloc(sizeof(b));
        #pragma omp for nowait
        for (i = 0; i < numcols; i++)
        {
            for (j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
            {
                float current_add = xrem[j]*x[i];
                int index = irem[j] - 1;
                b_local[index] += current_add;
            }
        }
        for (k = 0; k < sizeof(b) / sizeof(b[0]); k++)
        {
            // Separate question: Is this if statement allowed?
            //if (b_local[k] == 0) { continue; }
            #pragma omp atomic
            b[k] += b_local[k];
        }
    }
}
However, I get a segmentation fault as a result of the second for loop. I do not need a "#pragma omp for" on that loop because I want each thread to execute it fully. If I comment out the body of that loop, there is no segmentation fault. I am not sure what the issue is.
You're probably trying to access an out-of-range position in the dynamic array b_local.
Note that sizeof(b) returns the size in bytes of a float* (the size of a float pointer), not the size of the array it points to.
If you want to know the size of the array you are passing to the function, I suggest adding it to the function's parameters.
void parallelCSC_SpMV(float *x, float *b, int b_size){
    ...
    float* b_local = (float*) malloc(sizeof(float)*b_size);
    ...
}
And, if the size of colptrs is numcols, I would be careful with colptrs[i+1], since when i = numcols-1 you will have another out-of-range access.
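To put that suggestion into code, this is roughly how the function from the question might look once the size is passed in; it also allocates the private copy with calloc, since the += accumulation assumes b_local starts out zeroed (malloc does not guarantee that):
void parallelCSC_SpMV(float *x, float *b, int b_size)
{
    int i, j, k;
    #pragma omp parallel private(i, j, k)
    {
        /* zero-initialized private copy of b */
        float *b_local = (float*) calloc(b_size, sizeof(float));
        #pragma omp for nowait
        for (i = 0; i < numcols; i++)
        {
            for (j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
            {
                b_local[irem[j] - 1] += xrem[j] * x[i];
            }
        }
        /* merge this thread's private copy into the shared result */
        for (k = 0; k < b_size; k++)
        {
            #pragma omp atomic
            b[k] += b_local[k];
        }
        free(b_local);
    }
}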
First, as pointed out by Jim Cownie:
In all of these answers, b_local is uninitialised, yet you are adding
to it. You need to use calloc instead of malloc
Just to add to the accepted answer, I think you can try the following approach to avoid calling malloc inside the parallel region, and also the overhead of #pragma omp atomic.
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
    float* b_local[num_threads];
    for (int i = 0; i < num_threads; i++)
        b_local[i] = calloc(b_size, sizeof(float));
    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < numcols; i++){
            for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
                float current_add = xrem[j]*x[i];
                int index = irem[j] - 1;
                b_local[tid][index] += current_add;
            }
        }
    }
    for (int id = 0; id < num_threads; id++)
    {
        #pragma omp for simd
        for (int k = 0; k < b_size; k++)
        {
            b[k] += b_local[id][k];
        }
        free(b_local[id]);
    }
}
I have not tested the performance of this, so please feel free to do so and provide feedback.
You can optimize further by reusing the original b for the master thread instead of creating a b_local for it, as follows:
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
    float* b_local[num_threads-1];
    for (int i = 0; i < num_threads-1; i++)
        b_local[i] = calloc(b_size, sizeof(float));
    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        float *thread_b = (tid == 0) ? b : b_local[tid-1];
        #pragma omp for
        for (int i = 0; i < numcols; i++){
            for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++){
                float current_add = xrem[j]*x[i];
                int index = irem[j] - 1;
                thread_b[index] += current_add;
            }
        }
    }
    for (int id = 0; id < num_threads-1; id++)
    {
        #pragma omp for simd
        for (int k = 0; k < b_size; k++)
        {
            b[k] += b_local[id][k];
        }
        free(b_local[id]);
    }
}
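As a side note not covered by the answers above: if the compiler supports OpenMP 4.5 or newer, an array-section reduction can replace the hand-rolled per-thread buffers and the merge loop entirely, letting the runtime allocate, zero, and combine the private copies. A minimal sketch under that assumption, using the same globals (numcols, colptrs, irem, xrem); the function name is just for illustration:
void parallelCSC_SpMV_reduction(float *x, float *b, int b_size)
{
    /* requires OpenMP 4.5+ for reductions over array sections */
    #pragma omp parallel for reduction(+: b[:b_size])
    for (int i = 0; i < numcols; i++)
    {
        for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
        {
            b[irem[j] - 1] += xrem[j] * x[i];
        }
    }
}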
I was doing an assignment at my university that requires populating a [2000][2000] matrix and then computing, in parallel, the sum of all elements that are multiples of 5.
At first I tried it with a 5 x 5 matrix: each thread computes a partial sum (sumP) of its elements and then adds it to a variable called sum inside a critical region.
On my university computer the partial sum was receiving garbage values (like 36501) when the values should be lower than 100; I noticed it only happened on row zero ([0][i]) of the matrix.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 5
int main() {
    int i, j, k, l;
    int sum = 0;
    int sumP = 0;
    int A[N][N];
    printf("sumP : %i\n", sumP );
    printf("sum: %i\n", sum);
    #pragma omp parallel shared (A) private (i, j)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++){
                A[i][j] = i%5;
                printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
            }
        }
    }
    #pragma omp parallel shared(A, sum) private (k, l, sumP)
    {
        #pragma omp for
        for (k = 0; k < N; k++) {
            for (l = 0; l < N; l++){
                if (A[l][k] % 5 == 0 && A[l][k] != 0){
                    sumP = sumP + A[k][l];
                    printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
                }
            }
        }
        #pragma omp critical
        sum += sumP;
    }
    //printf("sumP: %i\n", sumP);
    printf("sum: %i\n", sum);
    return (0);
}
I tested it by setting sumP to 0 between the for statements, and it worked:
#pragma omp parallel shared(A, soma) private (k, l, somap2)
{
    #pragma omp for
    for (k = 0; k < N; k++) {
        sumP = 0;
        for (l = 0; l < N; l++){
When I tested it at home it worked without having to set sumP to 0 (for the partial sum sumP), as I did above, but now the final sum result is not correct...
You observe this behavior because private variables in OpenMP are uninitialized. To be precise, they are initialized as if they were local variables without an explicit initialization, which means their initial value is undefined. You observe different behavior on different systems because some combinations of compiler, options, and OS use this "undefined" differently. Your code is incorrect in any case, even if it sometimes produces the correct result.
Now you could do the explicit setting to zero, as you tried. However, I would generally suggest declaring variables as locally as possible instead. This makes reasoning about the (parallel) code much easier, and you can omit the private/shared declarations. Your code would then look like this:
#pragma omp parallel
{
    int sumP = 0;
    #pragma omp for
    for (int k = 0; k < N; k++) {
        for (int l = 0; l < N; l++) {
            if (A[l][k] % 5 == 0 && A[l][k] != 0) {
                sumP = sumP + A[k][l];
                printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[k][l], k, l, sumP);
            }
        }
    }
    #pragma omp critical
    sum += sumP;
}
In addition to that, there is another way to drastically simplify this code by using a reduction:
#pragma omp parallel for reduction(+:sum)
for (int k = 0; k < N; k++) {
    for (int l = 0; l < N; l++) {
        if (A[l][k] % 5 == 0 && A[l][k] != 0) {
            sum += A[k][l];
        }
    }
}
The compiler will basically do the same thing for you (but better) and the code is much cleaner.
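If more parallelism is ever needed (for example when N is small compared to the number of threads), the two loops are perfectly nested, so the same reduction version could also collapse them; this is only a minor variant, not something required for correctness:
#pragma omp parallel for collapse(2) reduction(+:sum)
for (int k = 0; k < N; k++) {
    for (int l = 0; l < N; l++) {
        if (A[l][k] % 5 == 0 && A[l][k] != 0) {
            sum += A[k][l];
        }
    }
}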
Considering that your code spends most of its time doing I/O, it would be a good idea to comment out the printf calls.
As I understand it, sumP should contain the partial sum of your inner loop.
The pragmas have been condensed for readability.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 1000
int main() {
    int i, j;
    int sum = 0;
    int sumP = 0;
    int A[N][N]; // will cause segfault with large N
    printf("sumP : %i\n", sumP );
    printf("sum: %i\n", sum);
    #pragma omp parallel for shared (A) private (i, j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++){
            A[i][j] = i%5; // populate array with numbers in [0,1,2,3,4]
            //printf("Number: %i, pos[%i][%i]\n", A[i][j], i, j);
        }
    }
    #pragma omp parallel for shared(A) private (i, j, sumP) reduction(+: sum)
    for (i = 0; i < N; i++) { // outer (parallel) loop
        sumP = 0; // initialize partial sum
        for (j = 0; j < N; j++){ // inner sequential loop
            //if (A[i][j] % 5 == 0 && A[i][j] != 0){ // Explain this condition
            sumP += A[i][j];
            //printf("numero: %i, pos [%i],[%i] sumP: %i\n", A[i][j], i, j, sumP);
            //}
        }
        //printf("sumP: %i\n", sumP);
        sum += sumP; // add partial sum
    }
    //printf("sumP: %i\n", sumP);
    printf("sum: %i\n", sum);
    return (0);
}
I have two loops which I am parallelizing
#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does that make a difference?
The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This alone makes your code somewhat slower than you probably expected, i.e., your loops are not "embarrassingly" (or "delightfully") parallel, and the parallel loop iterations have to access these variables in shared memory.
What is worse is that, because of this, your code contains race conditions. As a result, it behaves nondeterministically. In other words: your implementation of parallel matrix multiplication is now incorrect! (Go ahead and check the results of your computations. ;))
What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:
int i;
#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j1, k1; /* explicit local copies */
    for (j1 = 0; j1 < nj; j1++) {
        C[i][j1] = 0;
        for (k1 = 0; k1 < nk; ++k1)
            C[i][j1] += A[i][k1] * B[k1][j1];
    }
}

#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j2, k2; /* explicit local copies */
    for (j2 = 0; j2 < nl; j2++) {
        E[i][j2] = 0;
        for (k2 = 0; k2 < nj; ++k2)
            E[i][j2] += C[i][k2] * D[k2][j2];
    }
}
or otherwise declaring them as private in your loop pragmas:
int i, j, k;
#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Will these changes make your parallel implementation faster than your sequential implementation? Hard to say; it depends on your problem size. Parallelisation (in particular parallelisation through OpenMP) comes with some overhead, and only if you spawn enough parallel work will the gain of distributing work over parallel threads outweigh that overhead.
To find out how much work is enough for your code and your software/hardware platform, I advise experimenting by running your code with different matrix sizes. If you also expect "too" small matrix sizes as inputs to your computation, you may want to make parallel processing conditional, for example by decorating your loop pragmas with an if clause:
#pragma omp parallel for private (j, k) if(ni * nj * nk > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}

#pragma omp parallel for private (j, k) if(ni * nl * nj > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}
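For the experimentation step, a small stand-alone harness along these lines can help pick THRESHOLD empirically; omp_get_wtime() returns wall-clock seconds, and the flattened arrays, sizes, and the example threshold here are placeholders rather than anything from the question:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Times a simple i-j-k matrix multiply for several sizes so a suitable
   THRESHOLD for the if clause can be chosen on the target machine. */
static double time_matmul(int n)
{
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for if(n * n * n > 1000000) /* example threshold */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
    }
    double t1 = omp_get_wtime();

    free(A); free(B); free(C);
    return t1 - t0;
}

int main(void)
{
    for (int n = 64; n <= 1024; n *= 2)
        printf("n = %4d: %f s\n", n, time_matmul(n));
    return 0;
}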
How do I parallelize this function in OpenMP for C?
int zeroRow(int **A, int n) {
    int i, j, sum, num = 0;
    for (i = 0; i < n; i++) {
        sum = 0;
        for (j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}
I did this; please check whether it is the right approach.
int zeroRow(int **A, int n) {
    int num = 0;
    #pragma omp parallel for reduction(+:num);
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}
Please tell me whether what I have done is right or wrong: I have parallelized the outer loop using a reduction, so each thread gets its own copy of num.
That looks correctly parallelized.
The only thing you should add is a clause specifying how A is used.
You are relying on the default being shared; it is better to state the sharing explicitly with
#pragma omp parallel for reduction(+:num) default(shared)
or
#pragma omp parallel for reduction(+:num) shared(A)
Also, you do not need to write a semicolon (;) at the end of the pragma line (but writing one is not an error).
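For completeness, here is the function from the question with that explicit clause applied (a sketch of the suggestion above, nothing new beyond it):
int zeroRow(int **A, int n) {
    int num = 0;
    /* A is only read, so sharing it is safe; num is combined by the reduction */
    #pragma omp parallel for reduction(+:num) shared(A)
    for (int i = 0; i < n; i++) {
        int sum = 0; /* per-row sum, private to each iteration */
        for (int j = 0; j < n; j++) {
            sum += A[i][j];
        }
        if (sum == 0) {
            num++;
        }
    }
    return num;
}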
I am trying to multiply two matrices using OpenMP tasks. This is the basic code:
long i, j, k;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i * N + j] += a[i * N + k] * b[k * N + j];
So I want to use tasks at the column level, and I modified the code like this:
long i, j, k;
#pragma omp parallel
{
    #pragma omp single
    {
        for (i = 0; i < N; i++)
            #pragma omp task private(i, j, k)
            {
                for (j = 0; j < N; j++)
                    for (k = 0; k < N; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
When I run the program I get a message like this:
Segmentation fault (core dumped)
Now, I know I'm missing some piece, but I can't figure out what. Any ideas?
Private variables in OpenMP are not initialised and have indeterminate initial values. When the task executes, i therefore has a random value, which most likely leads to an out-of-bounds access of c[] and a[].
firstprivate variables are similar to private ones, but their initial value is set to the value the referenced variable had at the moment the construct is entered. In your case, i has to be firstprivate and not private.
It is also advisable to make i private in the enclosing parallel region for a small performance gain. Thus the final code should look like this (with all data-sharing attributes written explicitly and private variables declared in the scopes where they are used):
#pragma omp parallel shared(a, b, c)
{
    #pragma omp single
    {
        long i;
        for (i = 0; i < N; i++)
            #pragma omp task shared(a, b, c) firstprivate(i)
            {
                int j, k;
                for (j = 0; j < N; j++)
                    for (k = 0; k < N; k++)
                        c[i * N + j] += a[i * N + k] * b[k * N + j];
            }
    }
}
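As an aside beyond this answer: if OpenMP 4.5 or later is available, the same per-row tasking pattern can be written more compactly with taskloop, whose loop index is private to each task automatically. A rough sketch under that assumption:
#pragma omp parallel shared(a, b, c)
{
    #pragma omp single
    {
        /* the runtime splits the iterations into tasks; i is private per task */
        #pragma omp taskloop shared(a, b, c)
        for (long i = 0; i < N; i++) {
            for (long j = 0; j < N; j++)
                for (long k = 0; k < N; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
        }
    }
}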