I have implemented the backward (BW) and forward (FW) substitution algorithms to solve the L and U triangular systems.
My implementation runs very fast serially, but I cannot figure out whether this is the best way to parallelize it.
I think I have taken into account every possible data race (on alpha); am I right?
void solveInverse (double **U, double **L, double **P, int rw, int cw) {
    double **inverseA = allocateMatrix(rw, cw);
    double *x = allocateArray(rw);
    double *y = allocateArray(rw);
    double alpha;

    // Iterate over the columns: each iteration generates one column of the inverse matrix
    for (int j = 0; j < rw; j++) {

        // Lower triangular solve: L y = P(:,j)
        y[0] = P[0][j];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = 1; i < rw; i++) {
            alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        // Upper triangular solve: U x = y
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }

    freeMemory(inverseA, rw);
    free(x);
    free(y);
}
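(allocateMatrix, allocateArray, and freeMemory are helpers that are not shown in the question; a minimal sketch of plausible implementations, assumed here so the snippet is self-contained, would be:)

// Hypothetical helpers, not part of the original question (require <stdlib.h>).
double **allocateMatrix(int rows, int cols) {
    double **m = malloc(rows * sizeof(double *));
    for (int i = 0; i < rows; i++)
        m[i] = calloc(cols, sizeof(double));   // zero-initialized rows
    return m;
}

double *allocateArray(int n) {
    return calloc(n, sizeof(double));          // zero-initialized vector
}

void freeMemory(double **m, int rows) {
    for (int i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}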
After a private discussion with the user dreamcrash, we came to the solution proposed in his comments: create a pair of vectors x and y for each thread, so that each thread works independently on a single column.
After a discussion with the OP in the comments (which were removed afterwards), we both came to the conclusion that:
You do not need to reduce the alpha variable, because its value is recomputed from scratch in each iteration of the loop (it is set to zero and only accumulated within that iteration). Instead, make the alpha variable private.
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
and the same applies to the second parallel region as well.
#pragma omp parallel for
for (int i = rw-2; (i < rw) && (i >= 0); i--) {
    double alpha = 0;
    for (int t = i+1; t < rw; t++)
        alpha += U[i][t] * x[t];
    x[i] = (y[i] - alpha) / U[i][i];
}
Instead of having one parallel region per j iteration, you can extract the parallel region so that it encapsulates the entire outermost loop, and use #pragma omp for instead of #pragma omp parallel for. However, although this reduces the number of parallel regions created from rw to just 1, the speedup from this optimization should not be very significant, because an efficient OpenMP implementation will use a pool of threads: the threads are initialized on the first parallel region and reused on subsequent parallel regions, which already saves most of the overhead of creating and destroying threads.
#pragma omp parallel
{
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        #pragma omp for
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp for
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        #pragma omp for
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
}
I have shown you these code transformations so that you can see some potential tricks to use in future parallelizations. Unfortunately, as it stands, this parallelization will not work.
Why?
Let us look at the first loop:
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
there is a dependency between y[t] being read in alpha += L[i][t] * y[t]; and y[i] being written in y[i] = P[i][j] - alpha;: iteration i needs the values y[0], ..., y[i-1] produced by the previous iterations, so the iterations of this loop cannot safely run in parallel.
So what you can do instead is parallelize the outermost loop (i.e., assign each column to a thread) and create separate x and y arrays for each thread, so that there are no race conditions during the updates/reads of those arrays.
#pragma omp parallel
{
    double *x = allocateArray(rw);
    double *y = allocateArray(rw);

    #pragma omp for
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        for (int i = rw-2; i >= 0; i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }

    free(x);
    free(y);
}
A is a 2D array, n is the matrix size (the matrix is square), and threads is the number of threads the user inputs.
#pragma omp parallel for shared(A,n,k) private(i) schedule(static) num_threads(threads)
for(k = 0; k < n - 1; ++k) {
    // for the vectoriser
    for(i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    for(i = k + 1; i < n; i++) {
        long long int j;
        const double Aik = A[i][k];
        for(j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}
I tried using collapse but it failed; the error it showed was: work-sharing region may not be closely nested inside of work-sharing, ‘loop’, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or ‘task loop’ region.
After what I thought was a correct parallelization, the execution time increased as I ran the code with more threads.
I tried using collapse again; this is the output:

26:17: error: collapsed loops not perfectly nested before ‘for’
   26 | for(i = k + 1; i < n; i++) {

This is LU decomposition.
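For context, collapse requires the loops to be perfectly nested, which is not the case here (two separate i loops sit directly inside the k loop), and the k loop itself carries a dependency from one step to the next, so parallelizing it is not safe. A sketch of a commonly used alternative (assuming no pivoting is needed, and not benchmarked for this exact code) keeps the k loop serial and parallelizes the i loops instead:

// Sketch: serial k loop, parallel i loops.
// Each parallel for ends with an implicit barrier, so the column scaling
// finishes before the trailing update that reads A[i][k].
for (int k = 0; k < n - 1; ++k) {
    #pragma omp parallel for schedule(static) num_threads(threads)
    for (int i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    #pragma omp parallel for schedule(static) num_threads(threads)
    for (int i = k + 1; i < n; i++) {
        const double Aik = A[i][k];
        for (long long int j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}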
I have a function that I want to parallelize. This is the serial version.
void parallelCSC_SpMV(float *x, float *b)
{
    int i, j;
    for (i = 0; i < numcols; i++)
    {
        for (j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
        {
            b[irem[j] - 1] += xrem[j] * x[i];
        }
    }
}
I figured a decent way to do this was to have each thread write to a private copy of the b array (which does not need to be protected as a critical section because it's a private copy); after a thread is done, it copies its results into the actual b array. Here is my code.
void parallelCSC_SpMV(float *x, float *b)
{
    int i, j, k;
    #pragma omp parallel private(i, j, k)
    {
        float *b_local = (float*) malloc(sizeof(b));
        #pragma omp for nowait
        for (i = 0; i < numcols; i++)
        {
            for (j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++)
            {
                float current_add = xrem[j] * x[i];
                int index = irem[j] - 1;
                b_local[index] += current_add;
            }
        }
        for (k = 0; k < sizeof(b) / sizeof(b[0]); k++)
        {
            // Separate question: Is this if statement allowed?
            //if (b_local[k] == 0) { continue; }
            #pragma omp atomic
            b[k] += b_local[k];
        }
    }
}
However, I get a segmentation fault as a result of the second for loop. I do not need a #pragma omp for on that loop because I want each thread to execute it fully. If I comment out the contents of that loop, there is no segmentation fault. I am not sure what the issue could be.
You're probably trying to access an out-of-range position in the dynamic array b_local.
Note that sizeof(b) returns the size in bytes of a float* (the size of a pointer), not the size of the array it points to.
If you want to know the size of the array that you are passing to the function, I would suggest adding it to the function's parameters.
void parallelCSC_SpMV(float *x, float *b, int b_size){
    ...
    float *b_local = (float*) malloc(sizeof(float) * b_size);
    ...
}
Also, if the size of colptrs is numcols, I would be careful with colptrs[i+1], since when i = numcols-1 you will have another out-of-range access.
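If that is indeed the case (a hypothetical; in the usual CSC layout colptrs has numcols+1 entries), a defensive variant of the inner loop could cap the last column with the total number of stored nonzeros, here assumed to be available as nnz:

// nnz is assumed to hold the total number of stored nonzeros (not in the original code).
int upper = (i + 1 < numcols) ? (colptrs[i + 1] - 1) : nnz;
for (j = colptrs[i] - 1; j < upper; j++)
    b[irem[j] - 1] += xrem[j] * x[i];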
First, as pointed out by Jim Cownie:
In all of these answers, b_local is uninitialised, yet you are adding
to it. You need to use calloc instead of malloc
Just to add to the accepted answer, I think you can try the following approach to avoid calling malloc in parallel, and also to avoid the overhead of #pragma omp atomic.
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
    float *b_local[num_threads];
    for (int i = 0; i < num_threads; i++)
        b_local[i] = calloc(b_size, sizeof(float));

    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < numcols; i++) {
            for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++) {
                float current_add = xrem[j] * x[i];
                int index = irem[j] - 1;
                b_local[tid][index] += current_add;
            }
        }
    }

    for (int id = 0; id < num_threads; id++)
    {
        #pragma omp for simd
        for (int k = 0; k < b_size; k++)
        {
            b[k] += b_local[id][k];
        }
        free(b_local[id]);
    }
}
I have not tested the performance of this, so please feel free to do so and provide feedback.
You can further optimize by reusing the original b for the master thread instead of creating a b_local for it, as follows:
void parallelCSC_SpMV(float *x, float *b, int b_size, int num_threads) {
    float *b_local[num_threads-1];
    for (int i = 0; i < num_threads-1; i++)
        b_local[i] = calloc(b_size, sizeof(float));

    #pragma omp parallel num_threads(num_threads)
    {
        int tid = omp_get_thread_num();
        float *thread_b = (tid == 0) ? b : b_local[tid-1];
        #pragma omp for
        for (int i = 0; i < numcols; i++) {
            for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++) {
                float current_add = xrem[j] * x[i];
                int index = irem[j] - 1;
                thread_b[index] += current_add;
            }
        }
    }

    for (int id = 0; id < num_threads-1; id++)
    {
        #pragma omp for simd
        for (int k = 0; k < b_size; k++)
        {
            b[k] += b_local[id][k];
        }
        free(b_local[id]);
    }
}
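As a side note (not part of the original answer), if your compiler supports OpenMP 4.5 or later, an array-section reduction lets the runtime manage the per-thread copies for you. A minimal sketch, assuming b has b_size elements (the function name here is only for illustration):

void parallelCSC_SpMV_reduction(float *x, float *b, int b_size) {
    // reduction(+:b[:b_size]) gives each thread a private, zero-initialized
    // copy of b[0..b_size) and sums the copies at the end of the region.
    #pragma omp parallel for reduction(+:b[:b_size])
    for (int i = 0; i < numcols; i++) {
        for (int j = colptrs[i] - 1; j < colptrs[i+1] - 1; j++) {
            b[irem[j] - 1] += xrem[j] * x[i];
        }
    }
}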
I made this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, it displays the answer (mostly) randomly, with varying indices of the resultant matrix. Here is the snippet of the code:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        #pragma omp parallel for
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
        printf("C:i=%d j=%d %f \n", i, j, C[i][j]);
    }
}
These are the symptoms of a so-called "race condition", as the commenters already stated.
The threads OpenMP uses are independent of each other, but the results of the individual loop iterations of the matrix multiplication are not, so one thread might be at a different position than another, and suddenly you are in trouble if you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    n = atoi(argv[1]);
    if (n <= 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // make it repeatable
    srand(0xdeadbeef);

    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // and C but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }

    // some coarse timing
    gettimeofday(&start, NULL);

    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);

    gettimeofday(&start, NULL);
    #pragma omp parallel shared(A, B, C) private(i, j, k)
    #pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);

    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }

    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);

    puts("All ok? Bye");
    exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But that is not fully true. You could (but shouldn't) try to parallelize the innermost loop with something like:
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
It does not seem to be faster; it is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error-bars.
A better way to multiply large matrices (at ca. n > 100) would be the Strassen algorithm.
Oh: I just used square matrices for convenience, not because they must be of that form! But if you have rectangular matrices with a large aspect ratio, you might try to change the way the loops run; column-first or row-first order can make a significant difference here.
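As one illustration of how the loop order matters, here is a sketch (not benchmarked here) of the same parallel multiplication with the loops reordered from i-j-k to i-k-j, using the arrays from the program above:

// i-k-j loop order: the innermost loop walks B[k][j] and D[i][j] contiguously,
// which is usually friendlier to the cache than the naive i-j-k order.
// D must already be zero-initialized, as in the program above.
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        const double Aik = A[i][k];
        for (j = 0; j < n; j++) {
            D[i][j] += Aik * B[k][j];
        }
    }
}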
For each of the following code segments, use OpenMP pragmas to make the loop parallel, or
explain why the code segment is not suitable for parallel execution.
a. for (i = 0; i < sqrt(x); i++) {
       a[i] = 2.3 * i;
       if (i < 10)
           b[i] = a[i];
   }

b. flag = 0;
   for (i = 0; i < n && !flag; i++) {
       a[i] = 2.3 * i;
       if (a[i] < b[i])
           flag = 1;
   }

c. for (i = 0; i < n && !flag; i++)
       a[i] = foo(i);

d. for (i = 0; i < n && !flag; i++) {
       a[i] = foo(i);
       if (a[i] < b[i])
           a[i] = b[i];
   }

e. for (i = 0; i < n && !flag; i++) {
       a[i] = foo(i);
       if (a[i] < b[i])
           break;
   }

f. dotp = 0;
   for (i = 0; i < n; i++)
       dotp += a[i] * b[i];

g. for (i = k; i < 2 * k; i++)
       a[i] = a[i] + a[i - k];

h. for (i = k; i < n; i++) {
       a[i] = c * a[i - k];
   }
Any help regarding the above question would be very much welcome... any line of thinking...
I will not do your HW, but I will give a hint. When playing around with OpenMP for loops, you should pay attention to the scope of the variables. For example:
#pragma omp parallel for
for (int x = 0; x < width; x++)
{
    for (int y = 0; y < height; y++)
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}
is OK, since x and y are private variables.
What about
int x, y;
#pragma omp parallel for
for (x = 0; x < width; x++)
{
    for (y = 0; y < height; y++)
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}
?
Here, we have defined x and y outside of the for loops. Now consider y: every thread will read and write it without any synchronization, so data races will occur, and these are very likely to result in logical errors.
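A minimal way to fix that second version, spelling out the advice above, is to declare y private (the loop variable x of the parallelized loop is made private by OpenMP automatically):

int x, y;
// Each thread now gets its own copy of y, so the inner loop no longer races.
#pragma omp parallel for private(y)
for (x = 0; x < width; x++)
{
    for (y = 0; y < height; y++)
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}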
Read more here and good luck with your HW.
I have two loops which I am parallelizing
#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Strangely, the sequential execution is much faster than the parallel version above, even when using a large number of threads. Am I doing something wrong? Note that all arrays are global. Does that make a difference?
The iterations of your parallel outer loops share the index variables (j and k) of their inner loops. This certainly makes your code somewhat slower than you probably expected, i.e., your loops are not "embarrassingly" (or "delightfully") parallel, and the parallel loop iterations need to somehow access these variables in shared memory.
Worse, because of this your code contains race conditions. As a result, it will behave nondeterministically. In other words: your implementation of parallel matrix multiplication is now incorrect! (Go ahead and check the results of your computations. ;))
What you want to do is make sure that all iterations of your outer loops have their own private copies of the index variables j and k. You can achieve this either by declaring these variables within the scope of the parallel loops:
int i;
#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j1, k1; /* explicit local copies */
    for (j1 = 0; j1 < nj; j1++) {
        C[i][j1] = 0;
        for (k1 = 0; k1 < nk; ++k1)
            C[i][j1] += A[i][k1] * B[k1][j1];
    }
}

#pragma omp parallel for
for (i = 0; i < ni; i++) {
    int j2, k2; /* explicit local copies */
    for (j2 = 0; j2 < nl; j2++) {
        E[i][j2] = 0;
        for (k2 = 0; k2 < nj; ++k2)
            E[i][j2] += C[i][k2] * D[k2][j2];
    }
}
or otherwise declaring them as private in your loop pragmas:
int i, j, k;
#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
        C[i][j] = 0;
        for (k = 0; k < nk; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }

#pragma omp parallel for private(j, k)
for (i = 0; i < ni; i++)
    for (j = 0; j < nl; j++) {
        E[i][j] = 0;
        for (k = 0; k < nj; ++k)
            E[i][j] += C[i][k] * D[k][j];
    }
Will these changes make your parallel implementation faster than your sequential implementation? Hard to say; it depends on your problem size. Parallelisation (in particular parallelisation through OpenMP) comes with some overhead: only if you spawn enough parallel work will the gain of distributing work over parallel threads outweigh the incurred overhead costs.
To find out how much work is enough for your code and your software/hardware platform, I advise experimenting by running your code with different matrix sizes. Then, if you also expect "too small" matrix sizes as inputs to your computation, you may want to make parallel processing conditional (for example, by decorating your loop pragmas with if-clauses):
#pragma omp parallel for private(j, k) if(ni * nj * nk > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}

#pragma omp parallel for private(j, k) if(ni * nl * nj > THRESHOLD)
for (i = 0; i < ni; i++) {
    ...
}