Gaussian elimination in OpenMP - unable to parallelize (C)

I am trying to parallelise Gaussian elimination with pivoting using OpenMP.
Below is the relevant section of the code I wrote:
struct timeval tvBegin, tvEnd;
gettimeofday(&tvBegin, NULL);

for (k = 1; k <= n - 1; ++k) {
    amax = (double) fabs(a[k][k]);
    m = k;
    for (i = k + 1; i <= n; i++) {      /* Find the row with the largest pivot */
        xfac = (double) fabs(a[i][k]);
        if (xfac > amax) { amax = xfac; m = i; }
    }
    if (m != k) {                        /* Row interchanges */
        rowx = rowx + 1;
        temp1 = b[k];
        b[k]  = b[m];
        b[m]  = temp1;
        for (j = k; j <= n; j++) {
            temp    = a[k][j];
            a[k][j] = a[m][j];
            a[m][j] = temp;
        }
    }
    #pragma omp parallel for private(i, j)
    for (i = k + 1; i <= n; ++i) {
        xfac = a[i][k] / a[k][k];
        for (j = k + 1; j <= n; ++j) {
            a[i][j] = a[i][j] - xfac * a[k][j];
        }
        b[i] = b[i] - xfac * b[k];
    }
    matrix_print_off(n, n, a);
}
gettimeofday(&tvEnd, NULL);
printf("\nTime elapsed in ms: %d\n", diff_ms(tvEnd, tvBegin));
I tested this code with a 1000x1000 matrix. The average running time (measured via diff_ms) on a 4-core machine comes out the same (2142 ms) as for the sequential version of this code (without the pragmas). Since the elimination step offers plenty of parallelism, this shouldn't be the case. Could you please let me know where I went wrong?
For reference, I have also attached the diff_ms function below.
int diff_ms(struct timeval t1, struct timeval t2)
{
    return (((t1.tv_sec - t2.tv_sec) * 1000) +
            (t1.tv_usec - t2.tv_usec) / 1000);
}
Thanks!

Inside your outer loop you call matrix_print_off(). Even if that print function is thread safe, it significantly reduces the amount of parallelism you can achieve. Additionally, if matrix_print_off() uses blocking IO, this function's time may dominate the rest of your function.
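As an illustrative sketch (not the poster's exact code), here is the elimination step with matrix_print_off() hoisted out of the timed loop; it also lists xfac in the private clause, since in the posted pragma xfac is shared and written by every thread:

/* Sketch only: same loop as above, with the print call moved out of the
   timed k-loop and xfac privatized so each thread keeps its own multiplier. */
for (k = 1; k <= n - 1; ++k) {
    /* ... pivot search and row interchange exactly as before ... */
    #pragma omp parallel for private(i, j, xfac)
    for (i = k + 1; i <= n; ++i) {
        xfac = a[i][k] / a[k][k];
        for (j = k + 1; j <= n; ++j)
            a[i][j] = a[i][j] - xfac * a[k][j];
        b[i] = b[i] - xfac * b[k];
    }
}
matrix_print_off(n, n, a);   /* print once, after the elimination finishes */

With the IO out of the measured region, the remaining update is a straightforward data-parallel loop, so a 1000x1000 matrix on a 4-core machine should then show a measurable speedup.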

Related

Parallelise nested for loop where inner depends on outer with OpenMP

I have a function in C which I have to parallelize using OpenMP with static scheduling for n threads:
void resolveCollisions(){
    int i, j;
    double dx, dy, dz, md;
    for (i = 0; i < bodies - 1; i++)
        for (j = i + 1; j < bodies; j++) {
            md = masses[i] + masses[j];
            dx = fabs(positions[i].x - positions[j].x);
            dy = fabs(positions[i].y - positions[j].y);
            dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                vector temp = velocities[i];
                velocities[i] = velocities[j];
                velocities[j] = temp;
            }
        }
}
So in order to parallelize this I added a #pragma omp parallel for directive to spread the outer loop across the n threads. I also added the schedule(static) clause, which I have to use, and num_threads(n), which takes n from the function parameter to set the desired number of threads. I also added a critical section to prevent race conditions when updating the velocities array.
void resolveCollisions_openMP_static(int n) {
    int i, j;
    double dx, dy, dz, md;
    #pragma omp parallel for schedule(static) num_threads(n)
    for (i = 0; i < bodies - 1; i++) {
        for (j = i + 1; j < bodies; j++) {
            md = masses[i] + masses[j];
            dx = fabs(positions[i].x - positions[j].x);
            dy = fabs(positions[i].y - positions[j].y);
            dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                vector temp = velocities[i];
                #pragma omp critical
                {
                    velocities[i] = velocities[j];
                    velocities[j] = temp;
                }
            }
        }
    }
}
When I run this function, though, it gives wrong results. I imagine it has something to do with the inner loop using i to initialise j in j = i + 1, but I don't know whether that is the actual issue or how to fix it. I would appreciate any help. Thank you.
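A sketch of one possible direction, assuming the wrong results come from the shared per-iteration variables rather than from the j = i + 1 pattern itself: in the version above j, dx, dy, dz, md and temp are shared between threads, which is a data race. Declaring them inside the loop body makes them private to each thread, and reading velocities[i] inside the critical section keeps the whole swap atomic:

/* Sketch, not a verified answer: per-iteration variables are declared inside
   the loops so each thread gets its own copies, and the velocities[i] read is
   moved inside the critical section so the swap is done as one atomic step. */
void resolveCollisions_openMP_static(int n) {
    #pragma omp parallel for schedule(static) num_threads(n)
    for (int i = 0; i < bodies - 1; i++) {
        for (int j = i + 1; j < bodies; j++) {
            double md = masses[i] + masses[j];
            double dx = fabs(positions[i].x - positions[j].x);
            double dy = fabs(positions[i].y - positions[j].y);
            double dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                #pragma omp critical
                {
                    vector temp = velocities[i];
                    velocities[i] = velocities[j];
                    velocities[j] = temp;
                }
            }
        }
    }
}

Even with these races removed, a body that collides with more than one other body can have its velocity swapped in a different order than in the serial loop, so the results may still not match the serial version exactly.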

I am having trouble with OpenMP on C

I want to parallelize the for loops but I can't seem to grasp the concept; every time I try, the code still works but slows down dramatically.
for (i = 0; i < nbodies; ++i) {
    for (j = i + 1; j < nbodies; ++j) {
        d2 = 0.0;
        for (k = 0; k < 3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k] * rij[k];
            if (d2 <= cut2) {
                d = sqrt(d2);
                d3 = d * d2;
                for (k = 0; k < 3; ++k) {
                    double f = -rij[k] / d3;
                    forces[i][k] += f;
                    forces[j][k] -= f;
                }
                ene += -1.0 / d;
            }
        }
    }
}
I tried using synchronization with barrier and critical in some cases, but either nothing happens or the run simply does not end.
Update: this is the state I'm at right now. It works without crashes, but calculation times get worse the more threads I add (Ryzen 5 2600, 6 cores / 12 threads).
#pragma omp parallel shared(d, d2, d3, nbodies, rij, pos, cut2, forces) private(i, j, k) num_threads(n)
{
    clock_t begin = clock();
    #pragma omp for schedule(auto)
    for (i = 0; i < nbodies; ++i) {
        for (j = i + 1; j < nbodies; ++j) {
            d2 = 0.0;
            for (k = 0; k < 3; ++k) {
                rij[k] = pos[i][k] - pos[j][k];
                d2 += rij[k] * rij[k];
            }
            if (d2 <= cut2) {
                d = sqrt(d2);
                d3 = d * d2;
                #pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
                for (k = 0; k < 3; ++k) {
                    double f = -rij[k] / d3;
                    #pragma omp atomic
                    forces[i][k] += f;
                    #pragma omp atomic
                    forces[j][k] -= f;
                }
                ene += -1.0 / d;
            }
        }
    }
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    #pragma omp single
    printf("Calculation time %lf sec\n", time_spent);
}
I incorporated the timer into the actual parallel code (I think it is some milliseconds faster this way). Also, I think I got most of the shared and private variables right. The program writes the forces to the output file.
Using barriers or other synchronization will slow down your code unless the amount of unsynchronized work is larger by a good factor, and that is not the case here. You probably need to reformulate your code to remove the synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it as one, but it is probably enough to make that variable private to the i,j iterations.
You should always define your variables in their minimal required scope, especially if performance is an issue. (Note that if you do so, your compiler can create more efficient code.) Besides performance, it also helps avoid data races.
I think you have misplaced a curly brace, and the condition in the first for loop should be i < nbodies-1. Variable ene can be summed up using a reduction, and to avoid data races atomic operations have to be used to update the forces array, so you do not need slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for the calculation):
#pragma omp parallel for reduction(+:ene)
for (int i = 0; i < nbodies - 1; ++i) {
    for (int j = i + 1; j < nbodies; ++j) {
        double d2 = 0.0;
        double rij[3];
        for (int k = 0; k < 3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k] * rij[k];
        }
        if (d2 <= cut2) {
            double d = sqrt(d2);
            double d3 = d * d2;
            for (int k = 0; k < 3; ++k) {
                double f = -rij[k] / d3;
                #pragma omp atomic
                forces[i][k] += f;
                #pragma omp atomic
                forces[j][k] -= f;
            }
            ene += -1.0 / d;
        }
    }
}
Solved: turns out all I needed was
#pragma omp parallel for nowait
It doesn't need the "atomic" either.
Weird solution, I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.

OpenMP: 2 Nested For loops inside of a While loop. How to fix for multi-threaded functionality? (Jacobi Solver)

I'm attempting to parallelize a Jacobi grid solver using OpenMP.
When 1 thread is used:
As it stands, the code executes correctly when only a single thread is assigned, and produces the same results as a reference single-threaded function (not shown).
The while loop breaks when the difference variable is less than "0.01000" (as it should).
When two or more threads are used:
The code runs through the outer while loop only once.
The difference value from the first thread is way above 0.0100 (as it should be), but the difference value given by the other thread(s) drops below it almost instantly, so the loop breaks without doing any of the calculations.
I've tested many combinations of strategically placing the variables in the shared/private/reduction clauses, hoping to get the diff value to accumulate correctly over all threads used. I get that the "diff" variable should be shared by all threads, but what I've tried has not accumulated the values from all of them. I'm not sure what else I can try.
Thanks for your time and input
int
compute_using_omp_jacobi (grid_t *grid, int num_threads)
{
    int i, j;
    int num_iter = 0;
    int done = 0;
    double diff;
    float old, new;
    float eps = 1e-2;          /* Convergence criteria. */
    int num_elements;

    omp_set_num_threads(num_threads);

    #pragma omp parallel default(none) shared(grid, eps, done, diff) private(i, j, old, new, num_elements) reduction(+:num_iter)
    while (!done) {            /* While we have not converged yet. */
        diff = 0.0;
        num_elements = 0;

        #pragma omp for reduction(+:diff) collapse(2)
        for (i = 1; i < (grid->dim - 1); i++)
            for (j = 1; j < (grid->dim - 1); j++) {
                old = grid->element[i * grid->dim + j];   /* Store old value of grid point. */
                /* Apply the update rule. */
                new = 0.25 * (grid->element[(i - 1) * grid->dim + j] +
                              grid->element[(i + 1) * grid->dim + j] +
                              grid->element[i * grid->dim + (j + 1)] +
                              grid->element[i * grid->dim + (j - 1)]);
                grid->element[i * grid->dim + j] = new;   /* Update the grid-point value. */
                diff = diff + fabs(new - old);            /* Calculate the difference in values. */
                num_elements++;
                //printf ("DIFF %f.", diff);
            }

        /* End of an iteration. Check for convergence. */
        diff = diff / num_elements;
        printf("Iteration %d. DIFF: %f.\n", num_iter, diff);
        // printf ("number of elements %d.", num_elements);
        num_iter++;

        if (diff < eps)
            done = 1;
    }
    return num_iter;
}
You can't parallelize the while loop, since the values for grid->element in each iteration depend on the values from the previous iteration.
You'll have to move the #pragma omp parallel inside the while (to before the first for loop).
num_elements should be named in the reduction clause, and new and old should be declared within the body of the inner for loop.
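A minimal sketch of what that restructuring might look like (my own rewrite, keeping the grid_t layout and eps from the question): the outer while loop stays serial, the work-sharing construct is opened inside it, both accumulators are reductions, and old/new live inside the loop body.

int compute_using_omp_jacobi(grid_t *grid, int num_threads)
{
    int num_iter = 0;
    int done = 0;
    float eps = 1e-2;                     /* Convergence criterion. */

    omp_set_num_threads(num_threads);

    while (!done) {                       /* Serial outer loop: one Jacobi sweep per pass. */
        double diff = 0.0;
        int num_elements = 0;

        /* The parallel region now lives inside the while loop, and both
           diff and num_elements are accumulated via a reduction. */
        #pragma omp parallel for collapse(2) reduction(+: diff, num_elements)
        for (int i = 1; i < grid->dim - 1; i++) {
            for (int j = 1; j < grid->dim - 1; j++) {
                float old = grid->element[i * grid->dim + j];
                float new = 0.25 * (grid->element[(i - 1) * grid->dim + j] +
                                    grid->element[(i + 1) * grid->dim + j] +
                                    grid->element[i * grid->dim + (j + 1)] +
                                    grid->element[i * grid->dim + (j - 1)]);
                grid->element[i * grid->dim + j] = new;
                diff += fabs(new - old);
                num_elements++;
            }
        }

        diff = diff / num_elements;       /* Convergence check runs serially, once per sweep. */
        printf("Iteration %d. DIFF: %f.\n", num_iter, diff);
        num_iter++;
        if (diff < eps)
            done = 1;
    }
    return num_iter;
}

Note that, as in the original code, the sweep still updates grid->element in place, so neighbouring rows handled by different threads may read partially updated values; writing into a separate output buffer and swapping it each sweep would make it a textbook Jacobi iteration. Keeping a single parallel region around the whole while loop is also possible, but then the convergence test needs a barrier and a single-thread update of done.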

C pragma omp parallel

I've just started with OpenMP and I need help.
I have a program that I need to parallelize. This is what I have:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

#define N1 3000000
#define it 5

struct timeval t0, t1;
int i, itera_kop;
int A[N1], B[N1];

void Exe_Denbora(char *pTestu, struct timeval *pt0, struct timeval *pt1)
{
    double tej;
    tej = (pt1->tv_sec - pt0->tv_sec) + (pt1->tv_usec - pt0->tv_usec) / 1e6;
    printf("%s = %10.3f ms (%d threads)\n", pTestu, tej * 1000, omp_get_max_threads());
}

void sum(char *pTestu, int *b, int n)
{
    double bat = 0;
    int i;
    for (i = 0; i < n; i++) bat += b[i];
    printf("sum: %.1f\n", bat);
}

main()
{
    for (itera_kop = 1; itera_kop < it; itera_kop++)
    {
        for (i = 0; i < N1; i++)
        {
            A[i] = 1;
            B[i] = 3;
        }
        gettimeofday(&t0, 0);
        #pragma omp parallel for private(i)
        for (i = 2; i < N1; i++)
        {
            A[i] = 35 / (7 / B[i-1] + 2 / A[i]);
            B[i] = B[i] / (A[i-1] + 2) + 3 / B[i];
        }
        gettimeofday(&t1, 0);
        Exe_Denbora("T1", &t0, &t1);
        printf("\n");
    }
    printf("\n\n");
    sum("A", A, N1);
    sum("B", B, N1);
}
If I execute the code without using #pragma omp parallel for I get:
A sum: 9000005.5
B sum: 3000005.5
But if I try to parallelize the code I get:
A sum: 9000284.0
B sum: 3000036.0
using 32 threads.
I would like to know why I can't parallelize the code that way.
As you are likely aware, your problem is in this for loop: there is a dependency between the two lines in the loop body.
for (i = 2; i < N1; i++)
{
    A[i] = 35 / (7 / B[i-1] + 2 / A[i]);
    B[i] = B[i] / (A[i-1] + 2) + 3 / B[i];
}
We cannot know the order in which any given thread reaches either of those two lines. Therefore, as an example, when the second line executes, the value produced for B[i] will differ depending on whether A[i-1] has already been changed by another thread or not. The same can be said of A[i]'s dependency on the value of B[i-1]. A short and clear explanation of dependencies can be found at the following link; I would recommend you take a look if this still is not clear. https://scs.senecac.on.ca/~gpu621/pages/content/omp_2.html
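As a concrete illustration (my own worked example using the posted initial values A[i] = 1, B[i] = 3 and integer division): the serial loop computes A[2] = 35 / (7/3 + 2/1) = 35/4 = 8 and then B[2] = 3/(1+2) + 3/3 = 2, so the i = 3 iteration sees B[2] = 2 and gets A[3] = 35 / (7/2 + 2/1) = 35/5 = 7. A thread that starts iteration i = 3 before B[2] has been updated still sees B[2] = 3 and computes A[3] = 35 / (7/3 + 2/1) = 35/4 = 8 instead, which is why the parallel version produces different sums.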

Parallelize C code for 2D Haar wavelet transform with OpenMP

This is my first question. I'm trying to parallelize a 2D Haar transform function in C with OpenMP. I obtained it here and modified it accordingly.
The program takes a black & white image, puts it into a matrix and computes one level of the Haar wavelet transform. At the end it normalizes the values and writes the transformed image to disk.
This is the resulting image after 1 level of the HDT.
My problem is that the parallelized version runs quite a bit slower than the serial one.
For now I attach here a snippet from the main part I want to parallelize (later on I can put up all the surrounding code):
void haar_2d ( int m, int n, double u[] )
/* m and n are the dimensions (every image is a perfect square);
   u is the input array in row-major order (not column-major). */
{
    int i;
    int j;
    int k;
    double s;
    double *v;
    int tid, nthreads, chunk;

    s = sqrt ( 2.0 );
    v = ( double * ) malloc ( m * n * sizeof ( double ) );

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < m; i++ )
        {
            v[i+j*m] = u[i+j*m];
        }
    }
    /*
      Determine K, the largest power of 2 such that K <= M.
    */
    k = 1;
    while ( k * 2 <= m )
    {
        k = k * 2;
    }
    /* Transform all columns. */
    while ( n/2 < k )   /* just 1 level of transformation */
    {
        k = k / 2;
        clock_t begin = clock();

        #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid)
        {
            tid = omp_get_thread_num();
            printf("Thread %d starting...\n", tid);

            #pragma omp for schedule (dynamic)
            for ( j = 0; j < n; j++ )
            {
                for ( i = 0; i < k; i++ )
                {
                    v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                    v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                }
            }

            #pragma omp for schedule (dynamic)
            for ( j = 0; j < n; j++ )
            {
                for ( i = 0; i < 2 * k; i++ )
                {
                    u[i+j*m] = v[i+j*m];
                }
            }
        } /* end parallel */

        clock_t end = clock();
        double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
        printf ( "Time for COLUMNS: %f ms\n", time_spent * 1000 );
    } /* end while */

    /* [...] code for rows */
    free ( v );
    return;
}
The timings more or less are:
Time for COLUMNS: 160.519000 ms // parallel
Time for COLUMNS: 62.842000 ms // serial
I have tried to rearrange the pragmas in lots of different ways, e.g. with a static schedule, with sections, tasks and so on, and also rearranging the data scopes of the variables and allocating dynamically inside the parallel regions.
I thought it would be simple to parallelize a two-level for loop, but I have now been struggling with it for two days. I'm asking for your help; I've already checked nearly all the related questions here, but I'm still not able to move on or, at least, understand the reasons. Thank you in advance.
(CPU: Intel Core i3-4005U @ 1.70GHz, 4 threads, 2 cores)
UPDATE:
1) What about m & n: the function is supposed to handle rectangular images too one day, so I just left both parameters there.
2) I figured out that u is actually a normal array with a linearized matrix inside, stored row by row (I use PGM images).
3) The memcpy is a better option, so now I'm using it.
As for the main topic, I've tried to divide the job over n by spawning a task for each chunk, and the result is a little bit faster than the serial code.
Now I know that the input matrix u is in proper row-major order and the two for loops seem to proceed accordingly, but I'm not sure about the timings: using both omp_get_wtime() and clock() I don't know how to measure the speedup. I did tests with different image sizes, from 16x16 up to 4096x4096, and the parallel version seems to be slower with clock() and faster with omp_get_wtime() and gettimeofday().
Do you have any suggestions on how to handle this correctly with OpenMP, or at least how to measure the speedup correctly?
while ( n/2 < k )
{
    k = k / 2;
    double start_time = omp_get_wtime();
    // clock_t begin = clock();

    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(k)
    {
        nthreads = omp_get_num_threads();
        #pragma omp single
        {
            printf("Number of threads = %d\n", nthreads);
            int chunk = n/nthreads;
            printf("Chunks size = %d\n", chunk);
            printf("Thread %d is starting the tasks.\n", omp_get_thread_num());
            int h;
            for (h = 0; h < n; h = h + chunk) {
                printf("FOR CYCLE i=%d\n", h);
                #pragma omp task shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(h,k)
                {
                    tid = omp_get_thread_num();
                    printf("Thread %d starts at %d position\n", tid, h);
                    for ( j = h; j < h + chunk; j++ )
                    {
                        for ( i = 0; i < k; i++ )
                        {
                            v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                            v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                        }
                    }
                } // end task
            } // end launching for
            #pragma omp taskwait
        } // end single
    } // end parallel region

    // clock_t end = clock();
    // double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    // printf ( "COLUMNS: %f ms\n", time_spent * 1000);
    double time = omp_get_wtime() - start_time;
    printf ( "COLUMNS: %f ms\n", time*1000 );

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < 2 * k; i++ )
        {
            u[i+j*m] = v[i+j*m];
        }
    }
} // end while
I have a few questions that deeply concern me about your code.
m & n are the dimensions (every image is a perfect square)
Then why are there two size parameters?
u is the input array in column-major order
This is an incredibly bad idea. C uses a row-major ordering for memory, so column-major indexing leads to strided memory access. This is very, very bad for performance. If at all possible, you need to fix this.
Because both u and v are linearized matrices, this
for (int j = 0; j < n; j++) {
    for (int i = 0; i < m; i++) {
        v[i + j * m] = u[i + j * m];
    }
}
can be replaced with a call to memcpy.
memcpy(v, u, m * n * sizeof(double));
On to your issue. The reason your OpenMP version is slower is that all of your threads are doing the same thing. This isn't useful and leads to bad things like false sharing. You need to use each thread's id (tid in your code) to partition the data across the threads, keeping in mind that false sharing is bad.
The problem was that I was using clock() instead of omp_get_wtime(), thanks to Z boson.
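For context on that fix: clock() measures CPU time accumulated over all threads of the process, so a parallelized region reports roughly the wall-clock time multiplied by the number of busy threads and therefore looks slower as threads are added. omp_get_wtime() (or gettimeofday()) measures elapsed wall-clock time, which is what a speedup measurement needs. A minimal sketch of timing the column transform from the question this way:

/* Sketch: wall-clock timing of a parallel region. clock() would add up the
   CPU time of every thread, so it is not suitable for measuring speedup. */
double t_start = omp_get_wtime();
#pragma omp parallel for private(i)
for (j = 0; j < n; j++) {
    for (i = 0; i < k; i++) {
        v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
        v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
    }
}
double t_end = omp_get_wtime();
printf("Time for COLUMNS: %f ms\n", (t_end - t_start) * 1000.0);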
