Parallelise nested for loop where inner depends on outer with OpenMP - c

I have a function in C that I have to parallelize using OpenMP with static scheduling for n threads:
void resolveCollisions() {
    int i, j;
    double dx, dy, dz, md;
    for (i = 0; i < bodies - 1; i++)
        for (j = i + 1; j < bodies; j++) {
            md = masses[i] + masses[j];
            dx = fabs(positions[i].x - positions[j].x);
            dy = fabs(positions[i].y - positions[j].y);
            dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                vector temp = velocities[i];
                velocities[i] = velocities[j];
                velocities[j] = temp;
            }
        }
}
So in order to parallelize this I added a #pragma omp parallel for directive to parallelize the outer loop across the n threads. I also added the static scheduling clause, which I have to use, and num_threads(n), which takes n from the function parameter so the desired number of threads is used. I also thought about adding a critical section to prevent race conditions when updating the velocities array.
void resolveCollisions_openMP_static(int n) {
    int i, j;
    double dx, dy, dz, md;
    #pragma omp parallel for schedule(static) num_threads(n)
    for (i = 0; i < bodies - 1; i++) {
        for (j = i + 1; j < bodies; j++) {
            md = masses[i] + masses[j];
            dx = fabs(positions[i].x - positions[j].x);
            dy = fabs(positions[i].y - positions[j].y);
            dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                vector temp = velocities[i];
                #pragma omp critical
                {
                    velocities[i] = velocities[j];
                    velocities[j] = temp;
                }
            }
        }
    }
}
When I run this function, though, it gives me wrong results. I imagine it has something to do with the inner loop using i to initialize j in j = i + 1. I don't know how to approach fixing this, or whether this is even the actual issue. I would appreciate any help. Thank you.
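One likely source of trouble in code like this is that i, j, dx, dy, dz and md are declared outside the parallel region and therefore shared between the threads. Below is a minimal sketch (the function name is illustrative, and it reuses bodies, masses, positions, velocities and vector from the question) that keeps all per-pair temporaries private by declaring them inside the loops, and that performs the whole swap, including the read into temp, inside the critical section. Note that even then the swaps may happen in a different order than in the serial code, so results can still differ.

void resolveCollisions_sketch(int n) {
    // Sketch: loop indices and temporaries are declared inside the loops,
    // so OpenMP treats them as private to each thread.
    #pragma omp parallel for schedule(static) num_threads(n)
    for (int i = 0; i < bodies - 1; i++) {
        for (int j = i + 1; j < bodies; j++) {
            double md = masses[i] + masses[j];
            double dx = fabs(positions[i].x - positions[j].x);
            double dy = fabs(positions[i].y - positions[j].y);
            double dz = fabs(positions[i].z - positions[j].z);
            if (dx < md && dy < md && dz < md) {
                #pragma omp critical
                {
                    // The read of velocities[i] is also inside the critical
                    // section, so the swap is protected as a whole.
                    vector temp = velocities[i];
                    velocities[i] = velocities[j];
                    velocities[j] = temp;
                }
            }
        }
    }
}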

Related

Optimizing a matrix transpose function with OpenMP

I have this code that transposes a matrix using a loop-tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
    int blocksize;
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}
I want to optimize this with multi-threading using OpenMP; however, I am not sure what to do with so many nested for loops. I thought about just adding #pragma omp parallel for, but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free, as in: every iteration writes to a different location. If two iterations (potentially) write to the same location, you need to
1. use a reduction,
2. use a critical section or other synchronization,
3. decide that this loop is not worth parallelizing, or
4. rewrite your algorithm.
In your case, the write location depends on k and l. Since the index is k + l*n with k < n, no two pairs (k,l) and (k',l') write to the same location. Furthermore, no two inner iterations have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
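For reference, a sketch of what that would look like (note: because the bounds of the two inner loops depend on i and j, this is a non-rectangular loop nest, and collapsing all four levels requires an OpenMP 5.0 compliant compiler; with older compilers, collapse(2) as shown below is the safe option):

#pragma omp parallel for collapse(4)   // non-rectangular collapse: OpenMP 5.0+
for (int i = 0; i < n; i += blocksize)
    for (int j = 0; j < m; j += blocksize)
        for (int k = i; k < i + blocksize; ++k)
            for (int l = j; l < j + blocksize; ++l)
                dst[k + l*n] = src[l + k*m];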
You can use the collapse clause to parallelize over two loops.
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < m; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*m];
            }
        }
    }
}
As a side note, I think you should swap the two innermost loops. When you have a choice between writing sequentially and reading sequentially, writing is usually more important for performance.
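A sketch of that swap (same blocked transpose as above, only the two innermost loops exchanged): consecutive iterations now write to consecutive elements of dst, at the cost of strided reads from src.

// transpose the block beginning at [i,j], writing dst sequentially
for (int l = j; l < j + blocksize; ++l) {
    for (int k = i; k < i + blocksize; ++k) {
        dst[k + l*n] = src[l + k*m];   // k is the fastest-varying index of dst
    }
}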
I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple nested loops one can use OpenMP's collapse clause. Bear in mind, however, that:
(As pointed out by Victor Eijkhout.) Even though it does not directly apply to your code snippet, typically, for each additional loop being parallelized one should reason about the new race conditions the parallelization might introduce, e.g., different threads writing concurrently to the same dst position.
In some cases parallelizing nested loops may result in slower execution times than parallelizing a single loop, since the implementation of the collapse clause uses a more complex heuristic than simple loop parallelization to divide the iterations among threads, which can produce an overhead higher than the gains it provides.
You should benchmark with a single parallel loop, then with two, and so on, and compare the results.
void transposer(int n, int m, double *dst, const double *src) {
    int blocksize;
    #pragma omp parallel for collapse(...)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}
Depending on the number of threads, the number of cores, the size of the matrices, and other factors, running sequentially might actually be faster than the parallel versions. This is especially true for your code, which is not very CPU-intensive (i.e., dst[k + l*n] = src[l + k*m];).
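A minimal benchmarking sketch along those lines (the helper and the transposer_c1/transposer_c2 variants are illustrative names, not part of the original code): time each variant with omp_get_wtime() and compare the wall-clock results.

#include <omp.h>
#include <stdio.h>

// Times one call of a transpose variant with the same signature as transposer().
double time_transpose(void (*fn)(int, int, double *, const double *),
                      int n, int m, double *dst, const double *src) {
    double start = omp_get_wtime();
    fn(n, m, dst, src);
    return omp_get_wtime() - start;   // wall-clock seconds
}

// Usage sketch: transposer_c1 / transposer_c2 are hypothetical builds of the
// function above with collapse(1) and collapse(2), respectively.
// printf("collapse(1): %f s\n", time_transpose(transposer_c1, n, m, dst, src));
// printf("collapse(2): %f s\n", time_transpose(transposer_c2, n, m, dst, src));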

What's wrong with the omp declaration here? How to fix it?

Here's my code, which allows different threads to compute conv2d and merge the results back to the result matrix.
#pragma omp parallel private(tid)
float *gptr;
gptr = malloc(M * M * sizeof(float) / NUMTHREADS);
tid = omp_get_thread_num();
#pragma omp for
for (int i = 0; i < M; i++)
{
    for (int j = 0; j < M; j++)
    {
        float tmp = 0.;
        for (int k = 0; k < GW; k++)
        {
            int ii = i + k - W2;
            for (int l = 0; l < GW; l++)
            {
                int jj = j + l - W2;
                if (ii >= 0 && ii < M && jj >= 0 && jj < M)
                {
                    tmp += float_m[k * M + l] * GK[ii * GW + jj];
                }
            }
        }
        *(gptr + (i - tid * M / NUMTHREADS) * M + j) = tmp;
    }
}
But the #pragma omp parallel private(tid) directive didn't work properly. It gives the following error message for the float declaration on the next line:
.\omp.c: In function 'main':
.\omp.c:86:5: error: expected expression before 'float'
 float *gptr;
 ^~~~~
Where did this go wrong, and how do I fix it?
Your parallel region is longer than a single statement, so you have to enclose it in curly braces:
#pragma omp parallel private(tid)
{
//your code
}
UPDATE - a more precise answer with references:
From OpenMP specification the syntax of the parallel construct is as follows:
#pragma omp parallel [clause[ [,] clause] ... ] new-line
structured-block
The structured block is:
an executable statement, possibly compound, with a single
entry at the top and a single exit at the bottom, or an OpenMP
construct.
The definition of compound statement:
A compound statement, or block, is a brace-enclosed sequence of
statements and declarations.
In your code
#pragma omp parallel private(tid)
float *gptr;
float *gptr; is not an executable/compound statement/OpenMP construct, therefore you get an error message. You have to create a compound statement by putting your code between { and }.
I see three problems with your code.
Your immediate problem is that you need curly braces around the material of the parallel region.
Less importantly, consider putting a collapse(2) on the i,j loops.
But most importantly: are you sure that allocating gptr in the parallel region is what you want? It means that each thread creates its own copy, which stays local to the parallel region. You probably want to allocate it outside and pass the pointer in as shared, as sketched below.
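A minimal sketch of how those three points could fit together (it reuses M, GW, W2, float_m and GK from the question and assumes <stdlib.h> and <omp.h> are included): the result buffer is allocated once outside the region and shared, the region body is braced, and the i,j loops are collapsed.

float *gptr = malloc((size_t)M * M * sizeof(float));   // one shared result buffer
#pragma omp parallel shared(gptr)
{
    #pragma omp for collapse(2)
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < M; j++)
        {
            float tmp = 0.f;
            for (int k = 0; k < GW; k++)
            {
                int ii = i + k - W2;
                for (int l = 0; l < GW; l++)
                {
                    int jj = j + l - W2;
                    if (ii >= 0 && ii < M && jj >= 0 && jj < M)
                        tmp += float_m[k * M + l] * GK[ii * GW + jj];
                }
            }
            gptr[i * M + j] = tmp;   // each (i, j) is written by exactly one thread
        }
    }
}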

I am having trouble with OpenMP on C

I want to parallelize the for loops, but I can't seem to grasp the concept: every time I try to parallelize them the code still works, but it slows down dramatically.
for(i=0; i<nbodies; ++i){
    for(j=i+1; j<nbodies; ++j) {
        d2 = 0.0;
        for(k=0; k<3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k]*rij[k];
            if (d2 <= cut2) {
                d = sqrt(d2);
                d3 = d*d2;
                for(k=0; k<3; ++k) {
                    double f = -rij[k]/d3;
                    forces[i][k] += f;
                    forces[j][k] -= f;
                }
                ene += -1.0/d;
            }
        }
    }
}
I tried using synchronization with barriers and critical sections in some cases, but either nothing changes or the processing simply never finishes.
Update: this is the state I'm at right now. It works without crashes, but calculation times worsen the more threads I add (Ryzen 5 2600, 6 cores / 12 threads).
#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
    clock_t begin = clock();
    #pragma omp for schedule(auto)
    for(i=0; i<nbodies; ++i){
        for(j=i+1; j<nbodies; ++j) {
            d2 = 0.0;
            for(k=0; k<3; ++k) {
                rij[k] = pos[i][k] - pos[j][k];
                d2 += rij[k]*rij[k];
            }
            if (d2 <= cut2) {
                d = sqrt(d2);
                d3 = d*d2;
                #pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
                for(k=0; k<3; ++k) {
                    double f = -rij[k]/d3;
                    #pragma omp atomic
                    forces[i][k] += f;
                    #pragma omp atomic
                    forces[j][k] -= f;
                }
                ene += -1.0/d;
            }
        }
    }
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    #pragma omp single
    printf("Calculation time %lf sec\n",time_spent);
}
I incorporated the timer into the actual parallel code (I think it is a few milliseconds faster this way). I also think I got most of the shared and private variables right. The program outputs the forces to a file.
Using barriers or other synchronization will slow down your code unless the amount of unsynchronized work is larger by a good factor, which is not the case here. You probably need to reformulate your code to remove the synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it like that, but it is probably enough if that variable is private to the i,j iterations.
You should always define your variables in their minimal required scope, especially if performance is an issue. (Note that if you do so, your compiler can create more efficient code.) Besides performance, it also helps to avoid data races.
I think you have misplaced a curly brace, and the condition in the first for loop should be i < nbodies - 1. The variable ene can be summed using a reduction, and to avoid data races atomic operations have to be used to update the forces array, so you do not need slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for calculations):
#pragma omp parallel for reduction(+:ene)
for(int i=0; i<nbodies-1; ++i){
    for(int j=i+1; j<nbodies; ++j) {
        double d2 = 0.0;
        double rij[3];
        for(int k=0; k<3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k]*rij[k];
        }
        if (d2 <= cut2) {
            double d = sqrt(d2);
            double d3 = d*d2;
            for(int k=0; k<3; ++k) {
                double f = -rij[k]/d3;
                #pragma omp atomic
                forces[i][k] += f;
                #pragma omp atomic
                forces[j][k] -= f;
            }
            ene += -1.0/d;
        }
    }
}
Solved: turns out all I needed was
#pragma omp parallel for nowait
It doesn't need the atomic either. It's a weird solution and I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.

Parallelize C code for 2D Haar wavelet transform with OpenMP

This is my first question. I'm trying to parallelize a 2D Haar transform function in C with OpenMP. I obtained it here and modified it accordingly.
The program takes a black-and-white image, puts it into a matrix and computes one level of the Haar wavelet transform. In the end it normalizes the values and writes the transformed image to disk.
This is a resulting image (1 level of HDT).
My problem is that the parallelized version runs considerably slower than the serial one.
For now I attach a snippet of the main part I want to parallelize (later on I can post all the surrounding code):
void haar_2d ( int m, int n, double u[] )
// m & n are the dimensions (every image is a perfect square)
// u is the input array in (non column-major!) row-major order
{
    int i;
    int j;
    int k;
    double s;
    double *v;
    int tid, nthreads, chunk;

    s = sqrt ( 2.0 );
    v = ( double * ) malloc ( m * n * sizeof ( double ) );

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < m; i++ )
        {
            v[i+j*m] = u[i+j*m];
        }
    }
    /*
      Determine K, the largest power of 2 such that K <= M.
    */
    k = 1;
    while ( k * 2 <= m )
    {
        k = k * 2;
    }
    /* Transform all columns. */
    while ( n/2 < k ) // just 1 level of transformation
    {
        k = k / 2;
        clock_t begin = clock();
        #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid)
        {
            tid = omp_get_thread_num();
            printf("Thread %d starting...\n",tid);
            #pragma omp for schedule (dynamic)
            for ( j = 0; j < n; j++ )
            {
                for ( i = 0; i < k; i++ )
                {
                    v[i+j*m]   = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                    v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                }
            }
            #pragma omp for schedule (dynamic)
            for ( j = 0; j < n; j++ )
            {
                for ( i = 0; i < 2 * k; i++ )
                {
                    u[i+j*m] = v[i+j*m];
                }
            }
        } // end parallel
        clock_t end = clock();
        double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
        printf ( "Time for COLUMNS: %f ms\n", time_spent * 1000);
    } // end while
    // [...] code for rows
    free ( v );
    return;
}
The timings more or less are:
Time for COLUMNS: 160.519000 ms // parallel
Time for COLUMNS: 62.842000 ms // serial
I have tried to rearrange the pragmas in lots of different ways, e.g., with static scheduling, with sections, tasks and so on, and also rearranging the data scopes of the variables and allocating dynamically inside the parallel regions.
I thought it would be simple to parallelize a two-level for loop, but I have now been struggling for two days. I'm seeking your help; I've already checked out nearly all the related questions here, but I'm still not able to make progress or, at least, understand the reasons. Thank you in advance.
(CPU: Intel Core i3-4005U @ 1.70 GHz, 2 cores / 4 threads)
UPDATE:
1) About m & n: it is supposed to support rectangular images one day too, so I just left both parameters there.
2) I figured out that u is actually a plain array containing a linearized matrix stored row by row (I use PGM images).
3) The memcpy is a better option, so now I'm using it.
As for the main topic, I've tried to divide the work over n by spawning a task for each chunk, and the result is a little bit faster than the serial code.
Now I know that the input matrix u is in proper row-major order and the two for loops seem to proceed accordingly, but I'm not sure about the timings: using both omp_get_wtime() and clock(), I don't know how to measure the speedup. I ran tests with different image sizes, from 16x16 up to 4096x4096, and the parallel version seems to be slower with clock() and faster with omp_get_wtime() and gettimeofday().
Do you have any suggestions on how to handle this correctly with OpenMP, or at least how to measure the speedup correctly?
while ( n/2 < k )
{
    k = k / 2;
    double start_time = omp_get_wtime();
    // clock_t begin = clock();
    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(k)
    {
        nthreads = omp_get_num_threads();
        #pragma omp single
        {
            printf("Number of threads = %d\n", nthreads);
            int chunk = n/nthreads;
            printf("Chunks size = %d\n", chunk);
            printf("Thread %d is starting the tasks.\n", omp_get_thread_num());
            int h;
            for(h=0;h<n;h = h + chunk){
                printf("FOR CYCLE i=%d\n", h);
                #pragma omp task shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(h,k)
                {
                    tid = omp_get_thread_num();
                    printf("Thread %d starts at %d position\n", tid , h);
                    for ( j = h; j < h + chunk; j++ )
                    {
                        for ( i = 0; i < k; i++ )
                        {
                            v[i+j*m]   = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                            v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                        }
                    }
                } // end task
            } // end launching for
            #pragma omp taskwait
        } // end single
    } // end parallel region
    // clock_t end = clock();
    // double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    // printf ( "COLUMNS: %f ms\n", time_spent * 1000);
    double time = omp_get_wtime() - start_time;
    printf ( "COLUMNS: %f ms\n", time*1000);

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < 2 * k; i++ )
        {
            u[i+j*m] = v[i+j*m];
        }
    }
} // end while
I have a few questions that deeply concern me about your code.
m & n are the dimentions (every image is a perfect square)
Then why are there two size parameters?
u is the input array in column-major order
This is an incredibly bad idea. C uses row-major ordering for arrays in memory, so column-major indexing leads to strided memory access. This is very, very bad for performance. If at all possible, you need to fix this.
Because both u and v are linearized matrices, this
for (int j = 0; j < n; j++) {
    for (int i = 0; i < m; i++) {
        v[i + j * m] = u[i + j * m];
    }
}
can be replaced with a call to memcpy:
memcpy(v, u, m * n * sizeof(double));
On to your issue. The reason your OpenMP version is slower is that all of your threads are doing the same thing. This isn't useful and leads to bad things like false sharing. You need to use each thread's ID (tid in your code) to partition the data across the threads, keeping in mind that false sharing is bad.
The problem was that I was using clock() instead of omp_get_wtime(); thanks to Z boson.
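As an illustration of why that matters (a standalone sketch, not code from the original post): clock() measures CPU time, which on POSIX systems is summed over all threads, while omp_get_wtime() measures wall-clock time, which is what speedup should be computed from.

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t c0 = clock();
    double w0 = omp_get_wtime();

    // Some parallel busy work: every thread spins on its own local counter.
    #pragma omp parallel
    {
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; ++i) x += 1.0;
    }

    double cpu_s  = (double)(clock() - c0) / CLOCKS_PER_SEC; // grows with the number of threads
    double wall_s = omp_get_wtime() - w0;                    // roughly constant as threads are added
    printf("CPU time: %f s, wall-clock time: %f s\n", cpu_s, wall_s);
    return 0;
}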

Gaussian elimination in OpenMP - Unable to parallelize

I am trying to parallelise Gaussian elimination with pivoting using OpenMP.
Below is the relevant section of the code that I wrote:
struct timeval tvBegin, tvEnd;
gettimeofday(&tvBegin, NULL);

for (k=1; k<=n-1; ++k) {
    amax = (double) fabs(a[k][k]);
    m = k;
    for (i=k+1; i<=n; i++) {   /* Find the row with largest pivot */
        xfac = (double) fabs(a[i][k]);
        if (xfac > amax) { amax = xfac; m = i; }
    }
    if (m != k) {              /* Row interchanges */
        rowx = rowx+1;
        temp1 = b[k];
        b[k] = b[m];
        b[m] = temp1;
        for (j=k; j<=n; j++) {
            temp = a[k][j];
            a[k][j] = a[m][j];
            a[m][j] = temp;
        }
    }
    #pragma omp parallel for private(i,j)
    for (i=k+1; i<=n; ++i) {
        xfac = a[i][k]/a[k][k];
        for (j=k+1; j<=n; ++j) {
            a[i][j] = a[i][j] - xfac*a[k][j];
        }
        b[i] = b[i] - xfac*b[k];
    }
    matrix_print_off (n, n, a);
}

gettimeofday(&tvEnd, NULL);
printf("\nTime elapsed in ms: %d\n", diff_ms(tvEnd, tvBegin));
I tested this code with a 1000*1000 matrix. The average time taken to run this code (measured via diff_ms) on a 4-core machine comes out the same (2142 ms) as for the sequential version (without pragmas). Since there is plenty of parallelism available here, this shouldn't be the case. Could you please let me know where I went wrong?
For reference, I have also attached the diff_ms function below.
int diff_ms(struct timeval t1, struct timeval t2)
{
    return (((t1.tv_sec - t2.tv_sec) * 1000) +
            (t1.tv_usec - t2.tv_usec)/1000);
}
Thanks!
Inside your parallel section, you have matrix_print_off(). Assuming your print function is thread safe, this will significantly reduce the amount of parallelism you can achieve. Additionally, if matrix_print_off() uses blocking IO, then this function's time may dominate the rest of your function.
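A sketch of what the hot loop looks like with the printing hoisted out (variables and matrix_print_off as in the question; making xfac private is an extra precaution not discussed in the answer): printing the whole matrix in every iteration of the k loop serializes the work behind IO, so one option is to print only once, after the elimination finishes.

for (k = 1; k <= n-1; ++k) {
    /* ... pivot search and row interchange exactly as in the question ... */

    #pragma omp parallel for private(i, j, xfac)
    for (i = k+1; i <= n; ++i) {
        xfac = a[i][k] / a[k][k];
        for (j = k+1; j <= n; ++j) {
            a[i][j] = a[i][j] - xfac * a[k][j];
        }
        b[i] = b[i] - xfac * b[k];
    }
    /* no per-iteration printing here */
}
matrix_print_off(n, n, a);   /* print once, after the elimination and outside the timed loop */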
