OpenMP parallelization is slower than sequential - C

I implemented Dijkstra's algorithm in C. I'm trying to compare the runtime with and without OpenMP, but for some reason the OpenMP version is always slower. I read something about how expensive creating new threads is, but enlarging the graph does not solve it. I would like to use OpenMP here:
#pragma omp parallel for
for(index=0; index<nodes[n].size; index++){
    int ct = nodes[n].paths[index].connectsTo;
    if(notVisited[ct]){
        int dist = dis[n]+nodes[n].paths[index].weight;
        if(dist<dis[ct]){
            prev[ct] = n;
            dis[ct] = dist;
        }
    }
}

Related

Array operations in a loop parallelization with openMP

I am trying to parallelize for loops that are based on array operations. However, I cannot get the expected speedup; I guess the way I parallelized them is wrong.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
}
#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0;i<nx;i++){
    next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i,j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1;i<nx-1;i++){
    for(j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
Is there any obvious mistake in my approach? How can I improve this?
In the first example, the work done by each thread is very little and the overhead from the OpenMP runtime negates any speedup from the parallel execution. You could try combining both parallel regions to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
    next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere inside real_rand(), and the summation into sum. drand48() uses a global state that is shared between all threads. In a single-threaded application the state usually stays in the L1 data cache, so drand48 is really fast. In your case, when one thread updates the state, the change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from memory (or from the shared L3 cache). This introduces huge delays and makes drand48 much slower than in a single-threaded program. The same applies to the summation into sum, which also computes the wrong value due to data races.
The solution to the first problem is to have a separate PRNG per thread, e.g., use erand48() and pass it a thread-local xsubi state. You also have to seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution to the data race is to use an OpenMP reduction:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1;i<nx-1;i++){
    for(int j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
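For the PRNG part, a minimal sketch of the per-thread erand48() idea could look like the following. This assumes real_rand() can simply be replaced by a direct call to erand48(), that <stdlib.h>, <time.h> and <omp.h> are included, and the seeding scheme is only illustrative:
/* Sketch only: per-thread erand48() state instead of the shared drand48() state. */
int sum = 0;
double probability = 0.2;
#pragma omp parallel reduction(+:sum)
{
    unsigned short xsubi[3];                           /* thread-local PRNG state */
    xsubi[0] = 0x330E;
    xsubi[1] = (unsigned short)omp_get_thread_num();   /* different seed per thread */
    xsubi[2] = (unsigned short)time(NULL);
    #pragma omp for collapse(2) schedule(static)
    for(int i=1;i<nx-1;i++){
        for(int j=1;j<ny-1;j++) {
            curr[i][j] = (erand48(xsubi) < probability);
            sum += curr[i][j];
        }
    }
}
Since each thread now only reads and writes its own xsubi array, the cache-line ping-pong on the shared drand48 state disappears.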

Execute for loop iterations in openmp in order with dynamic schedule

I'd like to run a for loop in OpenMP with a dynamic schedule:
#pragma omp for schedule(dynamic,chunk) private(i) nowait
for(i=0;i<n;i++){
    //loop code here
}
and I'd like each thread to execute ordered chunks, such that e.g.
thread 1 -> iterations 0 to k
thread 2 -> iterations k+1 to k+chunk
etc.
A static schedule partly does what I want, but I'd like to dynamically load-balance the iterations.
Nor does the ordered clause, if I understood correctly what it does.
My question is: how do I make sure that the chunks assigned are ordered chunks?
I am using OpenMP 3.1 with gcc.
You can implement this yourself without resorting to omp for, which is considered a convenience function by expert OpenMP programmers.
The following roughly illustrates what you might do. Please check the arithmetic carefully.
#pragma omp parallel
{
    int me = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int chunk = (n + nt - 1) / nt;   /* ceiling division so every iteration is covered */
    int start = me * chunk;
    int end = (me+1) * chunk;
    if (end > n) end = n;
    for (int i = start; i < end; i++) {
        /* do work */
    }
} /* end parallel */
This does not do any dynamic load-balancing. You can do that yourself by assigning loop iterations unevenly to threads if you know the cost function a priori. You might read up on the inspector-executor model (e.g. 1).
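If you do want dynamic assignment while keeping the chunks ordered, one possibility (a sketch only, not from the original answer, reusing n and chunk from the question) is to let threads grab the next chunk index from a shared counter with an atomic capture, which OpenMP 3.1 supports:
int next_chunk = 0;                  /* shared counter of the next chunk to hand out */
#pragma omp parallel
{
    for (;;) {
        int my_chunk;
        #pragma omp atomic capture
        my_chunk = next_chunk++;     /* grab the next chunk index, in order of request */
        int start = my_chunk * chunk;
        if (start >= n) break;       /* no more work */
        int end = start + chunk;
        if (end > n) end = n;
        for (int i = start; i < end; i++) {
            /* do work */
        }
    }
}
Each chunk still covers a contiguous, increasing range of iterations, but which thread gets it is decided at run time, so faster threads simply pick up more chunks.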

Splitting OpenMP threads on unbalanced tree

I am trying to make tree operations, like summing up the numbers in all the leaves of a tree, work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (the number of children varies, and so does the size of each branch).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1) Split the threads at the first possible opportunity, say at a node with 2 children
2) Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work
It would look like this:
if (node->depth <= 3) {
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < node->children_no; i++) {
            int local_sum;
            local_sum = sum_numbers(node->children[i]);
            #pragma omp critical
            {
                global_sum += local_sum;
            }
        }
    }
} else {
    /* run the for loop without a parallel region */
}
The problem here is that when I allow nested parallelism, OpenMP seems to create a lot of threads in new teams. What I would like to achieve is this:
1) Every thread creating a new team can't take more threads than MAX_THREADS
2) Once a for loop is over in one subtree, the for loops still running in bigger subtrees take over the now-idle threads to finish their job faster
That way I hope there are never more threads than necessary, but they are all kept busy as long as there are more unfinished tasks across all the for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in the parallel region. Is it possible to make this work as described, or do I need to change the implementation to first list the tasks from the various branches and then run a parallel for loop over that list?
Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment I agree with, too). Using OpenMP tasks here adds flexibility to the parallelism even if the tree is unbalanced, supports recursion and spawns enough work for all the threads (although you should explore this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like this:
if (node->depth <= 3) {
    #pragma omp parallel shared(global_sum)
    #pragma omp single
    for (int i = 0; i < node->children_no; i++) {
        #pragma omp task
        {
            int local_sum = sum_numbers(node->children[i]);
            #pragma omp critical
            global_sum += local_sum;
        }
    }
} else {
    /* run the for loop without a parallel region */
}
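Since the whole point of tasks is that they handle recursion naturally, the recursive function itself can also spawn tasks. The following is only a hedged sketch: the function name sum_numbers_task, the value field and the depth cutoff are assumptions, not from the original post; children_no, children and depth are taken from the question.
/* Sketch of a task-recursive tree sum; struct fields and the cutoff are assumed. */
int sum_numbers_task(struct node *n) {
    int sum = n->value;                              /* value stored at this node (assumption) */
    for (int i = 0; i < n->children_no; i++) {
        #pragma omp task shared(sum) if (n->depth <= 3)   /* only spawn tasks near the top of the tree */
        {
            int child = sum_numbers_task(n->children[i]);
            #pragma omp atomic
            sum += child;
        }
    }
    #pragma omp taskwait                             /* wait for the child tasks before returning */
    return sum;
}
/* Called once from a parallel region, e.g.:
   #pragma omp parallel
   #pragma omp single
   total = sum_numbers_task(root);
*/
The idle threads of the team then steal tasks from whichever subtrees still have work, which is exactly the load-balancing behaviour asked for.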

Sum of an array element using OpenMP

I am trying to parallelize the following nested "for loops" (in C) using OpenMP.
for (dt = 0; dt <= maxdt; dt++) {
    for (t0 = 0; t0 <= nframes-dt; t0++) {
        for (i=0; i<natoms; i++) {
            VAC[dt] = VAC[dt] + dot_product(vect[t0][i],vect[t0+dt][i]);
        }
    }
}
Basically this calculates an auto-correlation function of a time-dependent vector (vect). I need the VAC array as the final output, computed using OpenMP.
I have tried the reduction approach of OpenMP by adding the following line above the innermost loop (for (i=0; i<natoms; i++)):
#pragma omp parallel for default(shared) private(i,axis) schedule(guided) reduction(+: VAC[dt])
But this does not work, since the reduction clause does not accept an array element like VAC[dt]. What would be the best and most efficient way to parallelize such code? Thanks.
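One possible approach (a sketch only, not an answer from the original thread): for a fixed dt, VAC[dt] is a single scalar, so the two inner loops can use an ordinary scalar reduction and the result can be stored afterwards. The element type of VAC and the return type of dot_product are assumed to be double here.
for (int dt = 0; dt <= maxdt; dt++) {
    double acc = 0.0;                          /* scalar accumulator for this dt */
    #pragma omp parallel for collapse(2) reduction(+:acc) schedule(static)
    for (int t0 = 0; t0 <= nframes-dt; t0++) {
        for (int i = 0; i < natoms; i++) {
            acc += dot_product(vect[t0][i], vect[t0+dt][i]);
        }
    }
    VAC[dt] += acc;                            /* assumes VAC was zero-initialized */
}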

Trouble with nested loops and openmp

I am having trouble applying OpenMP to a nested loop like this:
#pragma omp parallel shared(S2,nthreads,chunk) private(a,b,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("\nNumber of threads = %d\n", nthreads);
    }
    #pragma omp for schedule(dynamic,chunk)
    for(a=0;a<NREC;a++){
        for(b=0;b<NLIG;b++){
            S2=S2+cos(1+sin(atan(sin(sqrt(a*2+b*5)+cos(a)+sqrt(b)))));
        }
    } // end for a
} /* end of parallel section */
When I compare the serial version with the OpenMP version, the latter gives weird results. Even when I remove #pragma omp for, the results from OpenMP are not correct. Do you know why, or can you point to a good tutorial that covers double loops and OpenMP explicitly?
This is a classic example of a race condition. Each of your OpenMP threads is accessing and updating a shared value at the same time, and there's no guarantee that some of the updates won't get lost (at best) or that the resulting answer won't be gibberish (at worst).
The thing with race conditions is that they depend sensitively on timing; in a smaller case (e.g., with smaller NREC and NLIG) you might sometimes miss this, but in a larger case it will eventually always come up.
The reason you get wrong answers without the #pragma omp for is that as soon as you enter the parallel region, all of your OpenMP threads start; and unless you use something like omp for (a so-called worksharing construct) to split up the work, each thread does everything in the parallel section, so all the threads compute the same entire sum, all updating S2 simultaneously.
You have to be careful with OpenMP threads updating shared variables. OpenMP has atomic operations that let you safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order that it's hard to see what's going on, so I've changed your sum somewhat). In mysumallatomic, each thread updates S2 as before, but this time it's done safely:
#include <omp.h>
#include <math.h>
#include <stdio.h>

double mysumorig() {
    double S2 = 0;
    int a, b;
    for(a=0;a<128;a++){
        for(b=0;b<128;b++){
            S2=S2+a*b;
        }
    }
    return S2;
}

double mysumallatomic() {
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

double mysumonceatomic() {
    double S2 = 0.;
    #pragma omp parallel shared(S2)
    {
        double mysum = 0.;
        #pragma omp for
        for(int a=0; a<128; a++){
            for(int b=0; b<128;b++){
                mysum += (double)a*b;
            }
        }
        #pragma omp atomic
        S2 += mysum;
    }
    return S2;
}

int main() {
    printf("(Serial) S2 = %f\n", mysumorig());
    printf("(All Atomic) S2 = %f\n", mysumallatomic());
    printf("(Atomic Once) S2 = %f\n", mysumonceatomic());
    return 0;
}
However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent parallel operation on the variable S2!), so a better approach is to accumulate into a private variable and do the atomic operation only once, after the loop, rather than 128*128 times; that's the mysumonceatomic() routine, which incurs the synchronization overhead only once per thread rather than 16k times per thread.
But this is such a common operation that there's no need to implement it yourself. You can use OpenMP's built-in functionality for reduction operations (a reduction is an operation like computing the sum of a list, or finding its min or max, which can be done one element at a time by looking only at the result so far and the next element), as suggested by @ejd. The OpenMP reduction will work and is faster (its optimized implementation is much faster than what you can do on your own with other OpenMP operations).
As you can see, either approach works:
$ ./foo
(Serial) S2 = 66064384.000000
(All Atomic) S2 = 66064384.000000
(Atomic Once) S2 = 66064384.000000
The problem isn't with double loops but with variable S2. Try putting a reduction clause on your for directive:
#pragma omp for schedule(dynamic,chunk) reduction(+:S2)
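For completeness, a minimal sketch of what such a reduction-based routine might look like for the example above (the function name mysumreduction is illustrative, not from either answer):
double mysumreduction() {
    double S2 = 0.;
    #pragma omp parallel for reduction(+:S2)   /* each thread gets a private S2, combined at the end */
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            S2 += (double)a*b;
        }
    }
    return S2;
}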
