Is it possible to evaluate OpenMP overhead cost? - c

I am trying to optimize some image processing code with OpenMP. I am still learning it, and I ran into the classic case of a parallel for loop that is slower than my single-threaded implementation.
The for loop I am trying to parallelize has only 50 iterations, and I saw that this could explain why OpenMP is useless here because of the overhead of the parallel operations.
But is it possible to evaluate these costs? What are the cost differences between the shared / private / firstprivate (...) clauses?
This is the for loop I want to parallelize:
#pragma omp parallel for \
    shared(img_in, img_height, img_width, w, update_gauss, MoGInitparameters, nb_motion_bloc) \
    firstprivate(img_fg, gaussStruct) \
    private(ContrastHisto, j, x, y) \
    schedule(dynamic, 1) \
    num_threads(max_threads)
for(i = 0; i < h; i++){ // h=50
    for(j = 0; j < w; j++){
        ComputeContrastHistogram_generic1D_rect(img_in, H_DESC_STEP*i, W_DESC_STEP*j, ContrastHisto, H_DESC_SIZE, W_DESC_SIZE, 2, img_height, img_width);
        x = i;
        y = w - 1 - j;
        img_fg->data[y][x] = 1 - MatchMoG_GaussianInt(ContrastHisto, gaussStruct->gauss[i*w+j], update_gauss, MoGInitparameters);
        #pragma omp critical
        {
            nb_motion_bloc += img_fg->data[y][x];
            nb_motion_bloc += img_fg->data[y][x];
        }
    }
}
Maybe I am making some mistakes, but if that is the case please tell me why!

Several points that don't address your specific questions.
Try to avoid #pragma omp critical. In this case, you can remove it altogether by adding a reduction(+:nb_motion_bloc) clause to the omp parallel for line and using nb_motion_bloc += 2*img_fg->data[y][x];.
Depending on how much work each iteration has to do, short loops incur (much) more overhead than they're worth.
Now to the questions. If you don't change any of the variables, don't bother classifying them as shared/private/firstprivate. If they are supposed to be used by each thread and discarded, you can use a construct like
#pragma omp parallel
{
    int x, y;
    #pragma omp for
    for(i = 0; i < h; i++)
    {
        ...
    }
}
If the workload is balanced, then consider using schedule(static).
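Putting those points together, a rough sketch of how the loop could look, with the critical section replaced by a reduction, the per-thread temporaries declared inside the parallel region, and a static schedule. This assumes the same variables and helper functions as in your code; the declaration of ContrastHisto is only a placeholder, since its real type isn't shown.
#pragma omp parallel num_threads(max_threads)
{
    int ContrastHisto[H_DESC_SIZE * W_DESC_SIZE];   // placeholder type/size; use your real declaration
    #pragma omp for reduction(+:nb_motion_bloc) schedule(static)
    for(int i = 0; i < h; i++){          // h = 50
        for(int j = 0; j < w; j++){
            ComputeContrastHistogram_generic1D_rect(img_in, H_DESC_STEP*i, W_DESC_STEP*j, ContrastHisto, H_DESC_SIZE, W_DESC_SIZE, 2, img_height, img_width);
            int x = i;
            int y = w - 1 - j;
            img_fg->data[y][x] = 1 - MatchMoG_GaussianInt(ContrastHisto, gaussStruct->gauss[i*w+j], update_gauss, MoGInitparameters);
            nb_motion_bloc += 2 * img_fg->data[y][x];   // reduction replaces the critical section
        }
    }
}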
As to the differences between shared/private/firstprivate, see these two related questions. From Wikipedia:
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
default: allows the programmer to state that the default data scoping within a parallel region will be either shared, or none for C/C++, or shared, firstprivate, private, or none for Fortran. The none option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.
firstprivate: like private except initialized to original value.
lastprivate: like private except original value is updated after construct.
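As a small illustration of how firstprivate and lastprivate behave (a minimal sketch, not from the original answer):
int j = 10, last = -1;
#pragma omp parallel for firstprivate(j) lastprivate(last)
for (int i = 0; i < 4; i++) {
    j += i;        // each thread starts from its own copy of j, initialized to 10
    last = i;      // lastprivate: the value from the sequentially last iteration survives
}
// After the loop: j is still 10 (the firstprivate copies are discarded),
// and last is 3 (copied back from the last iteration).
// With private(j) instead, the per-thread copies of j would start uninitialized.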

Related

OpenMP recursive tasks with shared data results in a huge slowdown

I am trying to implement an n-queens solver with OpenMP tasks. The game board is set up in the main function and I pass it to a function.
So far, what I have is:
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        // #pragma omp critical
        // print_solution(board);
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            #pragma omp task firstprivate(col)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    return SOLUTION_EXISTS;
}
The initial call to this function in the main is:
#pragma omp parallel if(omp_get_num_threads() > 1)
{
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            solve_NQueens(board, 0);
        }
    }
}
Parameters:
// these are global
#define N 14
bool SOLUTION_EXISTS = false;
// these are in main
int board[N][N];
memset(board, 0, sizeof(board));
Compiler:
gcc
Number of threads: 4
I used taskgroup to wait for all tasks before getting the result, and I had to copy the game board for each task (which is expensive when N is set to 14, since there are 356k solutions).
I tried making board firstprivate or private, using taskwait inside and outside of the loop, using taskgroup inside the for loop, and so on. I need some advice to optimize this logic.
Note: putting a taskgroup in the for loop under the if clause also helps, but this is much slower than expected.
First of all, there is a huge issue in your code: solve_NQueens can submit the tasks recursively and return before all the tasks are actually completed. You need to put a synchronization before the return so the value of SOLUTION_EXISTS will be valid (using either a #pragma omp taskwait or a #pragma omp taskgroup).
In terms of performance, there are multiple issues.
The main problem is that too many tasks are created: you create a task in each recursive call. While creating a few tasks brings the needed parallelism, creating too many of them also introduces significant overhead. This overhead can be much higher than the cost of the tail calls. A cut-off strategy can be implemented to reduce the overhead: the general idea is to create tasks only for the first few recursive calls. In your case, you can do it with an if(col < 3) clause at the end of the #pragma omp task. Please note that 3 is an arbitrary value; you may need to tune this threshold.
Moreover, board is copied (twice) during task creation (since it is a static array, and variables referenced in an OpenMP task are firstprivate, hence copied, by default). Your additional copy is not needed, and the line board[i][col] = 0; is useless *if the code is compiled with OpenMP support* (otherwise the pragmas are ignored and this is not true). However, the additional overhead introduced should not be critical if you fix the problem described above.
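A rough sketch of both fixes combined (the synchronization before the return and the if(col < 3) cut-off), keeping the rest of your code unchanged:
bool solve_NQueens(int board[N][N], int col)
{
    if (col == N)
    {
        #pragma omp critical
        SOLUTION_EXISTS = true;
        return true;
    }
    for (int i = 0; i < N; i++)
    {
        if (can_be_placed(board, i, col))
        {
            int new_board[N][N];
            board[i][col] = 1;
            copy(board, new_board);
            // Only create tasks near the root of the recursion; deeper calls run inline.
            #pragma omp task firstprivate(col) if(col < 3)
            solve_NQueens(new_board, col + 1);
            board[i][col] = 0;
        }
    }
    // Wait for the child tasks so SOLUTION_EXISTS is valid before returning it.
    #pragma omp taskwait
    return SOLUTION_EXISTS;
}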

Array operations in a loop parallelization with openMP

I am trying to parallelize for loops that are based on array operations. However, I cannot get the expected speedup, and I suspect the way I parallelize them is wrong.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
}
#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0;i<nx;i++){
    next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i,j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1;i<nx-1;i++){
    for(j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
Is there any problematic mistake in my way? How can I improve this?
In the first example, the work done by each thread is very little and the overhead from the OpenMP runtime negates any speedup from the parallel execution. You may try combining both parallel regions together to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0;i<nx;i++){
    curr[i] = (char*)(curr+nx) + i*ny;
    next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48() uses a global state that is shared between all threads. In single-threaded applications, the state is usually kept in the L1 data cache and there drand48() is really fast. In your case, when one thread updates the state, this change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48(), the state has to be fetched again from memory (or the shared L3 cache). This introduces huge delays and makes drand48() much slower than when used in a single-threaded program. The same applies to the summation in sum, which also computes the wrong value due to data races.
The solution to the first problem is to have a separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You also have to seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution to the data race is to use an OpenMP reduction:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1;i<nx-1;i++){
    for(int j=1;j<ny-1;j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
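For the per-thread PRNG part, a minimal sketch using the POSIX erand48() (declared in <stdlib.h>) with thread-local state; the seeding scheme here is just an example, and real_rand() is replaced inline:
int sum = 0;
double probability = 0.2;
#pragma omp parallel
{
    // Per-thread PRNG state, seeded differently for each thread to avoid correlated streams.
    unsigned short xsubi[3] = { 0x330E, 0xABCD, (unsigned short)omp_get_thread_num() };
    #pragma omp for collapse(2) reduction(+:sum) schedule(static)
    for(int i = 1; i < nx-1; i++){
        for(int j = 1; j < ny-1; j++) {
            curr[i][j] = (erand48(xsubi) < probability);   // erand48 only touches the state we pass in
            sum += curr[i][j];
        }
    }
}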

Specify which positions in an array a thread accesses

I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions each thread is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19 and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80
int main (int argc, char *argv[])
{
    int nthreads = 8, tid, i, base, a[N];
    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        base = ((float)tid/(float)nthreads) * N;
        for (i = 0; i < N/nthreads; i++) {
            a[base + i] = 0;
            printf("%d %d\n", tid, base+i);
        }
    }
    return 0;
}
This program, however, doesn't access all positions, as I expected. The output is different every time I run it, and it might be for example:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.
The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for(i = 0; i < 8; i++) {
    for(j = 0; j < 10; j++) {
        a[10*i+j] = 0;
        printf("thread %d updated element %d\n", omp_get_thread_num(), 10*i+j);
    }
}
I was unable to test this right now, but I'm 90% sure this does exactly what you want (and you have "complete control" over how things work when you do it like this). However, it may not be the most efficient thing to do. For one thing, when you just want to set a bunch of elements to zero, you should use a built-in function like memset, not a loop...
You're missing a fair bit. The directive
#pragma omp parallel
only tells the runtime that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work, your code will need another directive, something like this:
#pragma omp parallel
{
    #pragma omp for
    ...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular of getting wrong matters that the compiler and run-time will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads investigate the schedule clause. I suggest that you start your parallel region something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a)
for (i = 0; i < N; i++) {
    a[i] = 0;
}
Note also that I have specified the accessibility of variables. This is a good practice especially when learning OpenMP. The compiler will make i private automatically.
As I have written it the run-time will divide the iterations over i into chunks, one for each thread. The first thread will get i = 0..N/num_threads, the second i = (N/num_threads)+1..2N/num_threads and so on.
Later you can add a schedule clause explicitly to the directive. What I have written above is equivalent to
#pragma omp parallel for default(none) shared(a) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.
#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you actually try to distribute the work by hand. The reason it does not work is most probably because of race conditions when computing the parameters for the for loop.
If I recall correctly, any variables declared outside of the parallel region are shared among threads. So ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better way is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = 0;
        printf("%d %d\n", tid, i);
    }
}
Note that private(tid) makes a local copy of tid for each thread, so the threads do not overwrite each other's result of omp_get_thread_num(). It would also be possible to declare shared(a), because we want each thread to work on the same copy of the table; that is implicit here. The loop iterator should be private, and I believe the omp for construct takes care of that, though I'm not 100% sure how it behaves in this specific case where the iterator is declared outside the parallel region. You could certainly set it to shared by hand and break things.
EDIT: I noticed the original underlying problem, so I took out the irrelevant parts.

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently been reading up on OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
        disparityThresholdValue = row[j];
        // Don't want to save certain points
        if ( disparityThresholdValue < threshHold)
        {
            // Get the data points
            x = (int)Image.x[k];
            y = (int)Image.y[k];
            z = (int)Image.z[k];
            grayValue = (int)Image.gray[k];
            cloudObject->points[k].x = x;
            cloudObject->points[k].y = y;
            cloudObject->points[k].z = z;
            cloudObject->points[k].grayValue = grayValue;
            fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
        }
    }
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> Open MP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side by side, you won't be able to guarantee order unless you synchronize the access (at which point you can just leave out the parallelization, as it becomes effectively linear). If you need to rely on order, you can parallelize the calculations but need to write the results out in one thread.
If the point data itself is messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
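A rough sketch of how that might look, with the inner directive removed and the per-iteration temporaries declared inside the outer loop body so each thread gets its own copies. Note two assumptions of mine: the element type of row is a guess, and points are indexed across the whole image (i*Image.ncols + j) rather than with the per-row counter k from the original, so that different rows don't write to the same slots; adjust if that's not what you intended.
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
    // Declared inside the loop body: private to the thread running this iteration.
    const short *row = Image.data + i * pixelIncrement;   // element type assumed
    for (int j = 0; j < Image.ncols; j++)                  // plain serial inner loop
    {
        int idx = i * Image.ncols + j;   // image-wide index (assumption, see above)
        int disparityThresholdValue = row[j];
        if ( disparityThresholdValue < threshHold)
        {
            int x = (int)Image.x[idx];
            int y = (int)Image.y[idx];
            int z = (int)Image.z[idx];
            int grayValue = (int)Image.gray[idx];
            cloudObject->points[idx].x = x;
            cloudObject->points[idx].y = y;
            cloudObject->points[idx].z = z;
            cloudObject->points[idx].grayValue = grayValue;
            // fprintf is typically thread-safe, but the output order is not
            // deterministic; see the other answer if ordering matters.
            fprintf( cloudPointsFile, "%d %d %d %d\n", x, y, z, grayValue);
        }
    }
}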
When you have a loop with a nested loop, there is no need for a second omp pragma.
It will already parallelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequential incrementation, so you cannot execute the second loop in a random order. OMP pragmas are a very easy and cool way to parallelize code, but do not use them too much!
More details here -> Parallel Loops with OpenMP

OpenMP - Why is firstprivate causing an error?

Why am I getting this error, and what should I do?
error: firstprivate variable 'j' is private in outer context
void foo() {
    int i;
    int j = 10;
    #pragma omp for firstprivate(j)
    for (i = 0; i < 10; i++)
        printf("%d\n", j);
}
It works if you use the pragma
#pragma omp parallel for firstprivate(j)
Note that omp for and omp parallel for aren't the same thing: the latter is shorthand for an omp for inside an omp parallel.
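For completeness, a minimal corrected version of the function with that change (a sketch; the printf body is just carried over from the question):
#include <stdio.h>

void foo() {
    int i;
    int j = 10;
    // 'parallel for' opens a parallel region and shares the loop iterations;
    // each thread gets its own copy of j, initialized to 10.
    #pragma omp parallel for firstprivate(j)
    for (i = 0; i < 10; i++)
        printf("%d\n", j);
}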
I deleted my first answer because I missed something and it was incorrect. The error is correct because of a restriction in the OpenMP V3.0 spec (and previous versions), section 2.9.3.4 firstprivate clause, Restrictions bullet 2:
• A list item that is private within a parallel region must not appear in a
firstprivate clause on a worksharing construct if any of the worksharing
regions arising from the worksharing construct ever bind to any of the parallel
regions arising from the parallel construct.
The problem is that it doesn't know which thread's private value to use among the threads that are to execute the worksharing region. If it were a new parallel region, then each thread would create a new region and the firstprivate copy would be initialized from the private copy of the thread creating the region.
