OpenMP average of an array

OpenMP average of an array - c

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
double mean = 0;
omp_set_num_threads( 4 );
#pragma omp parallel for reduction(+:mean)
for (int i=0; i<aSize; i++){
mean = mean + mean_array[i];
}
printf("hello %d\n", omp_get_thread_num());
mean = mean/aSize;
return mean;
}
However if I run the code it runs slower than the sequential version. Also for the print statement I get:
hello 0
hello 0
Which doesn't make much sense to me, shouldn't there be 4 hellos?
Any help would be appreciated.

First, the reason why you are not seeing 4 "hello"s, is because the only part of the program which is executed in parallel is the so called parallel region enclosed within an #pragma omp parallel. In your code that is the loop body (since the omp parallel directive is attached to the for statement), the printf is in the sequential part of the program.
rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
#pragma omp for reduction(+:mean)
for (int i=0; i<aSize; i++) {
mean += mean_array[i];
}
mean /= aSize;
printf("hello %d\n", omp_get_thread_num());
}
Second, the fact your program runs slower than the sequential version, it can depend on multiple factors. First of all, you need to make sure the array is large enough so that the overhead of creating those threads (which usually happens when the parallel region is created) is negligible. Also, for small arrays you may be running into "cache false sharing" issues in which threads are competing for the same cache line causing performance degradation.

Related

Array operations in a loop parallelization with openMP

I am trying to parallelize for loops which are based on array operations. However, I cannot get expected speedup. I guess the way of parallelization is wrong in my implementation.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0;i<nx;i++){
curr[i] = (char*)(curr+nx) + i*ny;
}
#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0;i<nx;i++){
next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i,j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1;i<nx-1;i++){
for(j=1;j<ny-1;j++) {
curr[i][j] = (real_rand() < probability);
sum += curr[i][j];
}
}
Is there any problematic mistake in my way? How can I improve this?

In the first example, the work done by each thread is very little and the overhead from the OpenMP runtime is negating and speedup from the parallel execution. You may try combining both parallel regions together to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0;i<nx;i++){
curr[i] = (char*)(curr+nx) + i*ny;
next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48 uses a global state that is shared between all threads. In single-threaded applications, the state is usually kept in the L1 data cache and there drand48 is really fast. In your case, when one thread updates the state, this change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from the memory (or shared L3 cache). This introduces huge delays and makes dran48 much slower than when used in a single-threaded program. The same applies to the summation in sum, which also computes the wrong value due to data races.
The solution to the first problem is to have separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You have to also seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution of the data race is to use OpenMP reductions:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1;i<nx-1;i++){
for(int j=1;j<ny-1;j++) {
curr[i][j] = (real_rand() < probability);
sum += curr[i][j];
}
}

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be ran serially which calls the new_value() function:
for(int i = 0; i < MAX_VAL; i++)
new_value();
This function opens a parallel region on each call:
void new_value()
{
#pragma omp parallel default(shared)
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
}
Which works but there is a significant amount of overhead associated with the spawning and terminating of threads; I was wondering if anyone knew a way to spawn the threads (and attain thread_rank) before entering the loop without parallelising the loop?
There are several questions asking the same thing but they are either wrong or unanswered, examples of which include:
This question which asks a similar thing and the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' said in the answer comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing but the (unticked) answer is just to parallelise the outer-most loop running the loop 4000 * num_threads which is neither what the asker wanted nor what I want.

The answer to your second question is actually correct.
#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
new_value();
void new_value()
{
int thread_rank = omp_get_thread_num();
#pragma omp for schedule(static)
for(int i = 0; i < N; i++)
arr[i] = update(thread_rank);
}
Is correct and exactly what you want. It has the same semantic as the code in your question. The difference is there is only one parallel region and that the loop variable i is now computed by the whole team. Note that the outer loop is not parallelized in a worksharing manner (omp parallel for).
So when this code is run, num_threads threads will execute the loop header once new_value and reach the omp for all with their private i == 0. They will share the work of the inner loop. Then they will wait until everyone completed the loop at an implicit barrier, increment their private i and repeat... I hope it is clear now that this is the same behavior with respect to the inner loop as before, with less thread management overhead.

how to avoid overhead of openMP in nested loops

I have two versions of code that produce equivalent results where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup but I didn't expect a 1-to-1 since I'm trying only to parallelize the inner loop.
My main question is why these two versions have similar runtimes? Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp parallel for private(j) reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
sum = 0;
#pragma omp barrier
#pragma omp for reduction(+:sum)
for(j=0; j<1000; j++){
sum += 1;
}
#pragma omp single
final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?

An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.

You appear to be retracing the steps of Amdahl's Law: It speaks of parallel process vs it's own overhead. One thing that Amadhl found was regardless of how much parallelism you put into a program, it will always have to same speedup to begin with. Parallelism only starts to improve run time/performance when the program requires enough work to compensate the extra processing power.

Specify which positions in an array a thread access

I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions an array is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19 and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80
int main (int argc, char *argv[])
{
int nthreads = 8, tid, i, base, a[N];
#pragma omp parallel
{
tid = omp_get_thread_num();
base = ((float)tid/(float)nthreads) * N;
for (i = 0; i < N/nthreads; i++) {
a[base + i] = 0;
printf("%d %d\n", tid, base+i);
}
}
return 0;
}
This program, however, doesn't access all positions, as I expected. The output is different every time I run it, and it might be for example:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.

The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for(i = 0; i < 8; i++) {
for(j = 0; j < 10; j++) {
a[10*i+j] = 0;
printf("thread %d updated element %d\n", omp_get_thread_num(), 8*i+j);
}
}
I was unable to test this right now but I'm 90% sure this does exactly what you want (and you have "complete control" over how things work when you do it like this). However it may not be the most efficient thing to do. For one thing - when you just want to set a bunch of elements to zero, you want to use a built in function like memset, not a loop...

You're missing a fair bit. The directive
#pragma omp parallel
only tells the run time that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work your code will need another directive, something like this
#pragma omp parallel
{
#pragma omp for
...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular of getting wrong matters that the compiler and run-time will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads investigate the schedule clause. I suggest that you start your parallel region something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a,base,N)
{
for (i = 0; i < N; i++) {
a[base + i] = 0;
}
Note also that I have specified the accessibility of variables. This is a good practice especially when learning OpenMP. The compiler will make i private automatically.
As I have written it the run-time will divide the iterations over i into chunks, one for each thread. The first thread will get i = 0..N/num_threads, the second i = (N/num_threads)+1..2N/num_threads and so on.
Later you can add a schedule clause explicitly to the directive. What I have written above is equivalent to
#pragma omp parallel for default(none) shared(a,N) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a,N) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.

#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you actually try to distribute work by hand. The reason it does not work is most probably becasue of racing conditions on computing the parameters for the for loop.
If I recall properly any variables declared outside of the parallel region are shared among threads. So ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better ways is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
#pramga omp for
for (i = 0; i < N; i++) {
a[i] = 0;
printf("%d %d\n", tid, i);
}
}
Note that private(tid) it makes a local copy of tid for each thread, so they do not overwrite each other on the omp_get_thread_num(). Also it is possible to declare shared(a) because we want each thread to work on the same copy of table. This is implicit now. I believe iterators should be declared private, but I think pragma takes care of it, not 100% how it is this specific case, when its declared outside the parallel region. But I'm sure you can actually set it to shared by hand and mess it up.
EDIT: I noticed original underlying problem so I took out irrelevant parts.

Improving Performance in OpenMP program

Suppose an array arr of SIZE=128Mb with values from 0 to 128Mb-1. Now suppose the following code:
#pragma omp parallel num_threads(NUM_THREADS)
{
int me = omp_get_thread_num();
odds_local[me] = 0;
int count = 0;
#pragma omp for
for (int i = 0; i < SIZE; i++)
if (arr[i]%2 != 0)
count++;
odds_local[me] = count;
}
and finally a loop that iterates over the values of odds_local[me] to get the final result. For this, if I time it and report user time in Linux I get 0.97s for both 1 thread and 2 threads. That is to say, no speedup whatsoever.
Is there anything I should be improving in this program to better the speedup? Thanks.

I ran your exact code and with 1 thread I get 390ms, with 2 I get 190ms. Your problem is not in the code. It has to be something basic. These are the things I can think of:
not linking with OpenMP (with g++ filename -fopenmp);
running on a single core machine;
running on a dual core, with something else occupying the other core;
timing something more than this loop, which is dominating the calculation.