Can the producer/consumer (bounded-buffer) problem be sped up with OpenMP? - C

I'm implementing the producer/consumer problem for homework, and I have to compare the sequential algorithm with the parallel one, and my parallel one seems to only be able to run either at the same speed or slower than the sequential one. I've come to the conclusion that using a queue is a limiting factor and it won't speed up my algorithm.
Is this the case or am I just coding it wrong?
int main() {
    long sum = 0;

    unsigned long serial = ::GetTickCount();
    for(int i = 0; i < test; i++){
        enqueue(rand()%54354);
        sum += dequeue();
    }
    printf("%d \n",sum);
    serial = (::GetTickCount() - serial);
    printf("Serial Program took: %f seconds\n", serial * .001);

    sum = 0;
    unsigned long omp = ::GetTickCount();
    #pragma omp parallel for num_threads(128) default(shared)
    for(int i = 0; i < test; i++){
        enqueue(rand()%54354);
        sum += dequeue();
    }
    #pragma omp barrier //joins all threads
    omp = (::GetTickCount() - omp);
    printf("%d \n",sum);
    printf("OpenMP Program took: %f seconds\n", omp * .001);
    getchar();
}

Problem #1:
You have rand() inside the parallel region.
rand() is not thread-safe. It uses global/static variables. So calling it concurrently from multiple threads will lead to unexpected (possibly undefined) behavior.
That aside, the data-races resulting from concurrent calls to rand() will lead to a lot of cache coherency stalls. This is likely the source of the slowdown.
Problem #2:
Are enqueue() and dequeue() thread-safe?
If it isn't, then you need to fix that first. If it is, how are you synchronizing it?
If it's just a critical region that allows only one thread at a time to access the queue, then that kind of defeats the whole purpose of parallelism.
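For illustration only (your queue implementation isn't shown, so the enqueue_safe/dequeue_safe wrappers below are hypothetical): protecting the queue with a named critical section does make the calls safe, but it also serializes every queue operation, which is exactly why such a loop can't run faster than the serial version.

void enqueue_safe(int value)
{
    /* only one thread at a time may touch the queue */
    #pragma omp critical (queue_lock)
    enqueue(value);
}

int dequeue_safe(void)
{
    int value;
    #pragma omp critical (queue_lock)
    value = dequeue();
    return value;
}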
Problem #3:
This line modifies the sum variable in each iteration:
sum += dequeue();
Note that all the threads will be doing this concurrently. So you need to declare sum as a reduction variable.
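Putting the pieces together, here is a minimal sketch of what the corrected loop could look like (assuming enqueue()/dequeue() have been made thread-safe; rand_r() is the POSIX re-entrant replacement for rand(), and the thread count and seed values are arbitrary choices):

long sum = 0;
#pragma omp parallel num_threads(4)
{
    /* per-thread PRNG state, seeded differently for each thread */
    unsigned int seed = 42u + omp_get_thread_num();
    #pragma omp for reduction(+:sum)
    for(int i = 0; i < test; i++){
        enqueue(rand_r(&seed) % 54354);
        sum += dequeue();   /* each thread accumulates its private copy of sum */
    }
}
printf("%ld \n", sum);

Even with these fixes, expect limited scaling: every iteration still funnels through the shared queue.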

Related

Array operations in a loop parallelization with OpenMP

I am trying to parallelize for loops which are based on array operations. However, I cannot get the expected speedup. I guess the way I parallelize them in my implementation is wrong.
Here is one example:
curr = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);
next = (char**)malloc(sizeof(char*)*nx + sizeof(char)*nx*ny);

int i;
#pragma omp parallel for shared(nx,ny) firstprivate(curr) schedule(static)
for(i=0; i<nx; i++){
    curr[i] = (char*)(curr+nx) + i*ny;
}

#pragma omp parallel for shared(nx,ny) firstprivate(next) schedule(static)
for(i=0; i<nx; i++){
    next[i] = (char*)(next+nx) + i*ny;
}
And here is another:
int i, j, sum = 0, probability = 0.2;
#pragma omp parallel for collapse(2) firstprivate(curr) schedule(static)
for(i=1; i<nx-1; i++){
    for(j=1; j<ny-1; j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
Is there any problematic mistake in my way? How can I improve this?
In the first example, the work done by each thread is very little, and the overhead from the OpenMP runtime negates any speedup from the parallel execution. You may try combining both parallel regions to reduce the overhead, but it won't help much:
#pragma omp parallel for schedule(static)
for(int i=0; i<nx; i++){
    curr[i] = (char*)(curr+nx) + i*ny;
    next[i] = (char*)(next+nx) + i*ny;
}
In the second case, the bottleneck is the call to drand48(), buried somewhere in the call to real_rand(), and the summation. drand48 uses a global state that is shared between all threads. In a single-threaded application the state usually stays in the L1 data cache, and there drand48 is really fast. In your case, when one thread updates the state, the change propagates to the other cores and invalidates their caches. Consequently, when the other threads call drand48, the state has to be fetched again from memory (or the shared L3 cache). This introduces huge delays and makes drand48 much slower than when used in a single-threaded program. The same applies to the summation into sum, which also computes the wrong value due to data races.
The solution to the first problem is to have a separate PRNG per thread, e.g., use erand48() and pass a thread-local value for xsubi. You also have to seed each PRNG with a different value to avoid correlated pseudorandom streams. The solution to the data race is to use an OpenMP reduction:
int sum = 0;
double probability = 0.2;
#pragma omp parallel for collapse(2) reduction(+:sum) schedule(static)
for(int i=1; i<nx-1; i++){
    for(int j=1; j<ny-1; j++) {
        curr[i][j] = (real_rand() < probability);
        sum += curr[i][j];
    }
}
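To also address the PRNG bottleneck, here is a sketch of the per-thread erand48() approach described above (assuming nx, ny, and curr are in scope as in the question; erand48() is declared in <stdlib.h>, and the seed values below are arbitrary):

int sum = 0;
double probability = 0.2;
#pragma omp parallel
{
    /* thread-private 48-bit PRNG state, seeded differently per thread */
    unsigned short xsubi[3];
    xsubi[0] = 0x330E;
    xsubi[1] = (unsigned short)omp_get_thread_num();
    xsubi[2] = (unsigned short)(omp_get_thread_num() * 7919);

    #pragma omp for collapse(2) reduction(+:sum) schedule(static)
    for(int i=1; i<nx-1; i++){
        for(int j=1; j<ny-1; j++) {
            curr[i][j] = (erand48(xsubi) < probability);
            sum += curr[i][j];
        }
    }
}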

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.
I have a for-loop which must be run serially and which calls the new_value() function:
for(int i = 0; i < MAX_VAL; i++)
    new_value();
This function opens a parallel region on each call:
void new_value()
{
    #pragma omp parallel default(shared)
    {
        int thread_rank = omp_get_thread_num();
        #pragma omp for schedule(static)
        for(int i = 0; i < N; i++)
            arr[i] = update(thread_rank);
    }
}
This works, but there is a significant amount of overhead associated with spawning and terminating the threads; I was wondering if anyone knew a way to spawn the threads (and obtain thread_rank) before entering the loop, without parallelising the loop itself?
There are several questions asking the same thing but they are either wrong or unanswered, examples of which include:
This question asks a similar thing, and the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' said in the answer comments, this does not work. I can confirm that the program just hangs.
This question asks the exact same thing, but the (unaccepted) answer is just to parallelise the outer-most loop, running it 4000 * num_threads times, which is neither what the asker wanted nor what I want.
The answer to your second question is actually correct.
#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
    new_value();

void new_value()
{
    int thread_rank = omp_get_thread_num();

    #pragma omp for schedule(static)
    for(int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}
This is correct and exactly what you want. It has the same semantics as the code in your question. The difference is that there is only one parallel region and that the outer loop variable i is now computed by the whole team. Note that the outer loop is not parallelized in a worksharing manner (there is no omp parallel for).
So when this code is run, num_threads threads will each execute the loop header, call new_value, and reach the omp for, all with their private i == 0. They will share the work of the inner loop, then wait at its implicit barrier until everyone has completed the loop, increment their private i, and repeat... I hope it is clear now that this is the same behavior with respect to the inner loop as before, with less thread-management overhead.
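For reference, here is a minimal self-contained sketch of the whole pattern (N, MAX_VAL, arr, and update() are stand-ins for the asker's actual code):

#include <omp.h>

#define N       1000
#define MAX_VAL 100

static double arr[N];

/* placeholder for the real update() */
static double update(int thread_rank) { return thread_rank; }

static void new_value(void)
{
    int thread_rank = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
    /* the implicit barrier of the omp for keeps the outer iterations in lockstep */
}

int main(void)
{
    /* the team is created once; every thread runs the outer loop itself */
    #pragma omp parallel
    for (int i = 0; i < MAX_VAL; i++)
        new_value();
    return 0;
}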

Reductions in parallel in logarithmic time

Given n partial sums, it's possible to sum all the partial sums in log2(n) parallel steps. For example, assume there are eight threads with eight partial sums: s0, s1, s2, s3, s4, s5, s6, s7. These could be reduced in log2(8) = 3 sequential steps like this:
thread0     thread1     thread2     thread3
s0 += s1    s2 += s3    s4 += s5    s6 += s7
s0 += s2                s4 += s6
s0 += s4
I would like to do this with OpenMP but I don't want to use OpenMP's reduction clause. I have come up with a solution but I think a better solution can be found maybe using OpenMP's task clause.
This is more general than scalar addition. Let me choose a more useful case: an array reduction (see here, here, and here for more about array reductions).
Let's say I want to do an array reduction on an array a. Here is some code which fills private arrays in parallel for each thread.
int bins = 20;
int a[bins];
int **at;  // array of pointers to arrays
for(int i = 0; i<bins; i++) a[i] = 0;

#pragma omp parallel
{
    #pragma omp single
    at = (int**)malloc(sizeof *at * omp_get_num_threads());
    at[omp_get_thread_num()] = (int*)malloc(sizeof **at * bins);
    int a_private[bins];
    //arbitrary function to fill the arrays for each thread
    for(int i = 0; i<bins; i++) at[omp_get_thread_num()][i] = i + omp_get_thread_num();
}
At this point I have an array of pointers to arrays, one per thread. Now I want to add all these arrays together and write the final sum to a. Here is the solution I came up with.
#pragma omp parallel
{
    int n = omp_get_num_threads();
    for(int m=1; n>1; m*=2) {
        int c = n%2;
        n /= 2;
        #pragma omp for
        for(int i = 0; i<n; i++) {
            int *p1 = at[2*i*m], *p2 = at[2*i*m+m];
            for(int j = 0; j<bins; j++) p1[j] += p2[j];
        }
        n += c;
    }
    #pragma omp single
    memcpy(a, at[0], sizeof *a * bins);
    free(at[omp_get_thread_num()]);
    #pragma omp single
    free(at);
}
Let me try and explain what this code does. Let's assume there are eight threads. Let's define the += operator to mean an element-wise sum over the array, e.g. s0 += s1 is
for(int i=0; i<bins; i++) s0[i] += s1[i]
then this code would do
n   thread0     thread1     thread2     thread3
4   s0 += s1    s2 += s3    s4 += s5    s6 += s7
2   s0 += s2                s4 += s6
1   s0 += s4
But this code is not as ideal as I would like.
One problem is that there are a few implicit barriers which require all the threads to sync. These barriers should not be necessary. The first barrier is between filling the arrays and doing the reduction. The second barrier is in the #pragma omp for declaration in the reduction. But I can't use the nowait clause with this method to remove the barrier.
Another problem is that several threads don't need to be used. For example, with eight threads, the first step in the reduction only needs four threads, the second step two threads, and the last step only one thread. However, this method involves all eight threads in the reduction. Although the other threads don't do much anyway and should go right to the barrier and wait, it's probably not much of an issue.
My instinct is that a better method can be found using the omp task clause. Unfortunately I have little experience with the task clause, and all my efforts so far to do a better reduction with it than what I have now have failed.
Can someone suggest a better solution to do the reduction in logarithmic time using e.g. OpenMP's task clause?
I found a method which solves the barrier problem and reduces asynchronously. The only remaining problem is that it still puts threads which don't participate in the reduction into a busy loop. This method uses something like a stack: pointers are pushed onto the stack (but never popped) inside critical sections (this was one of the keys, as critical sections don't have implicit barriers). The stack is operated on serially, but the reduction runs in parallel.
Here is a working example.
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void foo6() {
    int nthreads = 13;
    omp_set_num_threads(nthreads);
    int bins = 21;
    int a[bins];
    int **at;
    int m = 0;
    int nsums = 0;
    for(int i = 0; i<bins; i++) a[i] = 0;
    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int ithread = omp_get_thread_num();
        #pragma omp single
        at = (int**)malloc(sizeof *at * n * 2);
        int* a_private = (int*)malloc(sizeof *a_private * bins);

        //arbitrary fill function
        for(int i = 0; i<bins; i++) a_private[i] = i + omp_get_thread_num();

        #pragma omp critical (stack_section)
        at[nsums++] = a_private;

        while(nsums<2*n-2) {
            int *p1, *p2;
            char pop = 0;
            #pragma omp critical (stack_section)
            if((nsums-m)>1) p1 = at[m], p2 = at[m+1], m += 2, pop = 1;
            if(pop) {
                for(int i = 0; i<bins; i++) p1[i] += p2[i];
                #pragma omp critical (stack_section)
                at[nsums++] = p1;
            }
        }

        #pragma omp barrier
        #pragma omp single
        memcpy(a, at[2*n-2], sizeof **at * bins);
        free(a_private);
        #pragma omp single
        free(at);
    }
    for(int i = 0; i<bins; i++) printf("%d ", a[i]); puts("");
    for(int i = 0; i<bins; i++) printf("%d ", (nthreads-1)*nthreads/2 + nthreads*i); puts("");
}

int main(void) {
    foo6();
}
I still feel a better method may be found using tasks, one which does not put the threads that are not being used in a busy loop.
Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.
#include <assert.h>
#include <stddef.h>

void operation(int* p1, int* p2, size_t bins)
{
    for (int i = 0; i < bins; i++)
        p1[i] += p2[i];
}

void reduce(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1) {
        return;
    }
    int pivot = (begin + end) / 2;
    /* Moving the termination condition here will avoid very short tasks,
     * but make the code less nice. */
    #pragma omp task
    reduce(arrs, bins, begin, pivot);
    #pragma omp task
    reduce(arrs, bins, pivot, end);
    #pragma omp taskwait
    /* now begin and pivot contain the partial sums. */
    operation(arrs[begin], arrs[pivot], bins);
}

/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);
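A fuller (hypothetical) call site, showing where n comes from and why a barrier is needed after filling the per-thread arrays, could look like this sketch (at and bins declared beforehand as in the question):

#pragma omp parallel
{
    int n   = omp_get_num_threads();
    int tid = omp_get_thread_num();

    #pragma omp single
    at = (int**)malloc(sizeof *at * n);   /* implicit barrier after the single */

    at[tid] = (int*)malloc(sizeof **at * bins);
    for (int i = 0; i < bins; i++)
        at[tid][i] = i + tid;             /* arbitrary fill */

    #pragma omp barrier                   /* every at[t] must be filled */
    #pragma omp single
    reduce(at, bins, 0, n);               /* waiting threads execute the tasks;
                                             the result ends up in at[0] */
}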
As far as I can tell, there are no unnecessary synchronizations and there is no weird polling on critical sections. It also works naturally with a data size different from the number of ranks. I find it very clean and easy to understand. So I do indeed think this is better than both of your solutions.
But let's look at how it performs in practice*. For that we can use Score-P and Vampir:
*bins=10000 so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo. gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing
The picture reveals what is happening at any thread within the application on a horizontal time-axis. The three implementations, from top to bottom:
omp for loop
omp critical kind of tasking
omp task
This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: does the core have this data in its own cache / the memory of its own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in the simple comparison.
If the reduction operation becomes variable in runtime, then the task solution will become better than the synchronized for loop.
The critical solution has an interesting aspect: the passive threads are not idly waiting, so they are more likely to consume CPU resources. This can be bad for performance, e.g. in the case of turbo mode.
Remember that the task solution has more optimization potential by avoiding spawning tasks that immediately return. How these solutions perform also highly depends on the specific OpenMP runtime. Intel's runtime seems to do much worse for tasks.
My recommendation is:
Implement the most maintainable solution with optimal algorithmic complexity
Measure which parts of the code actually matter for run-time
Analyze, based on actual measurements, what the bottleneck is. In my experience it is more about NUMA and scheduling than about some unnecessary barrier.
Perform the micro-optimization based on your actual measurements
Linear solution
Here is the timeline for the linear proccess_data_v1 from this question.
OpenMP 4 Reduction
So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I do initialize the worker array with NULL and simply move the pointer the first time:
void meta_op(int** pp1, int* p2, size_t bins)
{
    if (*pp1 == NULL) {
        *pp1 = p2;
        return;
    }
    operation(*pp1, p2, bins);
}

// ...

// declare before parallel region as global
int* awork = NULL;
#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)

#pragma omp for reduction(merge : awork)
for (int t = 0; t < n; t++) {
    meta_op(&awork, at[t], bins);
}
Surprisingly, this doesn't look too good:
top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.
Both seem to implement the reduction in a serialized manner. I tried to look into gcc / libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to be wrapping the final merge in a GOMP_atomic_start/end - and that seems to be a global mutex. Similarly icc wraps the call to the operation in a kmpc_critical. I suppose there wasn't much optimization going into costly custom reduction operations. A traditional reduction can be done with a hardware-supported atomic operation.
Notice how each operation is faster because the input is cached locally, but due to the serialization it is overall slower. Again this is not a perfect comparison due to the high variances, and the earlier screenshots were taken with a different gcc version. But the trend is clear, and I also have data on the cache effects.

OpenMP average of an array

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
    double mean = 0;
    omp_set_num_threads(4);
    #pragma omp parallel for reduction(+:mean)
    for (int i=0; i<aSize; i++){
        mean = mean + mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
    mean = mean/aSize;
    return mean;
}
However, if I run the code, it runs slower than the sequential version. Also, for the print statement I get:
hello 0
hello 0
This doesn't make much sense to me; shouldn't there be 4 hellos?
Any help would be appreciated.
First, the reason why you are not seeing 4 "hello"s is that the only part of the program which is executed in parallel is the so-called parallel region enclosed within a #pragma omp parallel. In your code that is the for loop (since the omp parallel directive is attached to the for statement); the printf is in the sequential part of the program.
Rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp for reduction(+:mean)
    for (int i=0; i<aSize; i++) {
        mean += mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
}
mean /= aSize;
Second, the fact that your program runs slower than the sequential version can depend on multiple factors. First of all, you need to make sure the array is large enough so that the overhead of creating the threads (which usually happens when the parallel region is created) is negligible. Also, for small arrays you may be running into false sharing issues, in which threads compete for the same cache line, causing performance degradation.

Trouble with nested loops and openmp

I am having trouble applying OpenMP to a nested loop like this:
#pragma omp parallel shared(S2,nthreads,chunk) private(a,b,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("\nNumber of threads = %d\n", nthreads);
    }
    #pragma omp for schedule(dynamic,chunk)
    for(a=0; a<NREC; a++){
        for(b=0; b<NLIG; b++){
            S2 = S2 + cos(1+sin(atan(sin(sqrt(a*2+b*5)+cos(a)+sqrt(b)))));
        }
    } // end for a
} /* end of parallel section */
When I compare the serial version with the OpenMP version, the latter gives weird results. Even when I remove #pragma omp for, the results from OpenMP are not correct. Do you know why, or can you point to a good tutorial that is explicit about double loops and OpenMP?
This is a classic example of a race condition. Each of your OpenMP threads is accessing and updating a shared value at the same time, and there's no guarantee that some of the updates won't get lost (at best) or that the resulting answer won't be gibberish (at worst).
The thing with race conditions is that they depend sensitively on the timing; in a smaller case (e.g., with smaller NREC and NLIG) you might sometimes miss this, but in a larger case it'll eventually always come up.
The reason you get wrong answers without the #pragma omp for is that as soon as you enter the parallel region, all of your OpenMP threads start; and unless you use something like an omp for (a so-called worksharing construct) to split up the work, each thread will do everything in the parallel section - so all the threads will be doing the same entire sum, all updating S2 simultaneously.
You have to be careful with OpenMP threads updating shared variables. OpenMP has atomic operations to allow you to safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order that it's hard to see what's going on, so I've changed your sum somewhat). In mysumallatomic, each thread updates S2 as before, but this time it's done safely:
#include <omp.h>
#include <math.h>
#include <stdio.h>

double mysumorig() {
    double S2 = 0;
    int a, b;
    for(a=0; a<128; a++){
        for(b=0; b<128; b++){
            S2 = S2 + a*b;
        }
    }
    return S2;
}

double mysumallatomic() {
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128; b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

double mysumonceatomic() {
    double S2 = 0.;
    #pragma omp parallel shared(S2)
    {
        double mysum = 0.;
        #pragma omp for
        for(int a=0; a<128; a++){
            for(int b=0; b<128; b++){
                mysum += (double)a*b;
            }
        }
        #pragma omp atomic
        S2 += mysum;
    }
    return S2;
}

int main() {
    printf("(Serial) S2 = %f\n", mysumorig());
    printf("(All Atomic) S2 = %f\n", mysumallatomic());
    printf("(Atomic Once) S2 = %f\n", mysumonceatomic());
    return 0;
}
However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent parallel operation on the variable S2!), so a better approach is to do the partial summations privately and only do the atomic operation once, after the summation, rather than doing it 128*128 times; that's the mysumonceatomic() routine, which incurs the synchronization overhead only once per thread rather than 16k times per thread.
But this is such a common operation that there's no need to implement it yourself. One can use OpenMP's built-in functionality for reduction operations (a reduction is an operation like calculating the sum of a list, or finding the min or max of a list, which can be done one element at a time by looking only at the result so far and the next element), as suggested by @ejd. The OpenMP reduction will work and is faster (its optimized implementation is much faster than what you can do on your own with other OpenMP operations).
As you can see, either approach works:
$ ./foo
(Serial) S2 = 66064384.000000
(All Atomic) S2 = 66064384.000000
(Atomic Once) S2 = 66064384.000000
The problem isn't with double loops but with variable S2. Try putting a reduction clause on your for directive:
#pragma omp for schedule(dynamic,chunk) reduction(+:S2)
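For completeness, a reduction-based version of the simplified a*b sum used in the functions above might look like this sketch:

double mysumreduction() {
    double S2 = 0.;
    /* each thread sums into a private S2; OpenMP combines them at the end */
    #pragma omp parallel for reduction(+:S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128; b++){
            S2 += (double)a*b;
        }
    }
    return S2;
}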
