OpenMP reduction does not provide the same answer as the sequential method - C

I am trying to parallelize a vector dot product program using OpenMP. The following code shows what I did.
#define N 1000000
float dotProduct = 0;
float vector1Host[N], vector2Host[N]; // each element in the vectors is initialized to a value between 1 and 2
#pragma omp parallel for private(i) reduction(+:dotProduct)
for (i = 0; i < N; i++)
dotProduct += vector1Host[i] * vector2Host[i];
The answer I get here is slightly different from what I get when I do the multiplication sequentially. Furthermore, when I remove the reduction(+:dotProduct) clause, compute each product separately, and add them together later (sequentially), I get the same answer as the completely sequential method.
float productComponents[N];
#pragma omp parallel for private(i)
for (i = 0; i < N; i++)
productComponents[i] = vector1Host[i] * vector2Host[i];
for (i=0; i<N; i++)
dotProduct += productComponents[i];
The issue with this method is its performance. Please help me find the error in the first method, or suggest an alternative method with good performance.
Update:
I added the output from a sample run.
N=1000000: Ans=2251335.750000: Time(ms)=2.59163 //sequential
N=1000000: Ans=2251356.750000: Time(ms)=0.65846 //openmp

Floating-point addition is not associative, so your code can give differing and somewhat unpredictable results depending on the order in which the values are added to the accumulating variable.
Because OpenMP parallelises the loop, the reduction performs the additions in an effectively arbitrary order (each thread accumulates its own partial sum and the partial sums are then combined), which causes slightly different values due to this non-associativity of floating-point arithmetic.
Either accept this small variation or serialise the additions.
Another option would be to use a fixed-point (or exact-summation) library that can guarantee order-independent addition, in which case the answer would be the same regardless of the order in which the additions happen.
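If the discrepancy matters in practice, one common compromise is to keep the OpenMP reduction but accumulate in double precision; the result is still order-dependent, but the rounding differences shrink considerably. This is only a sketch (the function name dotProductDouble and its signature are mine, not from the question):

#include <omp.h>

double dotProductDouble(const float *a, const float *b, int n)
{
    double sum = 0.0;  /* wider accumulator than the float inputs */
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += (double)a[i] * (double)b[i];
    return sum;  /* still order-dependent, but far less sensitive to it */
}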

Related

OpenMP - Initializing 1D Array

I am trying to initialize a dynamic array using OpenMP in C but it seems to be slower than the serial method. The function I am using is
int* createArray(int size, int num) {
int i;
int* M = (int*)malloc(size*sizeof(int));
srand(time(NULL));
double start = omp_get_wtime();
#pragma omp parallel for num_threads(num)
for (i = 0; i < size; i++) {
M[i] = rand() % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
}
double end = omp_get_wtime();
printf("Create Array %f\n",end-start);
return M;
}
I get an array of the given size containing random numbers but the fewer threads I use the faster the function is. Am I doing something wrong?
In general, a parallel application running slower than the corresponding sequential implementation comes from either the overhead of starting the threads or a bottleneck caused by threads that are not fully independent (e.g. through shared data).
Here, the latter applies because you are calling rand(). That function relies on internal global state which has to be shared between the threads. A way to overcome this is to use a private seed for each thread. Furthermore, did you notice that your array is not really random when you have multiple threads? You could make the seed provided to srand() a function of omp_get_thread_num() to solve this.
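As a sketch of that idea (assuming a POSIX system, since rand_r() is POSIX rather than standard C), each thread can derive its own seed from omp_get_thread_num() and use the reentrant generator, so no state is shared. This would replace the timed loop inside createArray():

#include <stdlib.h>
#include <time.h>
#include <omp.h>

/* Sketch: per-thread seeds with rand_r() avoid contention on rand()'s
   shared internal state; size, num, M, MAX_NUMBER and MIN_NUMBER are
   the names from the question. */
#pragma omp parallel num_threads(num)
{
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < size; i++)
        M[i] = rand_r(&seed) % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
}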
I am fairly sure your program is suffering from a problem called "false sharing" of cache lines.
The article below explains it quite well:
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
This often affects performance a lot.
You can do a quick test: add the following clause to your omp pragma:
schedule(static, 16)
This should improve things considerably; then you can dig further into false sharing.
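For reference, a sketch of what the full directive would then look like (the clause is the only change to the question's code):

#pragma omp parallel for num_threads(num) schedule(static, 16)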

nested loops with pragma omp parallel for, jumbling up

I am trying to parallelize a code to run some simulations on a spiking neuron network. This involves a doubly nested loop, where I put a '#pragma omp parallel for' directive on the outer loop. Here's the code:
int main(void){
int i,j,count[200];
#pragma omp parallel for
for(i=0;i<200;i++){
count[i] = 0;
for (j=0;j<200;j++){
if (j!=i){
count[i]++;
printf("i: %d j: %d count[i]:%d, count[i]-j:%d\n",i,j,count[i], count[i]-j);
}
}
}
return 0;
}
Looking at the results, some of the values of count[i] exceed 200, even though the inner loop only runs 200 times. count[i]-j should only ever be 0, 1 or -1, but the values differ widely, even though each thread should work on one value of i, and the count array depends only on the current value of i. How do I rewrite the code so that I can safely increment count?
You must declare j as private. You can do so explicitly via:
#pragma omp parallel for private(j)
i is implicitly private, being the loop variable of the worksharing loop. count is implicitly shared because it is defined outside of the parallel region. Both are desirable here.
However, I strongly recommend always declaring variables as locally as possible, especially when using OpenMP. That way the implicit private/shared attributes are almost always right, and it avoids many subtle reads of undefined values. It is generally good practice:
int count[200];
#pragma omp parallel for
for(int i=0;i<200;i++){
count[i] = 0;
for (int j=0;j<200;j++){
BTW: Your printout of count[i]-j can show completely arbitrary values. It accesses data that is potentially being written concurrently by other threads.
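Putting both points together, a minimal sketch of the corrected loop (the printf is omitted here just to keep the sketch short) would be:

int count[200];
#pragma omp parallel for
for (int i = 0; i < 200; i++) {
    count[i] = 0;
    for (int j = 0; j < 200; j++) {
        if (j != i)
            count[i]++;  /* each thread only touches its own count[i] */
    }
}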

Get Loop Index at Start of OpenMP Loop

I have a for loop, each iteration of which is almost completely independent of every other iteration. I did some initial experimenting with OpenMP to see if I could speed it up. Using one simple directive, I got a three to four fold speed increase. The code was something akin to this:
#pragma omp parallel for default(none) shared(ptr1,ptr2) firstprivate(const2,k,inc,max)
for(i = 0; i < max; i += inc)
{
float *ptr1_ = (*ptr1)[i>>k][0];
float v = ptr2[i/const2];
// do stuff with data
}
So then I went off and optimized the single threaded code. In the process, I discovered I could increment pointers instead of indexing them, and for whatever reason, this yielded a reasonable speed improvement. The problem now is, I can't figure out a simple way to tell OpenMP where the pointers start for each thread. Essentially, what I came up with was the following.
#pragma omp parallel default(none) shared(ptr1,ptr2) firstprivate(const1,inc,max)
{
int chunk = max / (omp_get_num_threads()*inc);
chunk = (chunk < 1)? 1: chunk;
float *ptr1_ = &(*ptr1)[0][0] + chunk*omp_get_thread_num()*const1;
float *ptr2_ = ptr2 + chunk*omp_get_thread_num();
#pragma omp for schedule(static,chunk)
for(i = 0; i < max; i += inc)
{
// do stuff with data
ptr1_ += const1;
ptr2_++;
}
}
This seems to work, although it took me some time to figure out how to compute the pointer offsets since inc is not one and I wasn't sure how this affects the meaning of chunk. Further, I'm not so confident it works correctly around the edge cases (when max is small or not an integral multiple of num_threads*inc). Also the code is much more complicated, and there are direct calls to omp functions, which I did not have to rely on before. And finally, by forcing the scheduling method and chunk size, I'm restricting the OpenMP implementation's potential optimizations.
So my question is, is there any way to get the loop index of the thread at the start of the loop so I don't have to manually compute the pointer offsets in this convoluted way?

C and OpenMP : nowait for loop in a do loop

I'm trying to minimize some function with many parameters P by doing a Particle Swarm Optimization. What you need to know to help me is that this procedure requires the computation of a particular function (that I call foo) for different indices i (each index is linked to a set of parameters P). The time that foo spends on each i is unpredictable and can vary a lot between different i. As soon as one v[i] has been computed, I'd like to start the computation of another one. This procedure stops when one i optimizes the function (meaning that the corresponding set of parameters P has been found).
So I want to parallelize the computation with OpenMP. I do the following:
unsigned int N(5);
unsigned int last_called(0);
std::vector<double> v(N,0.0);
std::vector<bool> busy(N,false);
std::vector<unsigned int> already_ran(N,0);
std::vector<unsigned int> to_run_in_priority(N);
for(unsigned int i(0);i<N;i++){
to_run_in_priority[i]=i;
}
do{
#pragma omp parallel for nowait
for(unsigned int i=0;i<N;i++){
if(!busy[to_run_in_priority[i]]){
busy[to_run_in_priority[i]]=true;
already_ran[to_run_in_priority[i]]++;
foo(v[to_run_in_priority[i]]);
busy[to_run_in_priority[i]]=false;
}
/*update to_run_in_priority*/
}
} while (/*condition*/);
If, for instance, I have 4 threads and N=5, the program will enter the for loop and launch 4 threads. When the first i has been computed, it will launch the 5th one. But then what will happen?
Will the code continue, reach the while condition and enter the for loop again? If it does, as all the threads are busy, what will it do?
If what I want to do isn't clear, let me list what I want:
call foo for each i on a separate thread (number of threads < N)
if some thread isn't running anymore, call foo again for some i (the next i to run must be different from all currently running i, and it should be an i that has run fewer times than the others)
loop over the two previous items until a convergence criterion has been reached
If I'm not clear enough, don't hesitate to ask for clarification.
Abstracting somewhat from your code, you seem to want to write something like
#pragma omp parallel for
for(unsigned int i=0;i<N;i++){
v[i] = foo(i);
}
but you are concerned that, because the computational effort of calls to foo(i) varies greatly, this simple approach will be poorly balanced if each thread simply gets a range of values of i on which to operate.
You are probably right to be concerned, but I think, if my diagnosis is correct, that you are going the wrong way about balancing your program. The wrong way is trying to program the allocation of work to threads yourself.
Try this (pseudo-code) instead:
#pragma omp parallel for schedule(dynamic,10)
for(unsigned int i=0;i<N;i++){
v[i] = foo(i)
}
Notice the introduction of the schedule clause, in this case with the parameters dynamic and 10. This directs the run-time to hand out bunches of values of i, 10 elements at a time, to individual threads. Depending on the distribution of run-time across the values of i, and on the magnitude of N, this may be sufficient to balance the load.
Then again, it may not, and you may want to investigate the schedule clause a bit further, in particular both dynamic and guided scheduling.
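For illustration, in the same pseudo-code style as above, a guided schedule would look like this (with guided, the chunk argument is the minimum chunk size; the run-time starts with larger chunks and shrinks them as iterations are handed out):

#pragma omp parallel for schedule(guided, 4)
for(unsigned int i=0;i<N;i++){
    v[i] = foo(i);
}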
If none of this appeals, investigate the OpenMP task construct; I don't have time (nor, to be honest, the skill) to offer pseudo-code for that right now.
Finally, if I have misunderstood your question, which often happens, then this answer is probably worthless to you.
You could try something like this:
#pragma omp parallel
{
#pragma omp for schedule(dynamic) nowait
for(unsigned int i=0;i<N;i++){
//parallel code with foo(i)
}
#pragma omp single
{
//serial code
}
}
Let's say N is 5 and there are 4 threads. The four threads start running and the first thread to finish starts i=4 and the first thread to finish after that enters the single statement.
Thanks to your comments and answers, this is the solution I came up with.
unsigned int i(0);
unsigned int ip(0);
unsigned int N(10);
std::vector<bool> free_(N,true);
#pragma omp parallel for schedule(dynamic,1) firstprivate(ip)
for(unsigned int iter=0; iter<maxiter_; iter++){
#pragma omp critical
{
i++;
ip = (i-1) % particle_.size();
if(!free_[ip]){iter -= 1;}
}
if(free_[ip]){
free_[ip]=false;
if(ip<2){sleep(2);}
else{ sleep(5);}
free_[ip]=true;
}
}
With the few tests I did, it seems to work, but does anyone have arguments against what I did?

Using OpenMP stops GCC auto vectorising

I have been working on making my code auto-vectorisable by GCC; however, when I include the -fopenmp flag it seems to stop all attempts at auto vectorisation. I am using the -ftree-vectorize -ftree-vectorizer-verbose=5 flags to vectorise and monitor the process.
If I do not include the -fopenmp flag, it gives me a lot of information about each loop, whether it is vectorised and, if not, why. (The compilation then fails when I try to use the omp_get_wtime() function, since it can't be linked.) Once the flag is included, it simply lists every function and tells me it vectorised 0 loops in it.
I've read a few other places where the issue has been mentioned, but they don't really come to any solutions: http://software.intel.com/en-us/forums/topic/295858 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to explicitly tell it to?
There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case GCC 4.7.2 successfully vectorises the following simple loop:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i] * d;
At the same time GCC 4.6.1 does not, and it complains that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:
struct omp_fn_0_s
{
int N;
double *a;
double *b;
double *c;
double d;
};
void omp_fn_0(struct omp_fn_0_s *data)
{
int start, end;
int nthreads = omp_get_num_threads();
int threadid = omp_get_thread_num();
// This is just to illustrate the case - GCC uses slightly different formulas
start = (data->N * threadid) / nthreads;
end = (data->N * (threadid+1)) / nthreads;
for (int i = start; i < end; i++)
data->a[i] = data->b[i] + data->c[i] * data->d;
}
...
struct omp_fn_0_s omp_data_o;
omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;
GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();
N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:
struct fun_s
{
double *restrict a;
double *restrict b;
double *restrict c;
double d;
int n;
};
void fun1(double *restrict a,
double *restrict b,
double *restrict c,
double d,
int n)
{
int i;
for (i = 0; i < n; i++)
a[i] = b[i] + c[i] * d;
}
void fun2(struct fun_s *par)
{
int i;
for (i = 0; i < par->n; i++)
par->a[i] = par->b[i] + par->c[i] * par->d;
}
One would expect that both codes (notice - no OpenMP here!) should vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7 - it successfully vectorises the loop in fun1 but fails to vectorise that in fun2 citing the same reason as when it compiles the OpenMP code.
The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. This is not always the case with fun1, where two cases are possible:
d is passed as a value argument in a register;
d is passed as a value argument on the stack.
On x64 systems the System V ABI mandates that the first several floating-point arguments get passed in the XMM registers (YMM on AVX-enabled CPUs). That's how d gets passed in this case and hence no pointer can ever point to it - the loop gets vectorised. On x86 systems the ABI mandates that arguments are passed onto the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.
GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d get aliased.
Getting rid of d removes the aliasing that cannot be disproven, and the following OpenMP code gets vectorised even by GCC 4.6.1:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i];
I'll try to briefly answer your question.
Does OpenMP have its own way of handling vectorisation?
Yes... but only starting from the upcoming OpenMP 4.0. The link posted above provides good insight into this construct. The current OpenMP 3.1, on the other hand, is not "aware" of the SIMD concept. What happens in practice (or, at least, in my experience) is that auto-vectorization mechanisms are inhibited whenever an OpenMP worksharing construct is used on a loop. Anyhow, the two concepts are orthogonal and you can still benefit from both (see this other answer).
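For illustration, a sketch of what the OpenMP 4.0 construct looks like on the loop used earlier in this thread (this assumes a compiler with OpenMP 4.0 support; OpenMP 3.1 does not understand it):

/* OpenMP 4.0: distribute iterations across threads and ask each thread to use SIMD. */
#pragma omp parallel for simd schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i] * d;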
Do I need to explicitly tell it to?
I am afraid so, at least at present. I would start rewriting the loops under consideration in a way that makes vectorization explicit (i.e. using intrinsics on Intel platforms, AltiVec on IBM, and so on).
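As a rough sketch of that approach (my own illustration, not code from the answer; the function name fused_loop is made up), the a[i] = b[i] + c[i] * d loop from above could be written with SSE2 intrinsics and combined with an OpenMP worksharing loop roughly like this:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch only: explicit SSE2 vectorization of a[i] = b[i] + c[i] * d,
   combined with OpenMP worksharing. Unaligned loads/stores are used so
   no alignment assumption is needed; the tail is handled by a scalar loop. */
void fused_loop(double *a, const double *b, const double *c, double d, int n)
{
    const int nvec = n & ~1;           /* largest even trip count */
    const __m128d vd = _mm_set1_pd(d);
    int i;
    #pragma omp parallel for
    for (i = 0; i < nvec; i += 2) {
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_loadu_pd(&c[i]);
        _mm_storeu_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
    }
    for (i = nvec; i < n; i++)         /* scalar remainder */
        a[i] = b[i] + c[i] * d;
}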
You are asking "why GCC can't do vectorization when OpenMP is enabled?".
It seems that this may be a bug in GCC :)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
Otherwise, the OpenMP API may introduce dependencies (either control or data) that prevent automatic vectorization. To auto-vectorize, a given piece of code must be free of data and control dependencies. It's possible that using OpenMP introduces some spurious dependencies.
Note: OpenMP (prior to 4.0) targets thread-level parallelism, which is orthogonal to SIMD/vectorization. A program can use both OpenMP and SIMD parallelism at the same time.
I ran across this post while searching for comments about the gcc 4.9 option -fopenmp-simd, which should activate OpenMP 4 #pragma omp simd without activating omp parallel (threading). gcc bugzilla pr60117 (confirmed) shows a case where the pragma omp prevents auto-vectorization which otherwise occurs without the pragma.
gcc doesn't vectorize omp parallel for even with the simd clause (parallel regions can auto-vectorize only the inner loop nested under a parallel for). I don't know any compiler other than icc 14.0.2 which could be recommended for implementation of #pragma omp parallel for simd; with other compilers, SSE intrinsics coding would be required to get this effect.
The Microsoft compiler doesn't perform any auto-vectorization inside parallel regions in my tests, which show clear superiority of gcc for such cases.
Combined parallelization and vectorization of a single loop has several difficulties, even with the best implementation. I seldom see more than a 2x or 3x speedup from adding vectorization to a parallel loop. Vectorization with the AVX double data type, for example, effectively cuts the chunk size by a factor of 4. A typical implementation can achieve aligned data chunks only when the entire array is aligned and the chunks are also exact multiples of the vector width. When the chunks are not all aligned, there is inherent work imbalance due to the varying alignments.
