I am trying to initialize a dynamic array using OpenMP in C but it seems to be slower than the serial method. The function I am using is
int* createArray(int size, int num) {
    int i;
    int* M = (int*)malloc(size*sizeof(int));
    srand(time(NULL));
    double start = omp_get_wtime();
    #pragma omp parallel for num_threads(num)
    for (i = 0; i < size; i++) {
        M[i] = rand() % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
    }
    double end = omp_get_wtime();
    printf("Create Array %f\n", end-start);
    return M;
}
I get an array of the given size containing random numbers, but the fewer threads I use, the faster the function runs. Am I doing something wrong?
In general, when a parallel application runs slower than the corresponding sequential implementation, the cause is usually either the overhead of starting the threads or a bottleneck caused by threads that are not perfectly independent (e.g. through shared data).
Here, the former applies because you are calling rand(). That function relies on global state that has to be shared between the threads. A way to overcome this is to use a private seed (and private generator state) for each thread. Furthermore, did you notice that your array is not really random when you have multiple threads? You could make the seed provided to srand() a function of omp_get_thread_num() to solve this.
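One way to act on both points is to give each thread its own generator state via rand_r() instead of seeding the global generator. A minimal sketch, assuming the POSIX rand_r() is available and using a seed derived from the time and the thread number (MAX_NUMBER and MIN_NUMBER are the macros from the question):

int* createArray(int size, int num) {
    int* M = (int*)malloc(size * sizeof(int));
    double start = omp_get_wtime();
    #pragma omp parallel num_threads(num)
    {
        /* private per-thread seed: no shared generator state between threads */
        unsigned int seed = (unsigned int)time(NULL) ^ ((unsigned int)omp_get_thread_num() + 1u);
        #pragma omp for
        for (int i = 0; i < size; i++)
            M[i] = rand_r(&seed) % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
    }
    double end = omp_get_wtime();
    printf("Create Array %f\n", end - start);
    return M;
}

Because each thread only touches its own seed, there is nothing for the threads to contend on inside the loop.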
I'm quite sure your program is suffering from a cache problem called "false sharing".
The article below explains it quite well.
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
This often hurts performance a lot.
You can do a quick test: add the clause below to your omp pragma.
schedule(static, 16)
This should improve things considerably; then you can dig further into false sharing.
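Applied to the loop from the question, the quick test would look roughly like this (only the schedule clause is added; the rand() contention discussed in the other answer is still present):

#pragma omp parallel for num_threads(num) schedule(static, 16)
for (i = 0; i < size; i++) {
    M[i] = rand() % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
}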
I have the following code, which I would like to make parallel (pseudocode):
int na = 10000000;
int nb = na;
double A[na];
double B[2*na];
double a;
int i;
for (int j=0; j<nb; j++)
{
    i = rand() % na;
    A[i] += 5.0*i;
    B[i+10] += 6.0*i*i;
}
Of course, I cannot use #pragma omp parallel for because sometimes (which cannot be predicted) the same element will be accessed by two threads at the same time. How can this block of code be parallelized? Thank you
There are two ways to do that:
Use an atomic update on the values
#pragma omp parallel for
for (int j=0; j<nb; j++)
{
    // make sure to declare i locally!
    int i = fun();
    #pragma omp atomic
    A[i] += 5.0*i;
}
This is the simplest way. Each write is executed atomically and is therefore more expensive. You also need to consider that accessing adjacent elements from multiple threads becomes expensive (false sharing). Use this if A is large and you do a lot of computation per update.
Use an array-reduction
#pragma omp parallel for reduction(+:A)
for (int j=0; j<nb; j++)
{
    // make sure to declare i locally!
    int i = fun();
    A[i] += 5.0*i;
}
This creates a local copy of A for each thread, which is added to the outside A after the parallel region. This requires more memory and some computation afterwards, but the parallel code itself can run most efficiently. Use this if A is small and there is little computation per update.
BTW: never use rand() in parallel applications; it is not required to be thread-safe, and it is sometimes implemented with a lock, which makes it horribly inefficient.
EDIT: In your example with B, you can safely apply either omp atomic or a reduction to each statement separately, since each operation only needs to be performed atomically on its own.
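For instance, sticking with the atomic variant, both updates from the question could be handled like this (a sketch; fun() is the index-computation placeholder used above):

#pragma omp parallel for
for (int j = 0; j < nb; j++)
{
    int i = fun();            // placeholder for the index computation
    #pragma omp atomic
    A[i] += 5.0 * i;
    #pragma omp atomic
    B[i + 10] += 6.0 * i * i;
}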
I am using the following code for finding the sum of the elements of an array using OpenMP tasks construct.
The code is yielding the correct results till n = 10000.
But beyond that, I am getting a segmentation fault. Using gdb, I found that the fault occurs in one of the recursive calls to reduce(). There is no problem with the input array allocation, and I have verified that.
Does anyone have any suggestion on what the problem might be?
int reduce (int *arr, unsigned long int n)
{
    int x;
    if (n <= 0)
        return 0;
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            #pragma omp task shared(x)
            x = reduce(arr, n-1) + arr[n-1];
            #pragma omp taskwait
        }
    }
    return x;
}
It looks like you are hitting a stack overflow caused by the recursion depth of the function calls. Remember that most OpenMP pragmas generate functions themselves, which probably prevents the tail-recursion optimization.
If you run via valgrind, it should warn you about overflowing the stack.
dlasalle is correct about the actual error.
However, there are two more fundamental issues with how you use OpenMP tasks. You spawn a parallel region in each recursive call, which means you are using nested parallel regions. Nested parallelism is disabled by default in OpenMP, and it doesn't make sense here anyway: you want all the tasks spawned during the recursion to be executed by the same pool of threads. To do that, you have to move the parallel/single outside of the recursion, e.g.
int reduce_par(int* arr, unsigned long int n)
{
    int x;
    if (n <= 0)
        return 0;
    #pragma omp task shared(x)
    x = reduce_par(arr, n - 1) + arr[n - 1];
    #pragma omp taskwait
    return x;
}

int reduce(int* arr, unsigned long int n)
{
    int x;
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            x = reduce_par(arr, n);
        }
    }
    return x;  // visible here thanks to the implicit barrier at the end of the parallel region
}
Even if this didn't segfault, and even if you had an infinite number of cores, infinite memory bandwidth, and no thread-creation overhead, this still wouldn't provide any performance benefit from the parallelization. To see why, draw the graph of tasks and their operations and add the dependencies. Then try to arrange the nodes of the graph along a time axis while respecting the task dependencies, and see whether anything at all can be computed in parallel.
The right solution for a parallel summation is a parallel for worksharing construct with a reduction clause. If you had to use tasks, you would need divide and conquer, e.g. spawning two tasks for the two halves of the array. And to get reasonable performance, you have to stop the task creation/recursion at some minimal workload size to keep the overhead manageable; a sketch of that approach follows below.
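A minimal, untested sketch of that divide-and-conquer idea (the CUTOFF value below is arbitrary and only for illustration):

#define CUTOFF 10000  /* stop spawning tasks below this size; tune for your machine */

int reduce_par(int *arr, unsigned long int n)
{
    if (n == 0)
        return 0;
    if (n <= CUTOFF) {                 /* small enough: sum sequentially */
        int sum = 0;
        for (unsigned long int i = 0; i < n; i++)
            sum += arr[i];
        return sum;
    }
    int left, right;
    unsigned long int half = n / 2;
    #pragma omp task shared(left)      /* left half as one task */
    left = reduce_par(arr, half);
    #pragma omp task shared(right)     /* right half as another task */
    right = reduce_par(arr + half, n - half);
    #pragma omp taskwait
    return left + right;
}

int reduce(int *arr, unsigned long int n)
{
    int result;
    #pragma omp parallel
    {
        #pragma omp single nowait
        result = reduce_par(arr, n);
    }
    return result;
}

Below the cutoff the summation runs serially, so the task overhead is only paid on the coarse levels of the recursion.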
I have a for loop, each iteration of which is almost completely independent of every other iteration. I did some initial experimenting with OpenMP to see if I could speed it up. Using one simple directive, I got a three to four fold speed increase. The code was something akin to this:
#pragma omp parallel for default(none) shared(ptr1,ptr2) firstprivate(const2,k,inc,max)
for (i = 0; i < max; i += inc)
{
    float *ptr1_ = (*ptr1)[i>>k][0];
    float v = ptr2[i/const2];
    // do stuff with data
}
So then I went off and optimized the single threaded code. In the process, I discovered I could increment pointers instead of indexing them, and for whatever reason, this yielded a reasonable speed improvement. The problem now is, I can't figure out a simple way to tell OpenMP where the pointers start for each thread. Essentially, what I came up with was the following.
#pragma omp parallel default(none) shared(ptr1,ptr2) firstprivate(const1,inc,max)
{
    int chunk = max / (omp_get_num_threads()*inc);
    chunk = (chunk < 1) ? 1 : chunk;
    float *ptr1_ = &(*ptr1)[0][0] + chunk*omp_get_thread_num()*const1;
    float *ptr2_ = ptr2 + chunk*omp_get_thread_num();
    #pragma omp for schedule(static,chunk)
    for (i = 0; i < max; i += inc)
    {
        // do stuff with data
        ptr1_ += const1;
        ptr2_++;
    }
}
This seems to work, although it took me some time to figure out how to compute the pointer offsets since inc is not one and I wasn't sure how this affects the meaning of chunk. Further, I'm not so confident it works correctly around the edge cases (when max is small or not an integral multiple of num_threads*inc). Also the code is much more complicated, and there are direct calls to omp functions, which I did not have to rely on before. And finally, by forcing the scheduling method and chunk size, I'm restricting the OpenMP implementation's potential optimizations.
So my question is, is there any way to get the loop index of the thread at the start of the loop so I don't have to manually compute the pointer offsets in this convoluted way?
I am trying to parallelize a vector dot product program using OpenMP. The following code shows what I did.
#define N 1000000
float dotProduct = 0;
float vector1Host[N], vector2Host[N]; // each element of the vectors is initialized to a value between 1 and 2

#pragma omp parallel for private(i) reduction(+:dotProduct)
for (i = 0; i < N; i++)
    dotProduct += vector1Host[i] * vector2Host[i];
The answer I get here is slightly different from what I get when I do the multiplication sequentially. Further, when I remove the reduction(+:dotProduct) clause, calculate the product of each pair of elements separately, and add them together later (sequentially), I get the same answer as the completely sequential method.
float productComponents[N];

#pragma omp parallel for private(i)
for (i = 0; i < N; i++)
    productComponents[i] = vector1Host[i] * vector2Host[i]; // '=' rather than '+=': each element is written exactly once

for (i = 0; i < N; i++)
    dotProduct += productComponents[i];
The issue with this method is performance. Please help me find the error in the first method, or suggest an alternative method with good performance.
Update:
I added the output from a sample run.
N=1000000: Ans=2251335.750000: Time(ms)=2.59163 //sequential
N=1000000: Ans=2251356.750000: Time(ms)=0.65846 //openmp
Floating point addition is not associative. Therefore your code can give differing, and possibly unpredictable, results depending on the order in which the values are added to the accumulating variable.
Because OpenMP parallelises the loop, the additions are performed in an arbitrary order, which produces slightly different values due to this non-associativity of floating point addition.
Either you need to accept this unpredictability or serialise the additions.
The other option would be to use a fixed-point library: fixed-point addition is exact (and therefore order-independent), in which case the answer would be predictable regardless of the order in which the additions happen.
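For illustration (this is not part of the original answer), one way to serialise the additions deterministically while still computing the products in parallel is to collect per-thread partial sums over statically assigned blocks and combine them in a fixed order; the result is then reproducible as long as the thread count stays the same:

int nthreads = omp_get_max_threads();
float *partial = (float*)calloc(nthreads, sizeof(float)); /* one slot per thread */
float dotProduct = 0.0f;

#pragma omp parallel
{
    int t = omp_get_thread_num();
    float local = 0.0f;
    #pragma omp for schedule(static)   /* static: the same iteration split on every run */
    for (int i = 0; i < N; i++)
        local += vector1Host[i] * vector2Host[i];
    partial[t] = local;
}

/* combine the partial sums in thread-id order, so the rounding is fixed */
for (int t = 0; t < nthreads; t++)
    dotProduct += partial[t];
free(partial);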
I'm trying to minimize some function with many parameters P using Particle Swarm Optimization. What you need to know to help me is that this procedure requires computing a particular function (which I call foo) for different indices i (each index is linked to a set of parameters P). The time foo spends on each i is unpredictable and can vary a lot between different i. As soon as one v[i] has been computed, I'd like to start the computation of another one. The procedure stops when some i optimizes the function (meaning that the corresponding set of parameters P has been found).
So I want to parallelize the computation with OpenMP. I do the following:
unsigned int N(5);
unsigned int last_called(0);
std::vector<double> v(N,0.0);
std::vector<bool> busy(N,false);
std::vector<unsigned int> already_ran(N,0);
std::vector<unsigned int> to_run_in_priority(N);
for (unsigned int i(0); i<N; i++) {
    to_run_in_priority[i] = i;
}
do {
    #pragma omp parallel for nowait
    for (unsigned int i=0; i<N; i++) {
        if (!busy[to_run_in_priority[i]]) {
            busy[to_run_in_priority[i]] = true;
            already_ran[to_run_in_priority[i]]++;
            foo(v[to_run_in_priority[i]]);
            busy[to_run_in_priority[i]] = false;
        }
        /*update to_run_in_priority*/
    }
} while (/*condition*/);
If, for instance, I have 4 threads and N=5, the program will enter the for loop and launch 4 threads. When the first i has been computed, it will launch the 5th one. But then what will happen?
Will the code continue, reach the while condition, and enter the for loop again? If it does, since all the threads are busy, what will it do?
If what I want to do isn't clear, let me list what I want:
call foo for each i on a separate thread (number of threads < N)
if a thread is no longer running, call foo again for some i (the next i to run must be different from all the currently running i, and it should be an i that has run fewer times than the others)
loop over the two previous items until the convergence criterion has been reached.
If I'm not clear enough, don't hesitate to ask for details.
Abstracting somewhat from your code, you seem to want to write something like
#pragma omp parallel for
for (unsigned int i=0; i<N; i++) {
    v[i] = foo(i);
}
but you are concerned that, because the computational effort of calls to foo(i) varies greatly, this simple approach will be poorly balanced if each thread simply gets a range of values of i on which to operate.
You are probably right to be concerned, but if my diagnosis is correct, you are going about balancing your program the wrong way: you are trying to program the allocation of work to threads yourself.
Try this (pseudo-code) instead:
#pragma omp parallel for schedule(dynamic,10)
for (unsigned int i=0; i<N; i++) {
    v[i] = foo(i);
}
Notice the introduction of the schedule clause, in this case with the parameters dynamic and 10. This directs the run-time to hand out bunches of values of i, 10 at a time, to individual threads. Depending on the distribution of run-times across the values of i, and on the magnitude of N, this may be sufficient to balance the load.
Then again, it may not, and you may want to investigate the schedule clause a bit further, in particular both dynamic and guided scheduling.
If none of this appeals investigate the OpenMP task construct; I don't have time (nor, to be honest, the skill) to offer pseudo-code for that right now.
Finally, if I have misunderstood your question, which often happens, then this answer is probably worthless to you.
You could try something like this:
#pragma omp parallel
{
    #pragma omp for schedule(dynamic) nowait
    for (unsigned int i=0; i<N; i++) {
        // parallel code with foo(i)
    }
    #pragma omp single
    {
        // serial code
    }
}
Let's say N is 5 and there are 4 threads. The four threads start running and the first thread to finish starts i=4 and the first thread to finish after that enters the single statement.
Thanks to your comments and answers, this is the solution I came up with.
unsigned int i(0);
unsigned int ip(0);
unsigned int N(10);
std::vector<bool> free_(N,true);
#pragma omp parallel for schedule(dynamic,1) firstprivate(ip)
for (unsigned int iter=0; iter<maxiter_; iter++) {
    #pragma omp critical
    {
        i++;
        ip = (i-1) % particle_.size();
        if (!free_[ip]) { iter -= 1; }
    }
    if (free_[ip]) {
        free_[ip] = false;
        if (ip<2) { sleep(2); }
        else { sleep(5); }
        free_[ip] = true;
    }
}
With the few tests I did, it seems to work, but does anyone have arguments against what I did?