I'm trying to minimize some function with many parameters P using Particle Swarm Optimization. What you need to know to help me is that this procedure requires the computation of a particular function (that I call foo) for different indices i (each index is linked to a set of parameters P). The time that foo spends on each i is unpredictable and can vary a lot between different i. As soon as one v[i] has been computed, I'd like to start the computation of another one. This procedure stops when one i optimizes the function (meaning that the corresponding set of parameters P has been found).
So I want to parallelize the computation with OpenMP. I do the following:
unsigned int N(5);
unsigned int last_called(0);
std::vector<double> v(N,0.0);
std::vector<bool> busy(N,false);
std::vector<unsigned int> already_ran(N,0);
std::vector<unsigned int> to_run_in_priority(N);
for(unsigned int i(0);i<N;i++){
to_run_in_priority[i]=i;
}
do{
#pragma omp parallel for nowait
for(unsigned int i=0;i<N;i++){
if(!busy[to_run_in_priority[i]]){
busy[to_run_in_priority[i]]=true;
already_ran[to_run_in_priority[i]]++;
foo(v[to_run_in_priority[i]]);
busy[to_run_in_priority[i]]=false;
}
/*update to_run_in_priority*/
}
} while (/*condition*/);
Say, for instance, I have 4 threads and N=5. The program will enter the for loop and launch 4 threads. When the first i has been computed, it will launch the 5th one. But then what will happen?
Will the code continue, reach the while condition and enter the for loop again? If it does, since all the threads are busy, what will it do?
If what I want to do isn't clear, let me list what I want:
call foo for each i on a separate thread (thread_numbers<N)
if some thread isn't running anymore, call foo again for some i (the next i that should run must be different from all other currently running i, and it should be an i that has run fewer times than the others).
loop over the two previous items until the convergence criterion has been reached.
If I'm not clear enough, don't hesitate to ask for clarification.
Abstracting somewhat from your code you seem to want to write something like
#pragma omp parallel for
for(unsigned int i=0;i<N;i++){
v[i] = foo(i);
}
but you are concerned that, because the computational effort of calls to foo(i) varies greatly, this simple approach will be poorly balanced if each thread simply gets a range of values of i on which to operate.
You are probably right to be concerned, but I think, if my diagnosis is correct, that you are going the wrong way about balancing your program. The wrong way is to try to program the allocation of work to threads yourself.
Try this (pseudo-code) instead:
#pragma omp parallel for schedule(dynamic,10)
for(unsigned int i=0;i<N;i++){
v[i] = foo(i);
}
Notice the introduction of the schedule clause, in this case with parameters dynamic and 10. This directs the run-time to hand out bunches of values of i, 10 elements at a time, to individual threads. Depending on the distribution of run times across the values of i, and on the magnitude of N, this may be sufficient to balance the load.
Then again, it may not, and you may want to investigate the schedule clause a bit further, in particular both dynamic and guided scheduling.
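For illustration, a guided schedule hands out large chunks first and progressively smaller ones down to the minimum you specify, which can help when the expensive iterations are clustered. A minimal sketch, assuming foo(i) returns the value to store (the minimum chunk size of 4 is arbitrary):
#pragma omp parallel for schedule(guided,4)
for(unsigned int i=0;i<N;i++){
v[i] = foo(i);
}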
If none of this appeals, investigate the OpenMP task construct; I don't have time (nor, to be honest, the skill) to offer pseudo-code for that right now.
Finally, if I have misunderstood your question, which often happens, then this answer is probably worthless to you.
You could try something like this:
#pragma omp parallel
{
#pragma omp for schedule(dynamic) nowait
for(unsigned int i=0;i<N;i++){
//parallel code with foo(i)
}
#pragma omp single
{
//serial code
}
}
Let's say N is 5 and there are 4 threads. The four threads start running; the first thread to finish starts i=4, and the first thread to finish after that enters the single block.
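As a rough sketch of how that pattern might wrap around your original convergence loop (the serial block is a placeholder, and because of the nowait it may overlap with the last foo calls of the sweep):
bool converged = false;
do{
#pragma omp parallel
{
#pragma omp for schedule(dynamic) nowait
for(unsigned int i=0;i<N;i++){
foo(v[i]); //expensive evaluation, duration varies per i
}
#pragma omp single
{
/*serial code: update to_run_in_priority, set converged*/
}
} //implicit barrier at the end of the parallel region
}while(!converged);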
Thanks to your comments and answers, this is the solution I came up with.
unsigned int i(0);
unsigned int ip(0);
unsigned int N(10);
std::vector<bool> free_(N,true);
#pragma omp parallel for schedule(dynamic,1) firstprivate(ip)
for(unsigned int iter=0; iter<maxiter_; iter++){
#pragma omp critical
{
i++;
ip = (i-1) % particle_.size();
if(!free_[ip]){iter -= 1;} //particle busy: give this iteration back
}
if(free_[ip]){
free_[ip]=false;
if(ip<2){sleep(2);} //stand-in for a fast foo call
else{ sleep(5);} //stand-in for a slow foo call
free_[ip]=true;
}
}
With the few tests I did it seems to work, but does anyone have arguments against what I did?
I have the following code which I would like to make parallel (pseudocode):
int na = 10000000;
int nb = na;
double A[na];
double B[2*na];
double a;
for(int j=0;j<nb;j++)
{
i = rand() % na;
A[i]+=5.0*i;
B[i+10]+=6.0*i*i;
}
Of course, I cannot use #pragma omp parallel for because sometimes (which cannot be predicted) the same element will be accessed by two threads at the same time. How can this block of code be parallelized? Thank you
There are two ways to do that:
Use an atomic update on the values
#pragma omp parallel for
for(int j=0;j<nb;j++)
{
// make sure to declare i locally!
int i = fun();
#pragma omp atomic
A[i]+=5.0*i;
}
This is the simplest way. Each write is executed atomically and is therefore more expensive. You also need to consider that accessing adjacent elements from multiple threads becomes expensive (false sharing). Use this if A is large and you do a lot of computation per update.
Use an array-reduction
#pragma omp parallel for reduction(+:A)
for(int j=0;j<nb;j++)
{
// make sure to declare i locally!
int i = fun();
A[i]+=5.0*i;
}
This creates a local copy of A for each thread, which is added to the outer A after the parallel region. This requires more memory and some computation afterwards, but the parallel code itself can work most efficiently. Use this if A is small and there is little computation per update.
BTW: Never use rand() in parallel applications; it is not defined to be thread-safe, and it is sometimes implemented with a lock, which becomes horribly inefficient.
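If you do need random indices inside the parallel loop, one common alternative, sketched below with POSIX rand_r and an arbitrary per-thread seed, is to keep the random state local to each thread:
// needs <stdlib.h> for rand_r and <omp.h> for omp_get_thread_num
#pragma omp parallel
{
unsigned int seed = 1234u + omp_get_thread_num(); // arbitrary per-thread seed
#pragma omp for
for(int j=0;j<nb;j++)
{
int i = rand_r(&seed) % na; // thread-safe: the state is private to this thread
#pragma omp atomic
A[i]+=5.0*i;
}
}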
EDIT: In your example with B you can safely apply either omp atomic or reduction separately to each statement, since each operation only needs to be performed atomically, independently of the other.
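For instance, the atomic variant applied to both statements might look like this sketch (fun() again stands in for the index computation):
#pragma omp parallel for
for(int j=0;j<nb;j++)
{
int i = fun(); // placeholder index computation, as above
#pragma omp atomic
A[i]+=5.0*i;
#pragma omp atomic
B[i+10]+=6.0*i*i; // the two updates are independent of each other
}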
I am using the following code to find the sum of the elements of an array with the OpenMP task construct.
The code yields the correct results up to n = 10000.
But beyond that, I am getting a segmentation fault. Using gdb, I found that the fault occurs in one of the recursive calls to reduce(). There is no problem with the input array allocation, and I have verified that.
Does anyone have any suggestion on what the problem might be?
int reduce (int *arr, unsigned long int n)
{
int x;
if (n <= 0)
return 0;
#pragma omp parallel
{
#pragma omp single nowait
{
#pragma omp task shared(x)
x = reduce(arr, n-1) + arr[n-1];
#pragma omp taskwait
}
}
return x;
}
It looks like you are encountering a "stack overflow" via the recursion depth of the function calls. Remember that most OpenMP pragmas generate functions themselves, which are probably interfering with the tail-recursion optimization.
If you run via valgrind, it should warn you about overflowing the stack.
dlasalle is correct about the actual error.
However, there are two more fundamental issues on how you use OpenMP tasks. You spawn a parallel region within each recursive call. This means you use nested parallel regions. By default nested parallelism is disabled in OpenMP, and it doesn't make sense here. You want all tasks you spawn during the recursion to be executed by the same pool of threads. To do that, you have to move the parallel/single outside of the recursion, e.g.
int reduce_par(int* arr, unsigned long int n)
{
int x;
if (n <= 0)
return 0;
#pragma omp task shared(x)
x = reduce_par(arr, n - 1) + arr[n - 1];
#pragma omp taskwait
return x;
}
int reduce(int* arr, unsigned long int n)
{
int x = 0;
#pragma omp parallel
{
#pragma omp single nowait
{
x = reduce_par(arr, n);
}
}
return x;
}
Even if this didn't segfault, and even if you had an infinite number of cores, infinite memory bandwidth and no thread-creation overhead, this still wouldn't provide any performance benefit from the parallelization. To see why, draw the graph of tasks and their operations and add the dependencies. Try to arrange the nodes of the graph along a time axis respecting the task dependencies and see if anything at all can be computed in parallel: each task just waits for the single task it spawns, so the whole computation is one serial chain.
The right solution for a parallel summation is a parallel for worksharing construct with a reduction clause. And if you had to use tasks, you would use divide and conquer, e.g. spawn two tasks for the two halves of the array. To get reasonable performance you also have to stop the task creation / recursion at some minimal workload size in order to keep the overhead manageable.
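A rough sketch of both variants (the cutoff of 1000 elements is arbitrary):
// Worksharing loop with a reduction clause: the straightforward parallel sum.
int reduce_for(int *arr, unsigned long int n)
{
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for(unsigned long int i = 0; i < n; i++)
sum += arr[i];
return sum;
}
// Divide and conquer with tasks: spawn a task per half and stop recursing
// below an arbitrary cutoff so the task overhead stays manageable.
// Like reduce_par above, this must be called from within a parallel/single region.
int reduce_task(int *arr, unsigned long int n)
{
if (n <= 1000) { // cutoff: sum this block serially
int sum = 0;
for(unsigned long int i = 0; i < n; i++)
sum += arr[i];
return sum;
}
int left, right;
#pragma omp task shared(left)
left = reduce_task(arr, n / 2);
#pragma omp task shared(right)
right = reduce_task(arr + n / 2, n - n / 2);
#pragma omp taskwait
return left + right;
}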
I am trying to parallelize a code to run some simulations of a spiking neuron network. This involves a double loop, where I put a '#pragma omp parallel for' directive before the outer loop. Here's the code:
#include <stdio.h>

int main(void){
int i,j,count[200];
#pragma omp parallel for
for(i=0;i<200;i++){
count[i] = 0;
for (j=0;j<200;j++){
if (j!=i){
count[i]++;
printf("i: %d j: %d count[i]:%d, count[i]-j:%d\n",i,j,count[i], count[i]-j);
}
}
}
return 0;
}
Looking at the results, some of the values of count[i] exceed 200, even though the loop only goes from 0 to 199. count[i]-j should only be 0, 1 or -1, but the values differ widely, even though each thread should work on one value of i, and the count array depends only on the current value of i. How do I rewrite the code so that I can safely increment count?
You must declare j as private. You can do so explicitly via:
#pragma omp parallel for private(j)
i is implicitly private, being the loop variable of the worksharing loop. count is implicitly shared because it is defined outside of the parallel region. Both of these defaults are what you want here.
However, I strongly recommend always declaring variables as locally as possible, especially when using OpenMP. That way the implicit private/shared classification is almost always right, and it avoids lots of subtle reads of undefined values. It is generally good practice:
int count[200];
#pragma omp parallel for
for(int i=0;i<200;i++){
count[i] = 0;
for (int j=0;j<200;j++){
BTW: Your printout of count[i]-j can show completely arbitrary values. It accesses data (the shared j) that is potentially concurrently written by other threads.
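Putting the two points together, a complete fixed version might look like this sketch; with j private, count[i] never exceeds 199 and count[i]-j only prints 0 or 1 (the output lines are of course still interleaved across threads):
#include <stdio.h>

int main(void){
int count[200];
#pragma omp parallel for
for(int i=0;i<200;i++){
count[i] = 0; //touched only by the thread that owns iteration i
for (int j=0;j<200;j++){ //j declared locally, hence private
if (j!=i){
count[i]++;
printf("i: %d j: %d count[i]:%d, count[i]-j:%d\n",i,j,count[i], count[i]-j);
}
}
}
return 0;
}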
I have a for loop, each iteration of which is almost completely independent of every other iteration. I did some initial experimenting with OpenMP to see if I could speed it up. Using one simple directive, I got a three to four fold speed increase. The code was something akin to this:
#pragma omp parallel for default(none) shared(ptr1,ptr2) firstprivate(const2,k,inc,max)
for(i = 0; i < max; i += inc)
{
float *ptr1_ = (*ptr1)[i>>k][0];
float v = ptr2[i/const2];
// do stuff with data
}
So then I went off and optimized the single threaded code. In the process, I discovered I could increment pointers instead of indexing them, and for whatever reason, this yielded a reasonable speed improvement. The problem now is, I can't figure out a simple way to tell OpenMP where the pointers start for each thread. Essentially, what I came up with was the following.
#pragma omp parallel default(none) shared(ptr1,ptr2) firstprivate(const1,inc,max)
{
int chunk = max / (omp_get_num_threads()*inc);
chunk = (chunk < 1)? 1: chunk;
float *ptr1_ = &(*ptr1)[0][0] + chunk*omp_get_thread_num()*const1;
float *ptr2_ = ptr2 + chunk*omp_get_thread_num();
#pragma omp for schedule(static,chunk)
for(i = 0; i < max; i += inc)
{
// do stuff with data
ptr1_ += const1;
ptr2_++;
}
}
This seems to work, although it took me some time to figure out how to compute the pointer offsets since inc is not one and I wasn't sure how this affects the meaning of chunk. Further, I'm not so confident it works correctly around the edge cases (when max is small or not an integral multiple of num_threads*inc). Also the code is much more complicated, and there are direct calls to omp functions, which I did not have to rely on before. And finally, by forcing the scheduling method and chunk size, I'm restricting the OpenMP implementation's potential optimizations.
So my question is, is there any way to get the loop index of the thread at the start of the loop so I don't have to manually compute the pointer offsets in this convoluted way?
I am trying to run the following code to understand the functionality of the OpenMP lastprivate clause. As per the definition of lastprivate, if I declare a variable lastprivate, it is private to every thread, and the value from the thread executing the last iteration of a parallel loop in sequential order is copied to the variable outside of the region.
Here is the code:
#include <stdio.h>
#include <omp.h>

int main(void)
{
omp_set_num_threads(5);
int i;
int k =3;
#pragma omp parallel private(i)
{
#pragma omp for lastprivate(k)
for(i=0; i< 5; i++ )
{
int iam = omp_get_thread_num();
k = iam;
printf("k=%d, iam=%d\t",k, iam);
}
}
printf("\n k = %d", k);
}
It produces output something like this:
k=0, iam=0 k=4, iam=4 k=3, iam=3 k=2, iam=2 k=1, iam=1
k = 4
When we have a team of threads working on a 'for', we cannot really guarantee which thread executes last. Accordingly, the value from the last thread should be reflected in the global 'k'. However, no matter how many times I run the code, the value of 'k' globally (i.e. after the parallel section is over) remains 4.
From the printed values too, we can see that thread 1 executed last. Even if we assume that the prints are not reliable for getting the exact running sequence of the threads, it seems far from obvious that thread 4 always runs last, thereby leaving its value in 'k'.
I would appreciate the help regarding this problem. Thanks.
To be sure which thread executes last, you should print the value of the iteration index as well (and not only the thread id):
#include<stdio.h>
#include<omp.h>
int main() {
int kk;
#pragma omp parallel
{
#pragma omp for schedule(runtime) lastprivate(kk)
for(int ii=0; ii < 1000; ii++ ) {
kk = omp_get_thread_num();
printf("ii = %d, kk = %d\n",ii,kk);
}
}
printf("kk = %d\n", kk);
return 0;
}
If you run this program you will notice that the thread that executes iteration 999 sets the value of kk.
Regarding this sentence (emphasis mine):
When we have a team of threads working in a 'for', we cannot really guarantee which thread executes last.
What you say is generally true, but with one exception (section 2.5. of the OpenMP 3.1 standard):
Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule...
Now, as you didn't specify any schedule, the following rule holds:
If the loop directive does not have a schedule clause then the current value of the def-sched-var ICV determines the schedule.
If def-sched-var determines a schedule(static) (which, in my experience, is very often the case), then the final print of your program will always be k = 4.
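To see why, you can spell the schedule out in your example: with schedule(static), 5 iterations and 5 threads, iteration 4 is always handed to thread 4, so the lastprivate copy of k is always 4. A sketch, assuming 5 threads are actually granted:
#pragma omp parallel private(i) num_threads(5)
{
#pragma omp for schedule(static) lastprivate(k)
for(i=0; i< 5; i++ )
{
k = omp_get_thread_num(); //under static, iteration 4 runs on thread 4
}
}
printf("\n k = %d", k); //always prints 4 here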
You are confusing two different ideas of "last".
The standard says "when a lastprivate clause appears on the directive that identifies a worksharing construct, the value of each new list item from the sequentially last iteration of the associated loops ... is assigned to the original list item".
That says nothing about the order in which things execute, whereas you are assuming that "last" means the temporally last thread to execute.
So with static loop scheduling, which guarantees that the highest-numbered thread executes the last loop iteration, the value saved will always be the one from the highest-numbered thread; it has nothing to do with the specific (random) order in which the threads happened to execute.