How to measure execution time of each thread in OpenMP? - c

I'd like to measure the time that each thread spends on a chunk of code, to see whether my load-balancing strategy divides the chunks equally among workers.
Typically, my code looks like the following:
#pragma omp parallel for schedule(dynamic,chunk) private(i)
for (i = 0; i < n; i++) {
    // loop code here
}
UPDATE
I am using OpenMP 3.1 with GCC.

You can just print the per-thread time this way (not tested, not even compiled):
#pragma omp parallel
{
    double wtime = omp_get_wtime();
    #pragma omp for schedule(dynamic, 1) nowait
    for (int i = 0; i < n; i++) {
        // whatever
    }
    wtime = omp_get_wtime() - wtime;
    printf("Time taken by thread %d is %f\n", omp_get_thread_num(), wtime);
}
NB the nowait removes the implicit barrier at the end of the for loop; without it, every thread would wait at the barrier for the slowest one and they would all report roughly the same time, so the measurement would be of no interest.
And of course, using a proper profiling tool is a way better approach...
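If you would rather compare the times after the region instead of interleaving printf output, you can store them in an array indexed by thread number. A minimal, untested sketch along the same lines (the loop body, chunk size, and array bound are placeholders):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000, MAX_THREADS = 256 };
    static double times[MAX_THREADS];       /* one slot per thread id */
    double sum = 0.0;

    #pragma omp parallel
    {
        double t = omp_get_wtime();
        /* nowait: don't let the barrier hide the imbalance */
        #pragma omp for schedule(dynamic, 1000) reduction(+:sum) nowait
        for (int i = 0; i < N; i++)
            sum += i * 0.5;                 /* stand-in for the real loop body */
        times[omp_get_thread_num()] = omp_get_wtime() - t;
    }

    /* the implicit barrier at the end of the parallel region makes
       both times[] and the reduced sum valid here */
    for (int t = 0; t < omp_get_max_threads(); t++)
        printf("thread %d: %f s\n", t, times[t]);
    printf("checksum: %f\n", sum);
    return 0;
}
Each thread writes its slot of times[] only once, at the end, so the array itself is not a false-sharing concern.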

Related

Using omp_get_num_threads() inside the parallel section

I would like to set the number of threads in OpenMP. When I use
omp_set_num_threads(2);
printf("nthread = %d\n", omp_get_num_threads());
#pragma omp parallel for
...
I see nthread = 1. As explained here, the reported number of threads belongs to the serial section, which is always 1. However, when I move the call to the line after the #pragma, I get a compilation error saying that a for statement is expected after the #pragma. So, how can I fix that?
Well, yeah, omp parallel for expects a loop on the next line, so you can only call omp_get_num_threads() inside that loop. Outside a parallel section, you can call omp_get_max_threads() for the maximum number of threads that would be spawned. That is what you are looking for.
int max_threads = omp_get_max_threads();
#pragma omp parallel for
for (...) {
    int current_threads = omp_get_num_threads();
    assert(current_threads == max_threads); /* needs <assert.h>; may fail if dynamic thread adjustment is enabled */
}

#pragma omp parallel
{
    int current_threads = omp_get_num_threads();
    #pragma omp for
    for (...) {
        ...
    }
}
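To make the difference concrete, here is a small self-contained sketch (untested) that prints both values from the serial part and from inside a parallel region; with dynamic adjustment off, it should report 1, 2 and 2:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(2);

    /* in the serial part, the current team has exactly one thread */
    printf("serial:   omp_get_num_threads() = %d\n", omp_get_num_threads());
    printf("serial:   omp_get_max_threads() = %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* inside the region, the team size is reported */
        #pragma omp single
        printf("parallel: omp_get_num_threads() = %d\n",
               omp_get_num_threads());
    }
    return 0;
}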

OpenMP average of an array

I'm trying to learn OpenMP for a program I'm writing. For part of it I'm trying to implement a function to find the average of a large array. Here is my code:
double mean(double* mean_array){
    double mean = 0;
    omp_set_num_threads(4);
    #pragma omp parallel for reduction(+:mean)
    for (int i = 0; i < aSize; i++){
        mean = mean + mean_array[i];
    }
    printf("hello %d\n", omp_get_thread_num());
    mean = mean / aSize;
    return mean;
}
However if I run the code it runs slower than the sequential version. Also for the print statement I get:
hello 0
hello 0
That doesn't make much sense to me; shouldn't there be 4 hellos?
Any help would be appreciated.
First, the reason you are not seeing 4 "hello"s is that the only part of the program executed in parallel is the so-called parallel region enclosed in the #pragma omp parallel construct. In your code, that is the loop (the omp parallel directive is attached to the for statement); the printf is in the sequential part of the program.
Rewriting the code as follows would do the trick:
double mean = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp for reduction(+:mean)
    for (int i = 0; i < aSize; i++) {
        mean += mean_array[i];
    }
    /* let one thread do the final division; doing it on every thread
       would divide the shared sum by aSize several times over */
    #pragma omp single
    mean /= aSize;
    printf("hello %d\n", omp_get_thread_num());
}
Second, the fact that your program runs slower than the sequential version can depend on multiple factors. First of all, you need to make sure the array is large enough that the overhead of creating the threads (which usually happens when the parallel region is entered) is negligible. Also, for small arrays you may be running into false sharing, where threads compete for the same cache line, causing performance degradation.
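As an illustration of the false-sharing point (this is not from the question's code; the reduction clause above already avoids the problem by giving each thread a private copy), consider the manual alternative where each thread accumulates into its own slot of a shared array. Neighbouring slots share a cache line unless they are padded; the sizes here are illustrative:
#include <omp.h>

enum { NTHREADS = 4, CACHE_LINE = 64 };   /* 64 bytes is a common line size */

/* Bad: the four counters share one cache line, so every update by one
 * thread invalidates the line for the other three (false sharing). */
double partial[NTHREADS];

/* Better: pad each slot out to its own cache line. */
struct padded { double val; char pad[CACHE_LINE - sizeof(double)]; };
struct padded partial_padded[NTHREADS];

double mean_manual(const double *a, int n)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial_padded[t].val += a[i];  /* each thread stays on its own line */
    }
    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        sum += partial_padded[t].val;
    return sum / n;
}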

OpenMP: increase for loop iteration number

I have this parallel for loop
struct p
{
    int n;
    double *l;
};
#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
    DoSomething(p, i);
}
Now, it is possible that inside DoSomething(), p.n is increased because new elements are added to p.l. I'd like to process these elements in parallel as well. The OpenMP manual states that parallel for can't be used with lists, so DoSomething() adds p.l's new elements to another list that is processed sequentially and then joined back with p.l. I don't like this workaround. Does anyone know a cleaner way to do this?
A construct supporting dynamic execution, the task construct, was added in OpenMP 3.0. Tasks are added to a queue and then executed as concurrently as possible. Sample code would look like this:
#pragma omp parallel private(i)
{
    #pragma omp single
    for (i = 0; i < p.n; ++i)
    {
        #pragma omp task
        DoSomething(p, i);
    }
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i; each DoSomething() call is turned into a task that will later execute on an idle thread. There is a problem, though: if one of the tasks adds new values to p.l, it might do so after the creator thread has already exited the for loop. This can be fixed using task synchronisation constructs and an outer loop, like this:
#pragma omp single
{
    i = 0;
    while (i < p.n)
    {
        for (; i < p.n; ++i)
        {
            #pragma omp task
            DoSomething(p, i);
        }
        #pragma omp taskwait
        #pragma omp flush
    }
}
The taskwait construct makes the thread wait until all queued tasks have executed. If new elements were added to the list, the condition of the while becomes true again and a new round of task creation happens. The flush construct synchronises the memory view between threads, e.g. it updates register-cached copies of variables with the values from shared storage.
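Putting the pieces together, a self-contained version might look like the following untested sketch. The struct layout, the fixed capacity, and the growth rule inside DoSomething() are all illustrative assumptions:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the question's struct: a growable array
 * with a fixed capacity so the sketch stays short. */
struct p { int n; int cap; double *l; };

void DoSomething(struct p *p, int i)
{
    /* Illustrative growth rule: large elements append a smaller one.
     * The append is serialised: several tasks may append at once. */
    #pragma omp critical
    {
        if (p->l[i] > 0.5 && p->n < p->cap) {
            p->l[p->n] = p->l[i] / 2.0;
            p->n++;
        }
    }
}

int main(void)
{
    struct p p = { 100, 1000, malloc(1000 * sizeof(double)) };
    for (int i = 0; i < p.n; i++)
        p.l[i] = rand() / (double)RAND_MAX;

    #pragma omp parallel
    #pragma omp single
    {
        int i = 0;
        while (i < p.n) {
            int end = p.n;            /* snapshot: tasks may grow p.n meanwhile */
            for (; i < end; ++i) {
                #pragma omp task firstprivate(i)
                DoSomething(&p, i);
            }
            #pragma omp taskwait      /* wait for this round's tasks */
            #pragma omp flush         /* then re-read p.n safely */
        }
    }
    printf("processed %d elements\n", p.n);
    free(p.l);
    return 0;
}
Snapshotting p.n into end means the creator never reads it while tasks may still be appending; after the taskwait it can re-read it for the next round.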
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.

Can the producer/consumer (bounded buffer) be sped up through OpenMP?

I'm implementing the producer/consumer problem for homework, and I have to compare the sequential algorithm with the parallel one. My parallel version seems to run only at the same speed as the sequential one, or slower. I've come to the conclusion that using a queue is a limiting factor and it won't speed up my algorithm.
Is this the case or am I just coding it wrong?
int main() {
    long sum = 0;
    unsigned long serial = ::GetTickCount();
    for (int i = 0; i < test; i++) {
        enqueue(rand() % 54354);
        sum += dequeue();
    }
    printf("%d \n", sum);
    serial = (::GetTickCount() - serial);
    printf("Serial Program took: %f seconds\n", serial * .001);

    sum = 0;
    unsigned long omp = ::GetTickCount();
    #pragma omp parallel for num_threads(128) default(shared)
    for (int i = 0; i < test; i++) {
        enqueue(rand() % 54354);
        sum += dequeue();
    }
    #pragma omp barrier // joins all threads
    omp = (::GetTickCount() - omp);
    printf("%d \n", sum);
    printf("OpenMP Program took: %f seconds\n", omp * .001);
    getchar();
}
Problem #1:
You have rand() inside the parallel region.
rand() is not thread-safe. It uses global/static variables. So calling it concurrently from multiple threads will lead to unexpected (possibly undefined) behavior.
That aside, the data-races resulting from concurrent calls to rand() will lead to a lot of cache coherency stalls. This is likely the source of the slowdown.
Problem #2:
Are enqueue() and dequeue() thread-safe?
If it isn't, then you need to fix that first. If it is, how are you synchronizing it?
If it's just a critical region that allows only one thread at a time to access the queue, then that kind of defeats the whole purpose of parallelism.
Problem #3:
This line modifies the sum variable in each iteration:
sum += dequeue();
Note that all the threads will be doing this concurrently. So you need to declare sum as a reduction variable.
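Assuming enqueue() and dequeue() are made properly thread-safe, the loop itself could then be written like this untested sketch. rand_r() is POSIX; on Windows, a hand-rolled per-thread generator would play the same role:
#include <omp.h>
#include <stdlib.h>             /* rand_r (POSIX) */

extern void enqueue(int v);     /* assumed made thread-safe */
extern int  dequeue(void);      /* assumed made thread-safe */

long run_parallel(int test)
{
    long sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        /* one PRNG state per thread: no shared state, no coherency traffic */
        unsigned int seed = 42u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < test; i++) {
            enqueue(rand_r(&seed) % 54354);
            sum += dequeue();
        }
    }
    return sum;   /* the reduction has combined the per-thread sums */
}
Note that the num_threads(128) from the question is dropped: oversubscribing with far more threads than cores usually hurts rather than helps.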

Using OpenMP to calculate the value of PI

I'm trying to learn how to use OpenMP by parallelizing a monte carlo code that calculates the value of PI with a given number of iterations. The meat of the code is this:
int chunk = CHUNKSIZE;
count = 0;
#pragma omp parallel shared(chunk,count) private(i)
{
    #pragma omp for schedule(dynamic,chunk)
    for (i = 0; i < niter; i++) {
        x = (double)rand() / RAND_MAX;
        y = (double)rand() / RAND_MAX;
        z = x*x + y*y;
        if (z <= 1) count++;
    }
}
pi = (double)count / niter * 4;
printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
However, this is not yielding the proper value for pi given 10,000 iterations. With all the OpenMP stuff taken out, it works fine. I should mention that I used the Monte Carlo code from here: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html
I'm just using it to try to learn OpenMP. Any ideas why it's converging on 1.4-ish? Can I not increment a variable from multiple threads? I'm guessing the problem is the count variable.
Thanks!
Okay, I found the answer. I needed to use the REDUCTION clause. So all I had to modify was:
#pragma omp parallel shared(chunk,count) private(i)
to:
#pragma omp parallel shared(chunk) private(i,x,y,z) reduction(+:count)
Now it's converging at 3.14...yay
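For reference, a complete version of the fixed loop might look like the untested sketch below. Declaring x and y inside the loop body makes them private automatically, and swapping rand() for POSIX rand_r() with a per-thread seed (an addition beyond the original fix) removes the thread-safety problem discussed in the previous question:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int niter = 10000;
    const int chunk = 100;
    int count = 0;

    #pragma omp parallel reduction(+:count)
    {
        /* per-thread seed; rand() itself is not thread-safe */
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
        #pragma omp for schedule(dynamic, chunk)
        for (int i = 0; i < niter; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                count++;   /* private copy; combined by the reduction */
        }
    }

    double pi = 4.0 * count / niter;
    printf("# of trials = %d, estimate of pi is %g\n", niter, pi);
    return 0;
}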
