I am playing around with the omp task construct and have encountered a problem. I have the following code:
void function1(int n, int j)
{
//Serial part
#pragma omp parallel
{
#pragma omp single nowait
{
//execute function1() for every n
for (int i = 0; i < n; i++)
{
//create a task for computing only large numbers of n
if (i <= 10)
//execute serial
else
#pragma omp task
//call function again
function1(n, j+1);
printf("task id:\n", omp_get_thread_num());
}
}
}
}
Right now the code produces the correct result, but the performance is much slower than the original serial version. After some investigation I found that all the tasks are executed by thread 0, even though there are 4 threads running in total. Does anyone know what's going on here? Thanks in advance!
The omp single nowait pragma means that the following block is to be executed by a single thread. In this case that means that your entire loop is executed by one thread. This should resolve your issue:
void function1(int n, int j)
{
//Serial part
#pragma omp parallel
{
//execute function1() for every n
for (int i = 0; i < n; i++) //Notice this is outside the pragma
{
#pragma omp single nowait //Notice this is inside the loop
{
//create a task for computing only large numbers of n
if (i <= 10)
//execute serial
else
#pragma omp task
//call function again
function1(n, j+1);
printf("task id: %d\n", omp_get_thread_num());
}
}
}
}
I want to have 2 thread groups run at the same time. For example, 2 threads are executing code block 1 while another 2 threads are executing another code segment. There is an existing Stack Overflow question, OpenMP: Divide all the threads into different groups, and I changed its code to see if it suits the logic I need in my own code.
I have the below code with me.
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 1
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS;
int nthreads2 = NUM_THREADS;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(nthreads1, nthreads2, t1) num_threads(2)
#pragma omp single
{
#pragma omp task // section 1
#pragma omp parallel for num_threads(nthreads1) shared(t1)
for (int i=0; i<nthreads1; i++) {
printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), i);
t1[i] = 1;
}
#pragma omp task // section 2
#pragma omp parallel for num_threads(nthreads2) shared(t1)
for (int j=0; j<nthreads2; j++) {
while (!t1[j]) {
printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), j);
}
}
}
return 0;
}
To check whether my code runs 2 thread groups at once, I set the thread count in each group to 1 and keep a boolean array that is initialized to 0.
In the first code segment I set the boolean value to 1, and in the 2nd code segment I check the boolean value to break out of the while loop. It seems the above code is run by only 1 thread: once that thread starts running the 2nd code block/section, it gets stuck inside the while loop because no other thread ever sets the boolean value to 1.
How to run 2 thread groups in parallel?
UPDATE: My use case: I am writing a word-count map-reduce program using OpenMP. I want one thread group to read files, adding the lines it reads to a queue. I want another thread group to process lines from those queues and update the counts in a chained hash table. I have already written the code to first do all the reading to build the queues and then do the mapping to take data from the queues and generate the word counts -- but I want to change my program so that the 2 thread groups do the reading and the mapping in parallel, at the same time. That's why I wrote this short code sample to check how I can implement 2 thread groups running in parallel and executing 2 different code segments.
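To make the target concrete, here is a rough sketch of the shape I'm after: a minimal reader/mapper split using sections and a critical-protected queue. The file name, line length, and fixed queue capacity are placeholders, and a real version would need a growable queue and backpressure; to grow either group beyond one thread, a parallel region could be nested inside its section.
#include <stdio.h>
#include <string.h>
#include <omp.h>
#define QUEUE_CAP 1024
#define LINE_LEN 256
/* Illustrative shared state; names and capacity are placeholders. */
char queue[QUEUE_CAP][LINE_LEN];
int head = 0, tail = 0;
int done = 0;
int main(void) {
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section /* reader group (one thread here) */
        {
            char line[LINE_LEN];
            FILE *fp = fopen("input.txt", "r"); /* file name is a placeholder */
            while (fp && fgets(line, sizeof line, fp)) {
                #pragma omp critical(queue_lock)
                {
                    strcpy(queue[tail % QUEUE_CAP], line);
                    tail++;
                }
            }
            if (fp) fclose(fp);
            #pragma omp atomic write
            done = 1; /* signal the mapper that reading is finished */
        }
        #pragma omp section /* mapper group (one thread here) */
        {
            for (;;) {
                char line[LINE_LEN];
                int have = 0;
                #pragma omp critical(queue_lock)
                {
                    if (head < tail) {
                        strcpy(line, queue[head % QUEUE_CAP]);
                        head++;
                        have = 1;
                    }
                }
                if (have) {
                    printf("mapped: %s", line); /* word counting would go here */
                } else {
                    int finished;
                    #pragma omp atomic read
                    finished = done;
                    /* done is set after the last push, so re-check the queue
                       once more before deciding there is nothing left */
                    if (finished) {
                        int empty;
                        #pragma omp critical(queue_lock)
                        empty = (head == tail);
                        if (empty) break;
                    }
                }
            }
        }
    }
    return 0;
}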
It seems like the above can be solved using single directives with nowait together with task directives. The approach below puts the tasks onto a queue, and threads then pick up work from the queue. Ideally, the 2 thread groups will then be working on 2 different tasks, which is what the question requires. Below is the code:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 1
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS;
int nthreads2 = NUM_THREADS;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(nthreads1, nthreads2, t1)
{
#pragma omp single nowait // section 1
for (int i=0; i<nthreads1; i++) {
#pragma omp task
{
printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), i);
t1[i] = 1;
}
}
#pragma omp single nowait // section 2
for (int j=0; j<nthreads2; j++) {
#pragma omp task
{
while (!t1[j]) {
printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), j);
}
}
}
}
return 0;
}
Also, you can simply use an if-else statement inside the #pragma omp parallel construct to run 2 thread groups in parallel:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS/2;
int nthreads2 = NUM_THREADS/2;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(t1, nthreads1) num_threads(NUM_THREADS)
{
int i = omp_get_thread_num(); // section 1
if (i<nthreads1) {
printf("Section 1: thread %d\n",i);
t1[i] = 1;
} else {
int j = i - nthreads1;
while (!t1[j]) {
printf("Section 2: thread %d, shift_value %d\n", i, j);
}
}
}
return 0;
}
I would like to run the code below. I want to spawn two independent threads, each of which would run a parallel for loop. Unfortunately, I get an error: apparently, a parallel for cannot be spawned inside a section. How can I solve that?
#include <omp.h>
#include "stdio.h"
int main()
{
omp_set_num_threads(10);
#pragma omp parallel
#pragma omp sections
{
#pragma omp section
#pragma omp for
for(int i=0; i<5; i++) {
printf("x %d\n", i);
}
#pragma omp section
#pragma omp for
for(int i=0; i<5; i++) {
printf(". %d\n", i);
}
} // end parallel and end sections
}
And the error:
main.cpp: In function ‘int main()’:
main.cpp:14:9: warning: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region [enabled by default]
main.cpp:20:9: warning: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region [enabled by default]
Here you have to use nested parallelism. The problem with the omp for in the sections is that all the threads in scope have to take part in the omp for, and they clearly don't -- they're broken up by the sections. So you have to introduce functions and do the nested parallelism within those functions.
#include <stdio.h>
#include <omp.h>
void doTask1(const int gtid) {
omp_set_num_threads(5);
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i=0; i<5; i++) {
printf("x %d %d %d\n", i, tid, gtid);
}
}
}
void doTask2(const int gtid) {
omp_set_num_threads(5);
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i=0; i<5; i++) {
printf(". %d %d %d\n", i, tid, gtid);
}
}
}
int main()
{
omp_set_num_threads(2);
omp_set_nested(1);
#pragma omp parallel
{
int gtid = omp_get_thread_num();
#pragma omp sections
{
#pragma omp section
doTask1(gtid);
#pragma omp section
doTask2(gtid);
} // end parallel and end sections
}
}
By default, OpenMP does not run parallel regions inside parallel regions. This is because OpenMP creates num_threads threads at the beginning of the program; in non-parallel regions the extra threads are unused and sleep. This design was chosen because frequently spawning new threads is quite slow compared to waking sleeping threads.
Therefore you should parallelize only the loops:
#include <omp.h>
#include "stdio.h"
int main()
{
omp_set_num_threads(10);
#pragma omp parallel for
for(int i=0; i<5; i++) {
printf("x %d\n", i);
}
#pragma omp parallel for
for(int i=0; i<5; i++) {
printf(". %d\n", i);
}
}
Practically, the optimal number of threads is equal to the number of available CPU cores. So every parallel for should be handled by all available cores, which is impossible inside omp sections. Thus, what you are trying to achieve is not optimal. tune2fs' suggestion to execute the two loops without sections makes sense and gives the best possible performance. You can execute parallel loops inside other functions, but this "cheating" doesn't give a performance boost.
As I'm unable to use omp tasks (I'm using Visual Studio 2015), I'm trying to find a workaround for a nested loop task. The code is as follows:
#pragma omp parallel
{
for (i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp for
for (n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
#pragma omp barrier
}
I want all my threads to execute the loop largeNum times, but to wait until results has been zeroed by memset, and then I want largeFunc to be performed by each thread. There are no data dependencies that I have found.
I've got the omp directives all jumbled in my head at this point. Does this solution work? Is there a better way to do this without tasks?
Thanks!
What about just this code?
#pragma omp parallel private( i, n )
for ( i = 0; i < largeNum; i++ ) {
#pragma omp for
for ( n = 0; n < num; n++ ) {
results[n] = 0;
largeFunc( params[n], &results[n] );
}
}
As far as I understand your problem, the initialisation part should be taken care of without the need for the single directive, provided the actual type of results supports assignment to 0. Moreover, your initial code was lacking the private( i ) declaration. Finally, the barrier shouldn't be needed.
Why do you want all your threads to execute the loop largeNum times? Do you depend on the index i inside largeFunc in some way? If yes:
#pragma omp parallel for
for (int i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp barrier
// #pragma omp for -- not needed, since parallelism should be coarse at the outermost level. If the function below has nothing to do with the outer loop, see the next example.
for (int n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
}
If you do not depend on i then
for (i = 0; i < largeNum; i++)
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
#pragma omp parallel for
for (int n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
However, I feel you want the first one. In general, you parallelise at the outermost level. Placing pragmas on the inner loop will slow your code down due to overheads if there is not enough work to be done.
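As a generic illustration of that point (work() here is a hypothetical stand-in, not the question's largeFunc): parallelising the outermost loop forks the thread team once, whereas a pragma on the inner loop pays the fork/join overhead on every outer iteration.
#include <stdio.h>
#include <omp.h>
/* hypothetical stand-in for the real per-element work */
static double work(int i, int n) { return (double)i * n; }
int main(void) {
    enum { LARGE = 1000, NUM = 8 };
    double sum1 = 0.0, sum2 = 0.0;
    /* Preferred: parallelise the outermost loop; the team is forked
       once and each thread receives a large chunk of work. */
    #pragma omp parallel for reduction(+:sum1)
    for (int i = 0; i < LARGE; i++)
        for (int n = 0; n < NUM; n++)
            sum1 += work(i, n);
    /* Usually slower: a fork/join per outer iteration, for an inner
       loop that is too small to amortise the overhead. */
    for (int i = 0; i < LARGE; i++) {
        #pragma omp parallel for reduction(+:sum2)
        for (int n = 0; n < NUM; n++)
            sum2 += work(i, n);
    }
    printf("%f %f\n", sum1, sum2);
    return 0;
}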
I have a code that reads like this
void h(particles *p) {
#pragma omp parallel for
for (int i = 0; i < maxThreads; ++i) {
int id = omp_get_thread_num();
for (int j = 0; j < dtnum; ++j) {
f( p, id);
if ( j % 50 == 0 ) {
if (id == 0) {
g(p);
}
#pragma omp barrier
}
}
}
}
void f(particles *p, int id) {
for (int i = id * prt_thread; i < (id + 1)*prt_thread; ++i) {
x(p[i]);
}
}
Basically I want to:
1) spawn a given number of threads; each thread will process a chunk of p according to the thread's id
2) each element of p must be processed dtnum times; the processing involves random events
3) every 50 iterations, one thread must perform another operation while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
what can I do?
It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel for loop -- which seems clearer anyway -- and just do:
const int iterblocks=50;
#pragma omp parallel shared(p, dtnum) default(none)
for (int jblock=0; jblock<dtnum/iterblocks; jblock++) {
for (int j=0; j<iterblocks; j++) {
#pragma omp for nowait
for (int i=0; i<prt; i++)
x(p[i]);
}
#pragma omp barrier
#pragma omp single
g(p);
#pragma omp barrier
}
I think your code is wrong. You said:
each element of p must be processed dtnum times.
But each element of p will be processed maxThreads*dtnum times.
Could you be more explicit about what your code is supposed to do?
I've just started looking into OpenMP and have been reading up on tasks. It seems that Sun's example here is actually slower than the sequential version. I believe this has to do with the overhead of task creation and management. Is this correct? If so, is there a way to make the code faster using tasks without changing the algorithm?
int fib(int n)
{
int i, j;
if (n<2)
return n;
else
{
#pragma omp task shared(i) firstprivate(n)
i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
The only sensible way is to cut the parallelism at a certain level, below which it makes no sense as the overhead becomes larger than the work being done. The best way to do it is to have two separate fib implementations - a serial one, let's say fib_ser, and a parallel one, fib - and to switch between them under a given threshold.
int fib_ser(int n)
{
if (n < 2)
return n;
else
return fib_ser(n-1) + fib_ser(n-2);
}
int fib(int n)
{
int i, j;
if (n <= 20)
return fib_ser(n);
else
{
#pragma omp task shared(i)
i = fib(n-1);
#pragma omp task shared(j)
j = fib(n-2);
#pragma omp taskwait
return i+j;
}
}
Here the threshold is n <= 20. It was chosen arbitrarily, and its optimal value will differ between machines and OpenMP runtimes.
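Since the best value is machine-dependent, one option -- a sketch, with FIB_CUTOFF as an invented environment variable name -- is to read the threshold once at startup instead of hard-coding 20:
#include <stdlib.h>
int fib_cutoff = 20; /* default matches the threshold above */
/* Call once, before entering any parallel region, so there is no data
   race on fib_cutoff; then test n <= fib_cutoff instead of n <= 20. */
void init_fib_cutoff(void) {
    const char *s = getenv("FIB_CUTOFF"); /* invented variable name */
    if (s)
        fib_cutoff = atoi(s);
}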
Another option would be to dynamically control the tasking with the if clause:
int fib(int n)
{
int i, j;
if (n<2)
return n;
else
{
#pragma omp task shared(i) if(n > 20)
i=fib(n-1);
#pragma omp task shared(j) if(n > 20)
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
This would switch off the two explicit tasks if n <= 20, and the code would execute in serial, but there would still be some overhead from the OpenMP code transformation, so it would run slower than the previous version with the separate serial implementation.
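A related variant, sketched here and not part of the original answer, uses the final clause (OpenMP 3.1 and later). Unlike if, which only makes the task itself undeferred, final makes every task generated inside a final task an included task, so the cutoff propagates down the entire subtree:
int fib_final(int n)
{
    int i, j;
    if (n < 2)
        return n;
    /* Once n <= 20, this task and all of its descendants become
       included tasks and run immediately, without task bookkeeping. */
    #pragma omp task shared(i) final(n <= 20)
    i = fib_final(n-1);
    #pragma omp task shared(j) final(n <= 20)
    j = fib_final(n-2);
    #pragma omp taskwait
    return i + j;
}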
You are correct: the slowdown is due to the overhead of task creation and management.
Adding if(n > 20) is a great way to prune the task tree, and a further optimization is to not create the second task at all. While the new tasks get spawned, the parent thread does nothing. The code can be sped up by letting the parent task take care of one of the fib calls instead of creating another task:
int fib(int n){
int i, j;
if (n < 2)
return n;
else{
#pragma omp task shared(i) if(n > 20)
i=fib(n-1);
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
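One usage detail the answers above leave implicit: each task-based fib must be called from inside a parallel region, typically under a single construct, so that one thread seeds the task tree while the whole team executes the generated tasks. A minimal driver sketch, meant to be compiled together with one of the fib versions above:
#include <stdio.h>
#include <omp.h>
int fib(int n); /* any of the task-based versions above */
int main(void) {
    int n = 30;
    int result = 0;
    #pragma omp parallel shared(n, result)
    {
        #pragma omp single /* one thread creates the root tasks */
        result = fib(n);
    } /* all tasks are complete when the parallel region ends */
    printf("fib(%d) = %d\n", n, result);
    return 0;
}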