Weird construct in rodinia 3.1 myocyte benchmark - c

I am currently working on a conceptual RISC-V SIMT architecture, and our simulator emulates only the library calls we need. We are trying to run the OpenMP Rodinia 3.1 benchmarks, but since we only have pthread support, I am translating simple, statically scheduled OpenMP code into pthread code.
I found in the myocyte benchmark this kind of construction:
// master.c
void master(params) {
    // declaration of th_id
    int th_id;
    // no initialization of th_id
    #pragma omp parallel private(th_id)
    {
        // code that uses th_id as a "thread id" value
    }
}
// main.c
#pragma omp parallel for
for (i=0; i<N; i++) {
    master(params);
}
As I understand it, the developers count on the #pragma in master.c to initialize the variable th_id, but I couldn't find where this is stated in the OpenMP documentation. Is it fine to assume that th_id is recognized and initialized by OpenMP, or is that wrong?

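This OpenMP benchmark code is totally broken. A private clause gives each thread its own copy of th_id, and that copy is uninitialized on entry to the region; OpenMP never fills in a thread ID for you. There should be something like this at the beginning of the parallel region: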
th_id = omp_get_thread_num();
omp_get_thread_num() returns the ID of the calling thread: an integer between 0 and the number of threads executing the parallel region minus 1, with 0 corresponding to the master thread.
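A minimal sketch of that fix, not the benchmark's actual code: each thread assigns its own private th_id at the top of the region. The function name record_thread_ids, the seen array, and the team size of 4 are hypothetical and exist only to make the effect observable. The #ifdef fallback is an assumption so that the sketch also builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback (assumption): lets the sketch build without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 64

/* Marks seen[id] for every thread id in the team; returns the team size. */
int record_thread_ids(int seen[MAX_THREADS])
{
    int th_id;          /* each thread's private copy starts uninitialized */
    int team_size = 1;  /* shared; written by exactly one thread below */

    #pragma omp parallel private(th_id) num_threads(4)
    {
        th_id = omp_get_thread_num();   /* the assignment the benchmark lacks */
        if (th_id < MAX_THREADS)
            seen[th_id] = 1;

        #pragma omp single
        team_size = omp_get_num_threads();
    }
    return team_size;
}
```

When translating to pthreads, the same value would have to be passed to each thread explicitly, for example through the argument of the thread start routine.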
Again, this code is really broken and seems to have been translated from Fortran. There is an out-of-bounds array access:
int th_count[4];
...
#pragma omp parallel private(th_id)
{
...
if (th_id == th_count[4]) {
...
}
I'd say that you should simply scrap the Myocyte benchmark.

Related

When Worksharing Constructs Inside a critical Construct is useful in OpenMP?

Among the OpenMP examples, the following code appears in section 6.2, Worksharing Constructs Inside a critical Construct:
void critical_work()
{
    int i = 1;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp critical (name)
            {
                #pragma omp parallel
                {
                    #pragma omp single
                    {
                        i++;
                    }
                }
            }
        }
    }
}
Have you ever used this structure? Under what circumstances is it the best option in real life? My only guess is that it can be useful in error handling, what else?
I think this particular example just demonstrates that this kind of code still conforms to the standard.
If the question is just about having a worksharing construct inside a critical construct (inside a worksharing construct), I could roughly imagine hierarchical applications where you generally have two layers of nested OpenMP parallelism, but most work is done outside of the critical region, e.g.:
void mostly_uncritical_work()
{
    #pragma omp parallel
    {
        #pragma omp parallel
        {
            /* main workload */
        }
        #pragma omp critical (name)
        {
            #pragma omp parallel
            {
                /* smaller amount of work but still big enough */
                /* to profit from parallelization */
            }
        }
    }
}
So in the end the question boils down to "Are there applications for nested OpenMP parallelism?" and there my answer would certainly be yes. I use it for example to have a team of 2 threads in the outer team, one of them simulating things on a GPU and the other one analyzing the output of the GPU using an inner team of threads.

How to fork a large number of threads in OpenMP?

For some reason I need to stress my processor, and I want to fork a lot of threads in OpenMP. In pthreads you can easily do it with a for loop, since forking a thread is just a function call. But in OpenMP you have to write something like this:
#pragma omp parallel sections
{
    #pragma omp section
    {
        //section 0
    }
    #pragma omp section
    {
        //section 1
    }
    .... // repeat omp section for n times
}
I am just wondering if there is any easier way to fork a large number of threads in OpenMP?
You don't need to do anything special, almost. Just write code for a compute-intensive task and put it inside a parallel region. Then indicate what number of threads you want. In order to do that, you use omp_set_dynamic(0) to disable dynamic threads (this helps to achieve the number of threads you want, but it still won't be guaranteed), then omp_set_num_threads(NUM_THREADS) to indicate what number of threads you want.
Then each thread will execute its own copy of the code inside the region. Simple as that.
const int NUM_THREADS = 100;
omp_set_dynamic(0);
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
    // How many threads did we really get? Let's write it once only.
    #pragma omp single
    {
        std::cout << "using " << omp_get_num_threads() << " threads." << std::endl;
    }
    // write some compute-intensive code here
    // (be sure to print the result at the end, so that
    // the compiler doesn't throw away useless instructions)
}
To do what you want, you get the thread number and then do different things based on which thread you are.
// it's not guaranteed you will actually get this many threads
omp_set_num_threads( NUM_THREADS );
int actual_num_threads;
#pragma omp parallel
{
    #pragma omp single
    {
        actual_num_threads = omp_get_num_threads();
    }
    int me = omp_get_thread_num();
    if ( me < actual_num_threads / 2 ) {
        section1();
    }
    else {
        section2();
    }
}

How to nest parallel loops in a sequential loop with OpenMP

I am currently working on a matrix computation with OpenMP. I have several loops in my code, and instead of writing #pragma omp parallel for[...] on each loop (which creates all the threads and destroys them right after), I would like to create them all at the beginning and delete them at the end of the program in order to avoid overhead.
I want something like :
#pragma omp parallel
{
#pragma omp for[...]
for(...)
#pragma omp for[...]
for(...)
}
The problem is that some parts have to be executed by only one thread, but inside a loop that contains loops which have to be executed in parallel... This is how it looks:
//has to be executed by only one thread
int a=0,b=0,c=0;
for(a ; a<5 ; a++)
{
    //some stuff
    //loops which have to be parallelized
    #pragma omp parallel for private(b,c) schedule(static) collapse(2)
    for (b=0 ; b<8 ; b++)
        for(c=0 ; c<10 ; c++)
        {
            //some other stuff
        }
    //end of the parallel zone
    //stuff to be executed by only one thread
}
(The loop bounds are quite small in my example. In my program the number of iterations can go up to 20,000...)
One of my first ideas was to do something like this:
//has to be executed by only one thread
#pragma omp parallel //creating all the threads at the beginning
{
    #pragma omp master //or single
    {
        int a=0,b=0,c=0;
        for(a ; a<5 ; a++)
        {
            //some stuff
            //loops which have to be parallelized
            #pragma omp for private(b,c) schedule(static) collapse(2)
            for (b=0 ; b<8 ; b++)
                for(c=0 ; c<10 ; c++)
                {
                    //some other stuff
                }
            //end of the parallel zone
            //stuff to be executed by only one thread
        }
    }
} //deleting all the threads
It doesn't compile, I get this error from gcc: "work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region".
I know it surely comes from the "wrong" nesting, but I can't understand why it doesn't work. Do I need to add a barrier before the parallel zone? I am a bit lost and don't know how to solve it.
Thank you in advance for your help.
Cheers.
Most OpenMP runtimes don't "create all the threads and destroy them right after". The threads are created at the beginning of the first parallel region and destroyed when the program terminates (at least that's how Intel's OpenMP implementation does it). There's no performance advantage from using one big parallel region instead of several smaller ones.
Intel's runtime (which is open source) has options to control what threads do when they run out of work. By default they'll spin for a while (in case the program immediately starts a new parallel region), then they'll put themselves to sleep. If they do sleep, it will take a bit longer to start them up for the next parallel region, but this depends on the time between regions, not the syntax.
In the last of your code outlines you declare a parallel region, inside that use a master directive to ensure that only the master thread executes a block, and inside the master block attempt to parallelise a loop across all threads. You claim to know that the compiler errors arise from incorrect nesting but wonder why it doesn't work.
It doesn't work because distributing work to multiple threads within a region of code which only one thread will execute doesn't make any sense.
Your first pseudo-code is better, but you probably want to extend it like this:
#pragma omp parallel
{
    #pragma omp for[...]
    for(...)
    #pragma omp single
    { ... }
    #pragma omp for[...]
    for(...)
}
The single directive ensures that the block of code it encloses is executed by only one thread. Unlike the master directive, single also implies a barrier at exit; you can change this behaviour with the nowait clause.
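Applied to the loop nest from the question, that pattern might look like the sketch below. It assumes a made-up workload (summing b*c) so the result can be checked; run_steps, step_sum, and total are hypothetical names. Every thread runs the outer loop, the serial parts sit in single blocks, and the implicit barriers after single and for keep the steps in order.

```c
/* Repeats a parallel double loop `outer` times inside ONE parallel region.
 * Returns the sum of b*c over all steps (illustrative workload only). */
long run_steps(int outer, int nb, int nc)
{
    long total = 0;      /* shared: accumulated by one thread per step */
    long step_sum = 0;   /* shared: reduction target for the current step */

    #pragma omp parallel
    {
        /* `a` is declared inside the region, so each thread has its own copy
         * and all threads execute the same number of outer iterations. */
        for (int a = 0; a < outer; a++) {
            /* serial part before the loops; implicit barrier at its end */
            #pragma omp single
            step_sum = 0;

            /* work shared across the existing team; no new threads created */
            #pragma omp for collapse(2) schedule(static) reduction(+:step_sum)
            for (int b = 0; b < nb; b++)
                for (int c = 0; c < nc; c++)
                    step_sum += (long)b * c;

            /* serial part after the loops; the reduction is complete here */
            #pragma omp single
            total += step_sum;
        }
    }
    return total;
}
```

Using master instead of single for the serial parts would need an explicit barrier afterwards, since master has no implied barrier.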

Openmp: increase for loop iteration number

I have this parallel for loop
struct p
{
    int n;
    double *l;
} p;
#pragma omp parallel for default(none) private(i) shared(p)
for (i = 0; i < p.n; ++i)
{
    DoSomething(p, i);
}
Now, it is possible that p.n is increased inside DoSomething() because new elements are added to p.l. I'd like to process these elements in parallel as well. The OpenMP manual states that parallel for can't be used with lists, so DoSomething() adds the new elements to another list, which is processed sequentially and then joined back with p.l. I don't like this workaround. Does anyone know a cleaner way to do this?
The task construct, added in OpenMP 3.0, supports this kind of dynamic execution. Tasks are added to a queue and then executed as concurrently as possible. Sample code would look like this:
#pragma omp parallel private(i)
{
    #pragma omp single
    for (i = 0; i < p.n; ++i)
    {
        #pragma omp task
        DoSomething(p, i);
    }
}
This will spawn a new parallel region. One of the threads will execute the for loop and create a new OpenMP task for each value of i. Each DoSomething() call becomes a task that will later execute on an idle thread. There is a problem, though: if one of the tasks adds new values to p.l, it might happen after the creator thread has already exited the for loop. This can be fixed using task synchronisation constructs and an outer loop like this:
#pragma omp single
{
    i = 0;
    while (i < p.n)
    {
        for (; i < p.n; ++i)
        {
            #pragma omp task
            DoSomething(p, i);
        }
        #pragma omp taskwait
        #pragma omp flush
    }
}
The taskwait construct makes the thread wait until all queued tasks have executed. If new elements were added to the list, the while condition becomes true again and a new round of task creation happens. The flush construct is supposed to synchronise the memory view between threads, e.g. update optimised register variables with the value from shared storage.
OpenMP 3.0 is supported by all modern C compilers except MSVC, which is stuck at OpenMP 2.0.
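A self-contained sketch of the task-and-taskwait loop, under stated assumptions: the worklist type, CAP, process(), and drain() are hypothetical stand-ins for p and DoSomething(), and the rule "an item v > 0 appends v - 1" is invented so the list actually grows. Unlike the outline above, this version appends inside a named critical section and only re-reads the length after taskwait, so the creator thread never reads n while a task may be changing it.

```c
#define CAP 64

typedef struct {
    int n;            /* number of valid entries */
    int items[CAP];
} worklist;

/* Hypothetical stand-in for DoSomething(): an item v > 0 appends v - 1. */
static void process(worklist *w, int i)
{
    int v = w->items[i];
    if (v > 0) {
        /* appends from concurrent tasks must not interleave */
        #pragma omp critical (worklist_append)
        {
            if (w->n < CAP)
                w->items[w->n++] = v - 1;
        }
    }
}

/* Processes every item, including ones added along the way; returns the count. */
int drain(worklist *w)
{
    int processed = 0;

    #pragma omp parallel
    #pragma omp single
    {
        int i = 0;
        int limit = w->n;             /* no tasks are running yet */
        while (i < limit) {
            for (; i < limit; ++i) {
                #pragma omp task firstprivate(i)
                process(w, i);
            }
            #pragma omp taskwait      /* all tasks of this round have finished */
            limit = w->n;             /* safe to re-read: nothing is appending now */
        }
        processed = i;
    }
    return processed;
}
```

Starting from a single item with value 8, each round adds exactly one new item until 0 is reached, so nine items are processed in total.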

How do I ask OpenMP to create threads only once at each run of the program?

I am trying to parallelize a large program that is written by a third-party. I cannot disclose the code, but I will try and give the closest example of what I wish to do.
Consider the code below. As you can see, since the "parallel" directive is INSIDE the while loop, the threads are created and destroyed on each iteration, which is costly.
Note that I cannot move the initializers etc. outside the "while" loop.
--Base code
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                funcPiece0();
            }//omp section
            #pragma omp section
            {
                funcPiece1();
            }//omp section
        }//omp sections
    }//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    #pragma omp parallel
    {
        while(1)
        {
            funcCore();
        }
    }
}
What I seek to do is avoid the per-iteration creation/destruction and do it once at the start/end of the program. I tried many variations on the placement of the "parallel" directive. They all have essentially the same idea as the code below: (ONLY ONE thread creation/destruction per program run)
--What I tried, but it failed with "illegal access" in the initializing functions.
void funcPiece0()
{
    // many lines and branches of code
}
void funcPiece1()
{
    // also many lines and branches of code
}
void funcCore()
{
    funcInitThis();
    funcInitThat();
    //#pragma omp parallel
    // {
    #pragma omp sections
    {
        #pragma omp section
        {
            funcPiece0();
        }//omp section
        #pragma omp section
        {
            funcPiece1();
        }//omp section
    }//omp sections
    // }//omp parallel
}
int main()
{
    funcInitThis();
    funcInitThat();
    while(1)
    {
        funcCore();
    }
}
--
Any help would be highly appreciated!
Thanks!
OpenMP only creates the worker threads at the start. The parallel pragma does not spawn threads. How did you determine that the threads are being spawned?
This can be done! The key is to move the loop inside one single parallel region and make sure that, whatever is used to decide whether to repeat or not, all threads make exactly the same decision. I've used shared variables and a synchronization just before the loop condition is checked.
So this code:
initialize();
while (some_condition) {
    #pragma omp parallel
    {
        some_parallel_work();
    }
}
can be transformed into something like this:
#pragma omp parallel
{
    #pragma omp single
    {
        initialize(); //if initialization cannot be parallelized
    }
    while (some_condition_using_shared_variable) {
        some_parallel_work();
        update_some_condition_using_shared_variable();
        #pragma omp flush
    }
}
The most important thing is to be sure that every thread makes the same decision at the same points in your code.
As a final thought, what one is essentially doing is trading the overhead of creating/destroying threads (every time a #pragma omp parallel region begins/ends) for synchronization overhead in the threads' decision making. I think synchronizing should be faster; however, there are so many parameters at play here that this may not always be the case.
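One way to make sure every thread takes the same decision is to let a single thread compute it and rely on the implicit barriers around the worksharing constructs. The sketch below assumes an invented workload (repeatedly adding 0..n-1 to a running total until a threshold is reached); accumulate_until, done, rounds, and max_rounds are hypothetical names, and max_rounds is only there to guarantee termination.

```c
/* Keeps one team of threads alive across all iterations of the loop.
 * Returns the accumulated total once it reaches `threshold`
 * or after `max_rounds` rounds, whichever comes first. */
long accumulate_until(int n, long threshold, int max_rounds)
{
    long total = 0;   /* shared: result of the parallel work */
    int rounds = 0;   /* shared: modified only inside single */
    int done = 0;     /* shared: the one decision every thread reads */

    #pragma omp parallel
    {
        while (!done) {
            /* parallel work, split across the existing team */
            #pragma omp for reduction(+:total)
            for (int i = 0; i < n; i++)
                total += i;
            /* implicit barrier: total is complete here */

            /* exactly one thread updates the loop condition */
            #pragma omp single
            {
                rounds++;
                done = (total >= threshold) || (rounds >= max_rounds);
            }
            /* implicit barrier: every thread now reads the same `done` */
        }
    }
    return total;
}
```

No thread can overwrite done while another is still testing it, because the next write only happens after the barrier at the end of the following for loop, which every thread reaches after its test.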

Resources