OpenMP Parallel for with scheduling in C

I want to run a parallel, scheduled (e.g. static/dynamic/guided) for loop, where each thread has its own set of variables based on its thread ID. I know that any variable declared within the parallel pragma is private, but I don't want to re-declare the variables in every iteration of the for loop.
In my specific situation, I'm counting how often a pair of generated coordinates lies inside or outside of a circle, to approximate pi.
I'm using erand48(unsigned short seed[3]) to generate these coordinates in each of the threads, and by giving each thread a different set of values for 'seed', I get a greater variety of numbers to use in the approximation (this is also a requirement for the simulation).
long long int global_result = 0;
int tid = omp_get_thread_num();
unsigned short seed[3];
seed[0] = ((tid*tid + 15) * 3) / 7;
seed[1] = ((tid + tid) * 44) / 3 + 2;
seed[2] = tid;
int this_result = 0;
double x, y;
int i;

#pragma omp parallel for num_threads(thread_count) schedule(runtime)
for (i = 0; i < chunksize; i++) {
    x = erand48(seed);
    y = erand48(seed);
    if ((x*x + y*y) >= 1)
        this_result++;
}

#pragma omp critical
global_result += this_result;
This is as best as I can represent what I'm trying to do. I want the values of 'this_result','tid' and 'seed' to have a private scope.

I know that any variable declared within the parallel pragma is
private, but I don't want to re-declare the variables in every
iteration of the for loop.
Separate the #pragma omp parallel for into its two components, #pragma omp parallel and #pragma omp for. Then you can declare the local variables inside the parallel region but outside the loop.
Something like this
long long int global_result = 0;

#pragma omp parallel reduction(+:global_result)
{
    int tid = omp_get_thread_num();
    unsigned short seed[3];              /* erand48() expects unsigned short[3] */
    seed[0] = ((tid*tid + 15) * 3) / 7;
    seed[1] = ((tid + tid) * 44) / 3 + 2;
    seed[2] = tid;
    long long int this_result = 0;

    /* Note: "#pragma omp parallel for" here would be a mistake (it would start
       a nested parallel region); a plain worksharing "#pragma omp for" is what
       is intended. */
    #pragma omp for schedule(runtime)
    for (int i = 0; i < chunksize; i++) {
        double x = erand48(seed);        /* erand48() returns a double in [0, 1) */
        double y = erand48(seed);
        if ((x*x + y*y) >= 1)
            this_result++;
    }

    global_result += this_result;
}
There are better ways to calculate pi, though :-)

You can use the "private" clause in your #pragma directive, like this:
#pragma omp parallel for private(this_result, tid, seed) num_threads(thread_count) schedule(runtime)
If I understood your question correctly, that should do.

Related

Is a function without loop parallelizable?

Considering the code below, can we consider it parallel even if there are no loops?
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int a = 1;
        a = 0;
    }
    return 0;
}
Direct Answer:
Yes, here, the section of your code,
int a = 1;
a = 0;
runs in parallel P times, where P is the number of threads in the team (by default, typically the number of cores on your machine).
For example, on a four-core machine, the following code
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Thread number %d\n", omp_get_thread_num());
    }
    return 0;
}
would output:
Thread number 0
Thread number 1
Thread number 2
Thread number 3
Note that when running in parallel, there is no guarantee on the order of the output, so the output could just as likely be something like:
Thread number 1
Thread number 2
Thread number 0
Thread number 3
Additionally, if you wanted to specify the number of threads used in the parallel region, instead of #pragma omp parallel you could write, #pragma omp parallel num_threads(4).
Further Explanation:
If you are still confused, it may be helpful to better understand the difference between parallel for loops and parallel code regions.
#pragma omp parallel tells the compiler that the following block should be executed by a team of threads, each thread running the block once. It also guarantees that all code within the parallel region has finished executing before execution continues past the region.
In the following (toy) example, the programmer is guaranteed that after the parallel region, the array will have all entries set to zero.
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();

#pragma omp parallel num_threads(P)
{
    /* split the 128 entries into P contiguous blocks, one per thread;
       the last thread picks up any remainder */
    int tid = omp_get_thread_num();
    int chunk = 128 / P;
    int local_start = tid * chunk;
    int local_end = (tid == P - 1) ? 128 : local_start + chunk;
    for (int i = local_start; i < local_end; ++i) {
        arr[i] = 0;
    }
}
// any code from here onward can rely on arr containing all zeros!
Ignoring differences in scheduling, this task could equivalently be accomplished using a parallel for loop as follows:
int *arr = malloc(sizeof(int) * 128);
const int P = omp_get_max_threads();

#pragma omp parallel for num_threads(P)
for (int i = 0; i < 128; ++i) {
    arr[i] = 0;
}
// any code from here onward can rely on arr containing all zeros!
Essentially, #pragma omp parallel enables you to describe regions of code that can execute in parallel - this can be much more flexible than a parallel for loop. In contrast, #pragma omp parallel for should generally be used to parallelize loops with independent iterations.
I can further elaborate on the differences in performance, if you would like.

Execute for loop iterations in openmp in order with dynamic schedule

I'd like to run a for loop in openmp with dynamic schedule.
#pragma omp for schedule(dynamic,chunk) private(i) nowait
for(i=0;i<n;i++){
//loop code here
}
and I'd like to have each thread executing ordered chunks such that
e.g. thread 1 -> iterations 0 to k
thread2 -> iterations k+1->k+chunk
etc..
Static schedule partly does what I want but I'd like to dynamically load balance the iterations.
Nor does the ordered clause do what I want, if I understood correctly what it does.
My question is how to make sure that the chunks assigned are ordered chunks?
I am using openmp 3.1 with gcc
You can implement this yourself without resorting to omp for, which expert OpenMP programmers consider a convenience construct.
The following roughly illustrates what you might do. Please check the arithmetic carefully.
#pragma omp parallel
{
    int me = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int chunk = (n + nt - 1) / nt;   /* divide n by nt, rounding up */
    int start = me * chunk;
    int end = (me + 1) * chunk;
    if (end > n) end = n;
    for (int i = start; i < end; i++) {
        /* do work */
    }
} /* end parallel */
This does not do any dynamic load balancing. You can add that yourself by assigning loop iterations unevenly to threads if you know the cost function a priori. You might also read up on the inspector-executor model.
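As a rough sketch going beyond the answer above (so treat it as an assumption, not part of the answer): dynamic assignment with chunks handed out in strictly increasing order can be emulated with a shared counter updated via an OpenMP 3.1 atomic capture, so whichever thread becomes free first grabs the next chunk in order. Here n and chunk are assumed to be the loop bound and chunk size from the question:
int next = 0;   /* shared counter: start index of the next chunk to hand out */

#pragma omp parallel
{
    for (;;) {
        int start;
        /* atomically grab the next chunk; chunks are therefore claimed
           in increasing order by whichever thread is free first */
        #pragma omp atomic capture
        { start = next; next += chunk; }

        if (start >= n) break;

        int end = start + chunk;
        if (end > n) end = n;
        for (int i = start; i < end; i++) {
            /* do work */
        }
    }
}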

Counting does not work properly in OpenMP

I have the function
void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter;
    int isodd; /* 1 if n is odd, 0 if even */

    #pragma omp parallel for
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;
        n = i;
        omp_set_num_threads(nThreads);
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}
It works as I wish when running serially (i.e. compiling without OpenMP, or commenting out #pragma omp parallel for and omp_set_num_threads(nThreads);). However, the parallel version produces the wrong result, and I think it is because the counter variable needs to be set to zero at the beginning of each loop iteration, and another thread may be working with a counter value that has not been zeroed. But even if I use #pragma omp parallel for private(counter), the problem still occurs. What am I missing?
I compile the program as C89.
Inside your OpenMP parallel region, you are assigning values to the scalar variables counter, n and isodd. They therefore cannot simply be shared, as they are by default; you need to pay extra attention to them.
A quick analysis shows that their values are only meaningful inside the parallel region, and only for the current thread, so it becomes clear that they need to be declared private.
Adding a private(counter, n, isodd) clause to your #pragma omp parallel for directive should fix the issue.
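Applied to the function from the question, the fix might look like the following sketch (omp_set_num_threads() has also been moved out of the loop; equivalently, counter, n and isodd could simply be declared inside the loop body):
void collatz(int startNumber, int endNumber, int* iter, int nThreads)
{
    int i, n, counter, isodd;

    omp_set_num_threads(nThreads);           /* set the thread count before the region */
    #pragma omp parallel for private(counter, n, isodd)
    for (i = startNumber; i <= endNumber; i++)
    {
        counter = 0;                         /* each thread now resets its own copy */
        n = i;
        while (n > 1)
        {
            isodd = n % 2;
            if (isodd)
                n = 3*n + 1;
            else
                n /= 2;
            counter++;
        }
        iter[i - startNumber] = counter;
    }
}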

Partially parallel loops using openmp tasks

Prerequisites:
parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
parallel constructs: OpenMP tasks
compiler: gcc 4.9.x (supports OpenMP 4.0)
Input:
C code with loops
the loop has a cross-iteration data dependency: iteration "i+1" needs data from iteration "i" (only this kind of dependency, nothing else)
the loop body can be partially dependent
the loop cannot be split into two loops; the loop body should remain intact
anything reasonable can be added to the loop or to the loop body function definition
Code sample:
(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);          // independent, does not change "config"
    *value = process(conf, *value);  // dependent, takes prev., produce next
    return;
}

int main()
{
    int N = 100;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}
Need to:
parallelise loop using omp tasks (omp for / omp sections cannot be used)
“prepare” functions should be executed in parallel with other “prepare” or “process” functions
“process” functions should be ordered according to data dependency
What has been proposed and implemented:
define an integer flag
assign to it the number of the first iteration
each iteration, when it needs data, waits for the flag to be equal to its own iteration number
update the flag value when the data for the next iteration is ready
Like this:
(As a reminder, the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)
void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config);          // independent, does not change "config"

    while (*parSync != iteration)    // wait for previous to be ready
    {
        #pragma omp taskyield
    }

    *value = process(conf, *value);  // dependent, takes prev., produce next
    *parSync = iteration + 1;        // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++)
    {
        #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
        loopFunc(configData, &valueData, &parallelSync, i);
    }
    #pragma omp taskwait
    …
}
What happened:
It fails. :)
The reason is that an OpenMP task occupies an OpenMP thread.
For example, if we define 5 OpenMP threads (as in the code above), the "for" loop generates 100 tasks.
The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.
If the task with i = 0 is not among the started tasks (which happens from time to time), the started tasks wait forever, occupy their threads forever, and the task with i = 0 is never started.
What's next?
I have no other ideas for how to implement the required mode of computation.
Current solution
Thanks to @parallelgeek below for the idea.
int main()
{
    int N = 10;
    char* configData;   // never changes
    int valueData = 0;  // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads() - 1; // reserve 1 thread for task generation
        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }
            #pragma omp atomic update
            workingTasks++;

            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);

                #pragma omp atomic update
                workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
AFAIK volatile does not prevent hardware reordering, which is why you could end up with a mess in memory: the data may not have been written yet while the flag is already seen by the consuming thread as set.
That is why a little piece of advice: use C11 atomics instead, in order to ensure visibility of the data. As far as I can see, GCC 4.9 supports C11 (see "C11 Status in GCC").
You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and only start generating a subsequent task (after the tasks in the first group have been generated) once one of the running tasks has finished. That way you maintain the invariant that at any time only K tasks are running, scheduled on K threads.
Intertask dependencies could also be met by using atomic flags from C11.
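As a minimal sketch of that advice (my assumption of how it could look, not code from the thread): loopFunc from the question could take a C11 atomic_int from <stdatomic.h> instead of a volatile int, with an acquire load on the waiting side and a release store on the publishing side so the value written by the previous iteration is guaranteed to be visible:
#include <stdatomic.h>

void loopFunc(const char* config, int* value, atomic_int* parSync, int iteration)
{
    int conf = prepare(config);          /* independent part, runs in parallel */

    /* wait until the previous iteration has published its result */
    while (atomic_load_explicit(parSync, memory_order_acquire) != iteration)
    {
        #pragma omp taskyield
    }

    *value = process(conf, *value);      /* dependent part */

    /* publish readiness for the next iteration */
    atomic_store_explicit(parSync, iteration + 1, memory_order_release);
}
In main, parallelSync would then be declared as atomic_int parallelSync = 0; and passed as before.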

Multiple pragma directives on a for loop (C and VS 2013)

I'm trying to use OpenMP to split a for loop computation to multiple threads. Additionally, I'm trying to instruct the compiler to vectorize each chunk assigned to each thread. The code is the following:
#pragma omp for private(i)
__pragma(loop(ivdep))
for (i = 0; i < 4096; i++)
vC[i] = vA[i] + SCALAR * vB[i];
The problem is that both pragmas expect the for loop right after.
Is there any smart construct to make this work?
Some might argue that due to the for loop splitting with OpenMP, the vectorization of the loop won't work. However, I read that #pragma omp for divides the loop into a number of contiguous chunks equal to the thread count. Is that true?
What about using #pragma omp for simd private(i) instead of the pragma + __pragma() ?
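As a quick sketch of that suggestion (assuming a compiler with OpenMP 4.0 support, which VS 2013 does not have, hence the edits below; vA, vB, vC and SCALAR as in the question):
#pragma omp parallel
{
    /* OpenMP 4.0: distribute iterations across threads and vectorize each chunk */
    #pragma omp for simd
    for (int i = 0; i < 4096; i++)
        vC[i] = vA[i] + SCALAR * vB[i];
}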
Edit: since OpenMP 4 doesn't seem to be an option for you, you can manually split your loop to get rid of the #pragma omp for by just computing the index limits by hand using omp_get_num_threads() and omp_get_thread_num(), and then keep the ivdep for the per-thread loop.
Edit 2: since I'm a nice guy, and since this is boilerplate (more common when programming in MPI, but still) that is quite annoying to get right the first time you do it, here is a possible solution:
#pragma omp parallel
{
    int n = 4096;
    int tid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    int chunk = n / nth;
    /* min() is assumed to be available, e.g. #define min(a,b) ((a)<(b)?(a):(b)) */
    int beg = tid * chunk + min(tid, n % nth);
    int end = (tid + 1) * chunk + min(tid + 1, n % nth);
    #pragma loop(ivdep)   /* MSVC spelling of the ivdep hint */
    for (int i = beg; i < end; i++) {
        vC[i] = vA[i] + SCALAR * vB[i];
    }
}
