OpenMP - for loop thread assignment - c

Suppose I have an array with indices 0..n-1. Is there a way to choose which cells each thread would handle? E.g. thread 0 would handle cells 0 and 5, thread 1 would handle cells 1 and 6, and so on.

Have you looked at the schedule clause for the parallel for?
#pragma omp for schedule(static, 1)
should implement what you want. You can experiment with the schedule clause using the following simple code:
#include <stdio.h>
#include <omp.h>

int main() {
    int i;
    #pragma omp parallel for schedule(static, 1)
    for (i = 0; i < 10; ++i) {
        int th_id = omp_get_thread_num(); /* local, so each thread has its own copy */
        printf("Thread %d is on %d\n", th_id, i);
    }
    return 0;
}

You can even be more explicit:
#pragma omp parallel
{
    int nth = omp_get_num_threads();
    int ith = omp_get_thread_num();

    for (int i = ith; i < n; i += nth)
    {
        // handle cell i.
    }
}
This should do exactly what you want: thread ith handles cells ith, ith+nth, ith+2*nth, ith+3*nth, and so on.

Related

OpenMP: Having threads execute a for loop in order

I'd like to run something like the following:
for (int index = 0; index < num; index++)
I want to run the for loop with four threads, with the iterations executed in the order 0,1,2,3,4,5,6,7,8, etc.
That is, for the threads to be working on index = n, (n+1), (n+2), (n+3) (in any particular ordering, but always in this pattern), I want the iterations index = 0,1,2,...,(n-1) to already be finished.
Is there a way to do this? ordered doesn't really work here, as making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work because I don't want a thread to be working on a contiguous block of iterations k..k+num/4.
Thanks for any help!
You can do this with, not a parallel for loop, but a parallel region that manages its own loop inside, plus a barrier to make sure all running threads have hit the same point in it before being able to continue. (Note that this relies on num being a multiple of the thread count, as it is below; otherwise some threads exit the loop while the rest wait at the barrier.) Example:
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

int main()
{
    atomic_int chunk = 0;
    int num = 12;
    int nthreads = 4;

    omp_set_num_threads(nthreads);
    #pragma omp parallel shared(chunk, num, nthreads)
    {
        for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
            printf("In index %d\n", index);
            fflush(stdout);
            #pragma omp barrier

            // For illustrative purposes only; not needed in real code
            #pragma omp single
            {
                puts("After barrier");
                fflush(stdout);
            }
        }
    }
    puts("Done");
    return 0;
}
One possible output:
$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done
I'm not sure I understand your request correctly. If I try to summarize how I interpret it, that would be something like: "I want 4 threads sharing the iterations of a loop, with always the 4 threads running at most on 4 consecutive iterations of the loop".
If that's what you want, what about something like this:
int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
    int end = min( index_outer + nths, num ); // min() as in a usual min macro or helper
    #pragma omp for
    for( int index = index_outer; index < end; index++ ) {
        // the loop body just as before
    } // there's a thread synchronization here
}

OpenMP: Run 2 thread groups in parallel

I want to have 2 thread groups run at the same time. For example, 2 threads execute code block 1 while another 2 threads execute another code segment. There was a Stack Overflow question here, OpenMP: Divide all the threads into different groups, and I changed its code to see if it suits the logic I need in my code.
I have the below code with me.
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 1

int main(int argc, char **argv) {
    omp_set_nested(1); /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = NUM_THREADS;
    int nthreads2 = NUM_THREADS;

    int t1[nthreads1];
    for (int i = 0; i < nthreads1; i++) {
        t1[i] = 0;
    }

    #pragma omp parallel default(none) shared(nthreads1, nthreads2, t1) num_threads(2)
    #pragma omp single
    {
        #pragma omp task // section 1
        #pragma omp parallel for num_threads(nthreads1) shared(t1)
        for (int i = 0; i < nthreads1; i++) {
            printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), i);
            t1[i] = 1;
        }

        #pragma omp task // section 2
        #pragma omp parallel for num_threads(nthreads2) shared(t1)
        for (int j = 0; j < nthreads2; j++) {
            while (!t1[j]) {
                printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                       omp_get_thread_num(), omp_get_team_size(2),
                       omp_get_ancestor_thread_num(1), j);
            }
        }
    }
    return 0;
}
To check whether my code runs 2 thread groups at once, I set the thread count in each group to 1, and I keep a boolean array that is initialized to 0.
In the first code segment I set the boolean value to 1, and in the 2nd code segment I check the boolean value to break out of the while loop. It seems the above code is run by only 1 thread: once that thread starts running the 2nd code block/section, it gets stuck inside the while loop, because no other thread ever sets the boolean value to 1.
How do I run 2 thread groups in parallel?
UPDATE: My use case: I am writing a word-count map-reduce program using OpenMP. I want one thread group to read files, adding the read lines to a queue. I want another thread group to process lines from those queues and update the counts in a chained hash table. I already wrote the code to first do all the reading to build the queues and then do the mapping to take data from the queues and generate word counts, but I want to change my program so that 2 thread groups do the reading and the mapping in parallel, at the same time. That's why I made this short code sample, to check how I can implement 2 thread groups running in parallel and executing 2 different code segments.
It seems this can be solved using single directives with nowait, plus task directives. The approach below puts the tasks onto a queue, and the threads then pick up work from the queue. So, ideally, 2 thread groups will be working on 2 different tasks, which is what the question requires. Below is the code:
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 1

int main(int argc, char **argv) {
    omp_set_nested(1); /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = NUM_THREADS;
    int nthreads2 = NUM_THREADS;

    int t1[nthreads1];
    for (int i = 0; i < nthreads1; i++) {
        t1[i] = 0;
    }

    #pragma omp parallel default(none) shared(nthreads1, nthreads2, t1)
    {
        #pragma omp single nowait // section 1
        for (int i = 0; i < nthreads1; i++) {
            #pragma omp task
            {
                printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                       omp_get_thread_num(), omp_get_team_size(2),
                       omp_get_ancestor_thread_num(1), i);
                t1[i] = 1;
            }
        }

        #pragma omp single nowait // section 2
        for (int j = 0; j < nthreads2; j++) {
            #pragma omp task
            {
                while (!t1[j]) {
                    printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                           omp_get_thread_num(), omp_get_team_size(2),
                           omp_get_ancestor_thread_num(1), j);
                }
            }
        }
    }
    return 0;
}
Also, you can simply use an if-else on the thread number inside the #pragma omp parallel construct to run 2 thread groups in parallel:
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2

int main(int argc, char **argv) {
    omp_set_nested(1); /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = NUM_THREADS / 2;
    int nthreads2 = NUM_THREADS / 2;

    int t1[nthreads1];
    for (int i = 0; i < nthreads1; i++) {
        t1[i] = 0;
    }

    #pragma omp parallel default(none) shared(t1, nthreads1) num_threads(NUM_THREADS)
    {
        int i = omp_get_thread_num();
        if (i < nthreads1) { // section 1
            printf("Section 1: thread %d\n", i);
            t1[i] = 1;
        } else {             // section 2
            int j = i - nthreads1;
            while (!t1[j]) {
                printf("Section 2: thread %d, shift_value %d\n", i, j);
            }
        }
    }
    return 0;
}

OMP 2.0 Nested For Loops

As I'm unable to use omp tasks (using Visual Studio 2015, which only supports OpenMP 2.0), I'm trying to find a workaround for a nested loop task. The code is as follows:
#pragma omp parallel
{
    for (i = 0; i < largeNum; i++)
    {
        #pragma omp single
        {
            // Some code to be run by a single thread
            memset(results, 0, num * sizeof(results[0]));
        }
        #pragma omp for
        for (n = 0; n < num; n++) {
            // Call to my function
            largeFunc(params[n], &results[n]);
        }
    }
    #pragma omp barrier
}
I want all my threads to execute largeNum times, but to wait for results to be zeroed by the memset, and then I want largeFunc to be performed by each thread. I have found no data dependencies.
I've got the omp directives all jumbled in my head at this point. Does this solution work? Is there a better way to do it without tasks?
Thanks!
What about just this code?
#pragma omp parallel private( i, n )
for ( i = 0; i < largeNum; i++ ) {
    #pragma omp for
    for ( n = 0; n < num; n++ ) {
        results[n] = 0;
        largeFunc( param[n], &results[n] );
    }
}
As far as I understand your problem, the initialisation part should be taken care of without the need of the single directive, provided the actual type of results supports the assignment to 0. Moreover, your initial code was lacking the private( i ) declaration. Finally, the barrier shouldn't be needed.
Why do you want all your threads to execute the largeNum loop? Do you depend on the index i inside your largeFunc in some way? If yes:
#pragma omp parallel for
for (int i = 0; i < largeNum; i++)
{
    #pragma omp single
    {
        // Some code to be run by a single thread
        memset(results, 0, num * sizeof(results[0]));
    }
    #pragma omp barrier
    // #pragma omp for -- this is not needed since it has to be coarse on the outermost level. However, if the function below does not have anything to do with the outer loop, then see the next example
    for (n = 0; n < num; n++) {
        // Call to my function
        largeFunc(params[n], &results[n]);
    }
}
If you do not depend on i, then:
for (i = 0; i < largeNum; i++)
{
    // Some code to be run by a single thread
    memset(results, 0, num * sizeof(results[0]));

    #pragma omp parallel for
    for (int n = 0; n < num; n++) {
        // Call to my function
        largeFunc(params[n], &results[n]);
    }
}
However, I feel you want the first one. In general you parallelise the outermost loop. Placing pragmas in the inner loop will slow your code down due to overheads if there is not enough work to be done.

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using openMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for (j = 0; j < (len2+1); j++)
    score[0][j] = 0;
for (i = 0; i < (len1+1); i++)
    score[i][0] = 0;

// Calculating scores
for (i = 1; i < (len1+1); i++) {
    for (j = 1; j < (len2+1); j++) {
        if (seq1[i-1] == seq2[j-1]) {
            score[i][j] = score[i-1][j-1] + 1;
        }
        else {
            score[i][j] = max(score[i-1][j], score[i][j-1]);
        }
    }
}
The critical part is filling up the score matrix, and this is the part I'm mostly trying to parallelize.
One way to do it (which I chose) is: filling up the matrix by anti-diagonals, so the left, top and top-left dependencies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
    int score[len1 + 1][len2 + 1];
    int i, j, k, iam;
    char *lcs = NULL;

    for (i = 0; i < len1 + 1; i++)
        for (j = 0; j < len2 + 1; j++)
            score[i][j] = -1;

    #pragma omp parallel default(shared) private(iam)
    {
        iam = omp_get_thread_num();

        // Preparing first row and first column with zeros
        #pragma omp for
        for (j = 0; j < (len2+1); j++)
            score[0][j] = iam;
        #pragma omp for
        for (i = 0; i < (len1+1); i++)
            score[i][0] = iam;

        // Calculating scores
        for (i = 1; i < (len1+1); i++) {
            k = i;
            #pragma omp for
            for (j = 1; j <= i; j++) {
                if (seq1[k-1] == seq2[j-1]) {
                    // score[k][j] = score[k-1][j-1] + 1;
                    score[k][j] = iam;
                }
                else {
                    // score[k][j] = max(score[k-1][j], score[k][j-1]);
                    score[k][j] = iam;
                }
                #pragma omp atomic
                k--;
            }
        }
    }
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to fill up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing matrix in a diagonal way is very cache-unfriendly and threads could be unbalanced, but I only need it to work by now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the threads will perform the decrement one at a time. You are looking for #pragma omp for private(k): the threads will no longer share the same value (note that each thread's private k must then be given a value, e.g. computed from j, instead of being decremented). Bye, Francis
The following nested for loop
#pragma omp for
for (j = 1; j <= i; j++)
will be executed in parallel, each thread with a different value of j, in no specific order.
As nothing is specified in the omp for section, k will be shared by default between all threads. So, depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So, for a fixed j, the value of k might change during the execution of the body of the for loop (between the if clauses, ...).

OpenMP: synchronization inside parallel for

I have code that reads like this:
void h(particles *p) {
    #pragma omp parallel for
    for (int i = 0; i < maxThreads; ++i) {
        int id = omp_get_thread_num();
        for (int j = 0; j < dtnum; ++j) {
            f(p, id);
            if (j % 50 == 0) {
                if (id == 0) {
                    g(p);
                }
                #pragma omp barrier
            }
        }
    }
}

void f(particles *p, int id) {
    for (int i = id * prt_thread; i < (id + 1) * prt_thread; ++i) {
        x(p[i]);
    }
}
Basically I want to:
1) spawn a given amount of threads; each thread will process a chunk of p according to the thread's id
2) each element of p must be processed dtnum times; the processing involves random events
3) every 50 iterations, one thread must perform another operation, while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
What can I do?
It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel for loop (which seems clearer anyway) and just do:
const int iterblocks = 50;

#pragma omp parallel shared(p, dtnum, prt, iterblocks) default(none)
for (int jblock = 0; jblock < dtnum/iterblocks; jblock++) {
    for (int j = 0; j < iterblocks; j++) {
        #pragma omp for nowait
        for (int i = 0; i < prt; i++)
            x(p[i]);
    }
    #pragma omp barrier
    #pragma omp single
    g(p);
    #pragma omp barrier
}
I think your code is wrong. You said:
each element of p must be processed dtnum times.
But the number of times each element of p is processed depends on how the maxThreads iterations of the outer loop are distributed over the actual threads; it is dtnum per element only if exactly maxThreads threads each take one iteration.
Could you be more explicit about what your code is supposed to do?
