I want to have 2 thread groups running at the same time: for example, 2 threads executing code block 1 while another 2 threads execute a different code segment. There is an existing Stack Overflow question, OpenMP: Divide all the threads into different groups, and I adapted its code to see whether it fits the logic I need in my program.
I have the code below.
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 1
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS;
int nthreads2 = NUM_THREADS;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(nthreads1, nthreads2, t1) num_threads(2)
#pragma omp single
{
#pragma omp task // section 1
#pragma omp parallel for num_threads(nthreads1) shared(t1)
for (int i=0; i<nthreads1; i++) {
printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), i);
t1[i] = 1;
}
#pragma omp task // section 2
#pragma omp parallel for num_threads(nthreads2) shared(t1)
for (int j=0; j<nthreads2; j++) {
while (!t1[j]) {
printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), j);
}
}
}
return 0;
}
To check whether my code really runs 2 thread groups at once, I set the thread count in each group to 1 and keep a boolean array initialized to 0.
In the first code segment I set the boolean value to 1, and in the 2nd code segment I check the boolean value to break out of the while loop. It seems the above code is only run by 1 thread: if that thread starts with the 2nd code block/section, it gets stuck inside the while loop, because no other thread ever sets the boolean value to 1.
How can I run 2 thread groups in parallel?
UPDATE: My use case: I am writing a word-count map-reduce program using OpenMP. I want one thread group to read files and add the read lines to a queue. I want another thread group to process lines from those queues and update the counts in a chained hash table. I have already written the code that first does all the reading to build the queues and then does the mapping to take data from the queues and generate word counts, but I want to change my program so that 2 thread groups do the reading and the mapping in parallel, at the same time. That is why I wrote this short code to check how I can implement 2 thread groups running in parallel and executing 2 different code segments.
It seems the above can be solved using single directives with nowait together with task directives. The approach below puts the tasks onto a queue, and threads then pick up work from that queue. So, ideally, 2 thread groups will be working on 2 different tasks, which is what the question requires. Below is the code:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 1
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS;
int nthreads2 = NUM_THREADS;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(nthreads1, nthreads2, t1)
{
#pragma omp single nowait // section 1
for (int i=0; i<nthreads1; i++) {
#pragma omp task
{
printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), i);
t1[i] = 1;
}
}
#pragma omp single nowait // section 2
for (int j=0; j<nthreads2; j++) {
#pragma omp task
{
while (!t1[j]) {
printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
omp_get_thread_num(), omp_get_team_size(2),
omp_get_ancestor_thread_num(1), j);
}
}
}
}
return 0;
}
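Mapped onto the word-count use case from the update, the same single nowait + task pattern might look roughly like the sketch below. This is only an illustration: the queue is a fixed-size array guarded by a named critical section, the "reader" fabricates lines instead of reading real files, and the "mapper" just counts lines instead of updating a chained hash table. It also needs at least 2 threads (e.g. OMP_NUM_THREADS >= 2) so that the reader and mapper tasks can actually run at the same time.
#include <stdio.h>
#include <string.h>
#include <omp.h>
#define QCAP 256

/* Toy shared queue of lines; in the real program these would come from the input files. */
static char lineq[QCAP][128];
static int qhead = 0, qtail = 0, reading_done = 0;

static void push_line(const char *line) {
    #pragma omp critical(lineq_lock)
    {
        if (qtail < QCAP) { strcpy(lineq[qtail], line); qtail++; }
    }
}

/* Returns 1 if a line was popped, 0 if the queue is momentarily empty,
   and -1 if the queue is empty and the reader has finished. */
static int pop_line(char *out) {
    int status;
    #pragma omp critical(lineq_lock)
    {
        if (qhead < qtail) { strcpy(out, lineq[qhead]); qhead++; status = 1; }
        else status = reading_done ? -1 : 0;
    }
    return status;
}

int main(void) {
    long lines_mapped = 0; /* stand-in for updating the chained hash table */
    #pragma omp parallel
    {
        #pragma omp single nowait // "reader" group: one task per input file
        #pragma omp task
        {
            char line[128];
            for (int i = 0; i < 100; i++) { /* pretend to read 100 lines from a file */
                snprintf(line, sizeof line, "line %d of the file", i);
                push_line(line);
            }
            #pragma omp critical(lineq_lock)
            reading_done = 1; /* tell the mapper that no more lines are coming */
        }
        #pragma omp single nowait // "mapper" group: one task per mapper
        #pragma omp task
        {
            char line[128];
            for (;;) {
                int s = pop_line(line);
                if (s == 1) {
                    lines_mapped++; /* real code would tokenize the line and count words */
                } else if (s == -1) {
                    break; /* queue drained and reader finished */
                }
                /* s == 0: queue momentarily empty, keep polling */
            }
        }
    }
    printf("lines mapped: %ld\n", lines_mapped);
    return 0;
}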
Alternatively, you can simply use an if/else on the thread number inside the #pragma omp parallel construct to run 2 thread groups in parallel:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2
int main(int argc, char **argv) {
omp_set_nested(1); /* make sure nested parallelism is on */
int nprocs = omp_get_num_procs();
int nthreads1 = NUM_THREADS/2;
int nthreads2 = NUM_THREADS/2;
int t1[nthreads1];
for (int i=0; i<nthreads1; i++) {
t1[i] = 0;
}
#pragma omp parallel default(none) shared(t1, nthreads1) num_threads(NUM_THREADS)
{
int i = omp_get_thread_num(); // section 1
if (i<nthreads1) {
printf("Section 1: thread %d\n",i);
t1[i] = 1;
} else {
int j = i - nthreads1;
while (!t1[j]) {
printf("Section 2: thread %d, shift_value %d\n", i, j);
}
}
}
return 0;
}
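One caveat that applies to all three versions above: the busy-wait on t1[j] reads a plain shared int, so an optimizing compiler is free to keep the value in a register and spin forever even after the other group has set the flag. To be portable, the flag should be written and re-read through atomics (or with explicit flushes). A rough sketch of how the relevant lines would change; this is a fragment meant to be dropped into the examples above, not a complete program:
/* Writer side (section 1): publish the flag with an atomic write. */
#pragma omp atomic write
t1[i] = 1;

/* Reader side (section 2): re-read the flag atomically on every pass of the spin loop. */
int seen = 0;
while (!seen) {
    #pragma omp atomic read
    seen = t1[j];
}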
Related
I would like to run the code below. I want to spawn two independent threads, each of which would run a parallel for loop. Unfortunately, I get an error. Apparently, a parallel for cannot be spawned inside a section. How do I solve that?
#include <omp.h>
#include "stdio.h"
int main()
{
omp_set_num_threads(10);
#pragma omp parallel
#pragma omp sections
{
#pragma omp section
#pragma omp for
for(int i=0; i<5; i++) {
printf("x %d\n", i);
}
#pragma omp section
#pragma omp for
for(int i=0; i<5; i++) {
printf(". %d\n", i);
}
} // end parallel and end sections
}
And the error:
main.cpp: In function ‘int main()’:
main.cpp:14:9: warning: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region [enabled by default]
main.cpp:20:9: warning: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region [enabled by default]
Here you have to use nested parallelism. The problem with the omp for in the sections is that all the threads in scope have to take part in the omp for, and they clearly don't; they're broken up by sections. So you have to introduce functions, and do nested parallelism within the functions.
#include <stdio.h>
#include <omp.h>
void doTask1(const int gtid) {
omp_set_num_threads(5);
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i=0; i<5; i++) {
printf("x %d %d %d\n", i, tid, gtid);
}
}
}
void doTask2(const int gtid) {
omp_set_num_threads(5);
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp for
for(int i=0; i<5; i++) {
printf(". %d %d %d\n", i, tid, gtid);
}
}
}
int main()
{
omp_set_num_threads(2);
omp_set_nested(1);
#pragma omp parallel
{
int gtid = omp_get_thread_num();
#pragma omp sections
{
#pragma omp section
doTask1(gtid);
#pragma omp section
doTask2(gtid);
} // end parallel and end sections
}
}
By default, OpenMP does not spawn extra threads for parallel regions nested inside parallel regions. Typical implementations create a pool of num_threads threads at program start and put the idle ones to sleep in non-parallel regions, because repeatedly creating new threads is much slower than waking sleeping ones; an inner parallel region is then executed by a team of just one thread unless nesting is explicitly enabled.
Therefore you should parallelize only the loops:
#include <omp.h>
#include "stdio.h"
int main()
{
omp_set_num_threads(10);
#pragma omp parallel for
for(int i=0; i<5; i++) {
printf("x %d\n", i);
}
#pragma omp parallel for
for(int i=0; i<5; i++) {
printf(". %d\n", i);
}
}
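As noted above, inner parallel regions are allowed; they are simply run by a team of one thread unless nesting is switched on. A minimal sketch for checking this on a given system (note that omp_set_nested is deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels):
#include <stdio.h>
#include <omp.h>
int main()
{
    omp_set_max_active_levels(2); /* allow two levels of active parallelism */
    #pragma omp parallel num_threads(2)
    #pragma omp parallel num_threads(3)
    {
        /* With nesting enabled this prints 2*3 = 6 lines; with it disabled,
           each inner region gets a team of 1 and only 2 lines appear. */
        printf("outer thread %d, inner thread %d, level %d\n",
               omp_get_ancestor_thread_num(1),
               omp_get_thread_num(),
               omp_get_level());
    }
}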
In practice, the optimal number of threads equals the number of available CPU cores, so every parallel for should be spread over all available cores, which is impossible inside omp sections. Therefore what you are trying to achieve is not optimal. tune2fs' suggestion to execute the two loops one after the other, without sections, makes sense and gives the best possible performance. You can execute parallel loops inside other functions, but this "cheating" doesn't give a performance boost.
I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. The program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use per-thread (private) variables to get around that.
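For completeness, OpenMP itself can make the shared-index update atomic with an atomic capture, although (as the next answer points out) paying for an atomic on every iteration is not efficient. A minimal sketch of the idea, reusing the variable names from the question:
#include <stdio.h>
#include <omp.h>
#define MAX 1000
int main(void)
{
    unsigned int NUMS[MAX] = { 0 };
    int j = 0; /* shared index into NUMS, updated atomically below */
    #pragma omp parallel for
    for (int i = 1; i < MAX; i++) {
        if (i % 5 == 0 || i % 3 == 0) {
            int slot;
            #pragma omp atomic capture
            slot = j++; /* the read-and-increment happens as one indivisible step */
            NUMS[slot] = i;
        }
    }
    unsigned int total = 0;
    for (int i = 0; i < j; i++) total += NUMS[i]; /* entries 0..j-1 are filled, in no particular order */
    printf("Total : %u\n", total);
    return 0;
}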
In your code j is a shared induction variable. You can't use a shared induction variable efficiently with multiple threads (using atomic on every iteration is not efficient).
You could find a special solution that avoids the induction variable (for example wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution that uses private induction variables, like this:
/* Drop-in replacement for the parallel part of main() in the question; it assumes the
   declarations from there, with total initialized: unsigned int NUMS[MAX] = {0}; int i, j = 0; unsigned int total = 0; */
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could emulate in C with realloc, but I'm not going to do that for you.
Edit:
Here is a special solution that avoids shared induction variables by using wheel factorization:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without the critical section implied by the ordered construct. This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a synchronization object to make the desired operation atomic (incrementing the variable j in your case). It could be a mutex, a semaphore, or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, the environment provides specialized synchronization mechanisms (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux-based operating system, you can use mutex or semaphore synchronization objects. All of these mechanisms stem from the concept of the semaphore, invented by Edsger W. Dijkstra in the early 1960s.
Here are some basic examples:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the latest value, and you end up overwriting data.
OpenMP has an option to make the threads in a loop accumulate a sum when they finish: the reduction clause. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is collect the result in "sum"; this is from some old code of mine that does a summation.
The other option you have is the dirty one: somehow make the threads wait and fall into order by forcing a call into the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend you read through the OpenMP options fully.
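For reference, the reduction approach applied directly to the loop from the question might look roughly like this (a sketch that only computes the sum, without storing the multiples first):
#include <stdio.h>
#include <omp.h>
#define MAX 1000
int main(void)
{
    unsigned int total = 0;
    /* Each thread accumulates a private partial sum; OpenMP adds the partial sums together at the end. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i < MAX; i++) {
        if (i % 5 == 0 || i % 3 == 0)
            total += i;
    }
    printf("Total : %u\n", total);
    return 0;
}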
I am playing around with the omp task construct and have encountered a problem. I have the following code:
void function1(int n, int j)
{
//Serial part
#pragma omp parallel
{
#pragma omp single nowait
{
//execcute function1() for every n
for (i=0; i<n; i++)
{
//create a task for computing only large numbers of n
if (i <= 10)
//execute serial
else
#pragma omp task
//call function again
function1(n, j+1);
printf("task id:\n", omp_get_thread_num());
}
}
}
}
Right now the code produces the correct result, but the performance is much slower than the original serial version. After some investigation I found that all the tasks are executed by thread 0, even though there are 4 threads running in total. Does anyone know what's going on here? Thanks in advance!
The omp single nowait pragma means that the following block is to be executed by a single thread. In this case that means that your entire loop is executed by one thread. This should resolve your issue:
void function1(int n, int j)
{
//Serial part
#pragma omp parallel
{
//execcute function1() for every n
for (i=0; i<n; i++) //Notice this is outside the pragma
{
#pragma omp single nowait //Notice this is inside the loop
{
//create a task for computing only large numbers of n
if (i <= 10)
//execute serial
else
#pragma omp task
//call function again
function1(n, j+1);
printf("task id:\n", omp_get_thread_num());
}
}
}
}
I have code that reads like this:
void h(particles *p) {
#pragma omp parallel for
for (int i = 0; i < maxThreads; ++i) {
int id = omp_get_thread_num();
for (int j = 0; j < dtnum; ++j) {
f( p, id);
if ( j % 50 == 0 ) {
if (id == 0) {
g(p);
}
#pragma omp barrier
}
}
}
}
void f(particles *p, int id) {
for (int i = id * prt_thread; i < (id + 1)*prt_thread; ++i) {
x(p[i]);
}
}
Basically I want to:
1) spawn a given number of threads; each thread will process a chunk of p according to the thread's id
2) each element of p must be processed dtnum times; the processing involves random events
3) every 50 iterations, one thread must perform another operation while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
What can I do?
It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel omp for loop (which seems clearer anyway) and just do:
const int iterblocks=50;
#pragma omp parallel shared(p, dtnum, prt) default(none)
for (int jblock=0; jblock<dtnum/iterblocks; jblock++) {
for (int j=0; j<iterblocks; j++) {
#pragma omp for nowait
for (int i=0; i<prt; i++)
x(p[i]);
}
#pragma omp barrier
#pragma omp single
g(p);
#pragma omp barrier
}
I think your code is wrong. You said:
each element of p must be processed dtnum times.
But each element of p will be processed maxThreads*dtnum times.
Could you be more explicit about what your code is supposed to do?
Suppose I have an array with indices 0..n-1. Is there a way to choose which cells each thread handles? E.g. thread 0 would handle cells 0 and 5, thread 1 would handle cells 1 and 6, and so on.
Have you looked at the schedule clause for the parallel for?
#pragma omp for schedule(static, 1)
should implement what you want. You can experiment with the schedule clause using the following simple code:
#include<stdio.h>
#include<omp.h>
int main(){
int i,th_id;
#pragma omp parallel for schedule(static,1)
for ( i = 0 ; i < 10 ; ++i){
th_id = omp_get_thread_num();
printf("Thread %d is on %d\n",th_id,i);
}
}
You can even be more explicit:
#pragma omp parallel
{
int nth = omp_get_num_threads();
int ith = omp_get_thread_num();
for (int i=ith; i<n; i+=nth)
{
// handle cell i.
}
}
This should do exactly what you want: thread ith handles cells ith, ith+nth, ith+2*nth, ith+3*nth, and so on.