I'd like to run something like the following:
for (int index = 0; index < num; index++)
I want to run the for loop with four threads, with the iterations executed in order: 0,1,2,3,4,5,6,7,8, etc.
That is, for the threads to be working on index = n,(n+1),(n+2),(n+3) (in any particular order, but always in this pattern), I want iterations index = 0,1,2,...,(n-1) to already be finished.
Is there a way to do this? ordered doesn't really work here, as making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work because I don't want a thread to be working on iterations k through k+num/4 on its own.
Thanks for any help!
You can do this with, not a parallel for loop, but a parallel region that manages its own loop inside, plus a barrier to make sure all running threads have reached the same point in it before being able to continue. (Note that this relies on num being a multiple of the thread count; otherwise some threads would skip the final barrier and the program would hang.) Example:
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>
int main()
{
atomic_int chunk = 0;
int num = 12;
int nthreads = 4;
omp_set_num_threads(nthreads);
#pragma omp parallel shared(chunk, num, nthreads)
{
for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
printf("In index %d\n", index);
fflush(stdout);
#pragma omp barrier
// For illustrative purposes only; not needed in real code
#pragma omp single
{
puts("After barrier");
fflush(stdout);
}
}
}
puts("Done");
return 0;
}
One possible output:
$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done
I'm not sure I understand your request correctly. If I try to summarize my interpretation, it would be something like: "I want 4 threads sharing the iterations of a loop, with the 4 threads always running on at most 4 consecutive iterations of the loop".
If that's what you want, what about something like this:
int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
int end = index_outer + nths < num ? index_outer + nths : num; // min( index_outer + nths, num )
#pragma omp for
for( int index = index_outer; index < end; index++ ) {
// the loop body just as before
} // there's an implicit barrier (thread synchronization) here
}
I'm learning OpenMP these days and I just came across the "threadprivate" directive. The code snippet below, which I wrote myself, didn't output the expected result:
// **** File: fun.h **** //
void seed(int x);
int drand();
// ********************* //
// **** File: fun.c **** //
extern int num;
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
// **** File: main.c **** //
#include "fun.h"
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int num = 0;
#pragma omp threadprivate(num)
int main()
{
int num_initial = 4;
seed(num_initial);
printf("At the beginning, num = %d\n", num); // should num be 4?
#pragma omp parallel for num_threads(2) schedule(static,1) copyin(num)
for (int ii = 0; ii < 4; ii++) {
int my_rank = omp_get_thread_num();
//printf("Before processing, in thread %d num = %d\n", my_rank,num);
int num_in_loop = drand();
printf("Thread %d is processing loop %d: num = %d\n", my_rank,ii, num_in_loop);
}
system("pause");
return 0;
}
// ********************* //
Here are my questions:
Why is the result of printf("At the beginning, num = %d\n", num); num = 0 instead of num = 4?
As for the parallel for loop, multiple executions produce different results, one of which is:
Thread 1 is processing loop 1: num = 5
Thread 0 is processing loop 0: num = 6
Thread 1 is processing loop 3: num = 7
Thread 0 is processing loop 2: num = 8
It seems that num is initialized to 4 in the for loop, which means the num in the copyin clause is equal to 4. Why is the num in printf("At the beginning, num = %d\n", num) different from the one copyin sees?
The OpenMP website says:
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
According to this explanation, Thread 0 (the master thread) should initially contain num = 4. Therefore, loop 0's output should always be: Thread 0 is processing loop 0: num = 5. Why is the result above different?
My working environment is the Windows 10 operating system with VS2015.
I think the problem is within the fun.c compilation unit: the compiler cannot determine that the extern int num; variable is also a thread-local (TLS) one.
I would include the directive #pragma omp threadprivate(num) in this file as well:
// **** File: fun.c **** //
extern int num;
#pragma omp threadprivate(num)
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
In any case, the toolchain ought to warn about the mismatch at link time.
The copyin clause is meant to be used across OpenMP teams (e.g. computation on computing accelerators).
Indeed, the OpenMP documentation says:
These clauses support the copying of data values from private or threadprivate variables on one implicit task or thread to the corresponding variables on other implicit tasks or threads in the team.
Thus, in your case, you should rather use the firstprivate clause.
Please note that version 5.0 of the OpenMP documentation you are reading is probably not supported by VS2015. I advise you to read an older version compatible with VS2015; the results of the compiled program are otherwise likely to be undefined.
I ran an example OpenMP program to test the behavior of threadprivate, but the result is unexpected and nondeterministic. (The program is run on Ubuntu 14.04 in VMware.)
The source code is shown in the following:
#include <stdio.h>
#include <omp.h>
int counter = 0;
#pragma omp threadprivate (counter)
int inc_counter()
{
counter++;
return counter;
}
int main()
{
#pragma omp parallel sections copyin(counter)
{
#pragma omp section
{
int count1;
for(int iter = 0; iter < 100; ++iter)
count1 = inc_counter();
printf("count1 = %1d\n", count1);
}
#pragma omp section
{
int count2;
for(int iter=0; iter<200; iter++)
count2 = inc_counter();
printf("count2 = %d\n", count2);
}
}
printf("counter = %d\n", counter);
}
The output of the program varies between runs. The correct output should be:
count1 = 100
count2 = 200
counter = 0/100/200?
What's wrong with it?
threadprivate differs from private in the sense that it doesn't have to allocate a local variable for each thread; the variable only has to be unique per thread. One of your threads (the master) uses the global counter defined by int counter = 0;. Therefore this thread changes its value to 100 or 200, or leaves it unchanged (0), depending on which sections this single thread happened to execute.
As you highlighted, it may seem weird why the program gives the following results for (count1, count2, counter): (100,300,300) and (100,300,0).
(100,300,300)
The master thread executes both sections. You can check this by launching your code with a single thread: OMP_NUM_THREADS=1 ./ex_omp
(100,300,0)
Some non-master thread executes both sections while the master is idle. You can check this by introducing a third section (alongside your two):
#pragma omp section
{
sleep(1); // #include <unistd.h>
printf("hope it is 0 (master) -> %d\n", omp_get_thread_num());
}
If you have 2 threads and the master starts to execute this section, then with high probability the other thread executes your two other sections and you get (100,300,0) as expected. Launch, for example, as OMP_NUM_THREADS=2 ./ex_omp.
If it still seems wrong that count2 = 300, notice that counter is not private to a section; it is private to a thread, and one thread can execute both sections.
I have a situation where I need to repeat a specific iteration of the loop multiple times. So, in that specific iteration, I decrement the loop index so that the next increment of the index makes no difference.
This approach, which is the approach I have to implement, works for multi-threaded OpenMP code. However, it does not work for OpenACC (for both multicore and tesla targets). I get the following error:
Floating point exception (core dumped)
Here is the code for both of cases:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
omp_set_num_threads(6);
#pragma omp parallel for
for(i=0;i<100;i++) {
if(i == x) {
printf("%d\n", i);
i--;
count--;
if(count == 0)
x = 10000;
}
}
int gpu_count = 0;
count = 5;
x = 52;
#pragma acc parallel loop independent
for(i=0;i<1000000;i++) {
if(i == x) {
#pragma acc atomic
gpu_count++;
i--;
count--;
if(count == 0)
x = 2000000;
}
}
printf("gpu_count: %d\n", gpu_count);
return 0;
}
For OpenMP, I get the correct output:
52
52
52
52
52
But, for OpenACC, I get the above-mentioned error.
If I comment out line 35 (i--;), the code executes correctly and outputs the number of repeated iterations (which is 1).
Note: I am using PGI 16.5 with Geforce GTX 970 and CUDA 7.5.
I compile with PGI compiler like following:
pgcc -mp -acc -ta=multicore -g f1.c
So, my question is: why do I see such behavior? Can't I change the loop index variable in OpenACC?
Your OpenMP version is in error. You're relying on a static schedule where the chunk size is larger than "count". If you increase the number of OMP threads so the chunk size is smaller than count, or if you change the schedule to interleave the chunks (i.e. "schedule(static,1)"), then you'll get wrong answers. There are also race conditions on "x" and "count".
Note that OpenACC scheduling is more like OpenMP's "static,1", so that vectors can access contiguous blocks of memory across a worker (aka a CUDA warp). So your algorithm won't work here either.
Also, by using the "independent" clause (which is implied when using "parallel loop"), you are asserting to the compiler that this loop contains no dependencies, or that the user will handle them via the "atomic" directive. However, changing the loop index variable inside the body of the loop creates a loop dependency, since the value of the loop index depends on whether the previous iteration changed its value.
Edit: Below is an example which is a parallelizable version of your code.
% cat test2.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
int mycnt;
#pragma omp parallel for schedule(static,1) private(mycnt)
for(i=0;i<100;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
printf("%d\n", i);
mycnt--;
}
}
}
#ifdef _OPENACC
int gpu_count = 0;
#pragma acc parallel loop reduction(+:gpu_count)
for(i=0;i<1000000;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
gpu_count++;
mycnt--;
}
}
}
printf("gpu_count: %d\n", gpu_count);
#endif
return 0;
}
% pgcc -fast -mp -acc test2.c -Minfo=mp,acc
main:
13, Parallel region activated
Parallel loop activated with static cyclic schedule
24, Barrier
Parallel region terminated
25, Accelerator kernel generated
Generating Tesla code
25, Generating reduction(+:gpu_count)
26, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
29, #pragma acc loop seq
29, Loop carried scalar dependence for gpu_count at line 30
% a.out
52
52
52
52
52
gpu_count: 5
I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable at the same time, you get different and unpredictable results.
You can use per-thread variables to get around that.
In your code j is a shared induction variable. You can't use shared induction variables efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that avoids induction variables (for example using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables like this:
#pragma omp parallel
{
int i, k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared inductive variables using wheel factorization
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (which the ordered clause implies). This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to ensure the atomicity of the desired operation (incrementing the variable j in your case). It would be a mutex, semaphore, or event object depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow logic should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux-based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all these synchronization mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra in the early 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the latest value, and you end up overwriting data.
There is a way, using OpenMP's options, to have the threads in a loop perform a summation when they finish. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is read the result from "sum"; this is from an old code of mine that does a summation.
The other option you have is the dirty one: somehow make the threads wait and get in order using calls to the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
but I recommend you read the OpenMP options in full.
Suppose I have an array with indices 0..n-1. Is there a way to choose which cells each thread would handle? e.g. thread 0 would handle cells 0 and 5 , thread 1 would handle cells 1 and 6 and so on..
Have you looked at the schedule clause for the parallel for?
#pragma omp for schedule(static, 1)
should implement what you want. You can experiment with the schedule clause using the following simple code:
#include<stdio.h>
#include<omp.h>
int main(){
int i;
#pragma omp parallel for schedule(static,1)
for ( i = 0 ; i < 10 ; ++i){
int th_id = omp_get_thread_num(); // private per thread, avoiding a race on th_id
printf("Thread %d is on %d\n",th_id,i);
}
}
You can even be more explicit:
#pragma omp parallel
{
int nth = omp_get_num_threads();
int ith = omp_get_thread_num();
for (int i=ith; i<n; i+=nth)
{
// handle cell i.
}
}
This should do exactly what you want: thread ith handles cells ith, ith+nth, ith+2*nth, ith+3*nth, and so on.