I'm learning OpenMP and have just come across the "threadprivate" directive. The code below, which I wrote myself, doesn't produce the expected result:
// **** File: fun.h **** //
void seed(int x);
int drand();
// ********************* //
// **** File: fun.c **** //
extern int num;
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
// **** File: main.c **** //
#include "fun.h"
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int num = 0;
#pragma omp threadprivate(num)
int main()
{
int num_initial = 4;
seed(num_initial);
printf("At the beginning, num = %d\n", num); // should num be 4?
#pragma omp parallel for num_threads(2) schedule(static,1) copyin(num)
for (int ii = 0; ii < 4; ii++) {
int my_rank = omp_get_thread_num();
//printf("Before processing, in thread %d num = %d\n", my_rank,num);
int num_in_loop = drand();
printf("Thread %d is processing loop %d: num = %d\n", my_rank,ii, num_in_loop);
}
system("pause");
return 0;
}
// ********************* //
Here are my questions:
Why does printf("At the beginning, num = %d\n", num); print num = 0 instead of num = 4?
As for the parallel for loop, multiple executions produce different results, one of which is:
Thread 1 is processing loop 1: num = 5
Thread 0 is processing loop 0: num = 6
Thread 1 is processing loop 3: num = 7
Thread 0 is processing loop 2: num = 8
It seems that num is initialized to 4 inside the for loop, which suggests that the num seen by the copyin clause equals 4. Why is the num in printf("At the beginning, num = %d\n", num) different from the one copyin sees?
The OpenMP website says:
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
According to this explanation, thread 0 (the master thread) should initially contain num = 4. Therefore, loop 0's output should always be: Thread 0 is processing loop 0: num = 5. Why does the result above differ?
My working environment is Windows 10 with VS2015.
I think the problem is in the fun.c compilation unit: the compiler cannot tell that the extern int num; variable is also a thread-local (TLS) one.
I would add the #pragma omp threadprivate(num) directive in this file as well:
// **** File: fun.c **** //
extern int num;
#pragma omp threadprivate(num)
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
In any case, one would expect the toolchain to warn about this mismatch at link time.
The copyin clause is meant to be used across OpenMP thread teams (e.g. computation on computing accelerators).
Indeed, the OpenMP documentation says:
These clauses support the copying of data values from private or threadprivate variables on one implicit task or thread to the corresponding variables on other implicit tasks or threads in the team.
Thus, in your case, you should rather use the firstprivate clause.
Please note that version 5.0 of the OpenMP documentation, which you are probably reading, is not supported by VS2015 (which implements only OpenMP 2.0). I advise you to read an older version of the specification that matches your compiler; otherwise the behavior of the compiled program is likely to be undefined.
Related
I'd like to run something like the following:
for (int index = 0; index < num; index++)
I'd want to run the for loop with four threads, with the iterations executing in order: 0, 1, 2, 3, 4, 5, 6, 7, 8, etc.
That is, while the threads are working on index = n, (n+1), (n+2), (n+3) (in any order, but always in this pattern), I want iterations index = 0, 1, 2, ..., (n-1) to already be finished.
Is there a way to do this? Ordered doesn't really work here, since making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work because I don't want a thread to be handed a contiguous block of iterations k through k+num/4.
Thanks for any help!
You can do this not with a parallel for loop, but with a parallel region that manages its own loop inside, plus a barrier to make sure all running threads have reached the same point in it before continuing. Example:
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>
int main()
{
atomic_int chunk = 0;
int num = 12; // should be a multiple of nthreads, or the final barrier round deadlocks
int nthreads = 4;
omp_set_num_threads(nthreads);
#pragma omp parallel shared(chunk, num, nthreads)
{
for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
printf("In index %d\n", index);
fflush(stdout);
#pragma omp barrier
// For illustrative purposes only; not needed in real code
#pragma omp single
{
puts("After barrier");
fflush(stdout);
}
}
}
puts("Done");
return 0;
}
One possible output:
$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done
I'm not sure I understand your request correctly. If I try to summarize my interpretation of it, it would be something like: "I want 4 threads sharing the iterations of a loop, with the 4 threads always running on at most 4 consecutive iterations of the loop".
If that's what you want, what about something like this:
int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
    int end = index_outer + nths < num ? index_outer + nths : num; // min( index_outer + nths, num )
#pragma omp for
for( int index = index_outer; index < end; index++ ) {
// the loop body just as before
} // there's a thread synchronization here
}
I'm trying to count the number of prime numbers up to 10 million, and I have to do it with multiple threads using POSIX threads (so that each thread computes a subset of the 10 million). However, my code is not producing the correct prime count. I think this is due to a race condition; if it is, what can I do to fix the issue?
I've tried using a global integer array with k elements, but since k is not known at compile time, the compiler won't let me declare it at file scope.
I'm compiling my code with gcc -pthread:
/*
Program that spawns off "k" threads
k is read in at command line each thread will compute
a subset of the problem domain(check if the number is prime)
to compile: gcc -pthread lab5_part2.c -o lab5_part2
*/
#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
typedef int bool;
#define FALSE 0
#define TRUE 1
#define N 10000000 // 10 Million
int k; // global variable k will hold the number of threads
int primeCount = 0; //it will hold the number of primes.
//returns whether num is prime
bool isPrime(long num) {
long limit = sqrt(num);
for(long i=2; i<=limit; i++) {
if(num % i == 0) {
return FALSE;
}
}
return TRUE;
}
//function to use with threads
void* getPrime(void* input){
//get the thread id
long id = (long) input;
printf("The thread id is: %ld \n", id);
//how many iterations each thread will have to do
int numOfIterations = N/k;
//check the last thread. to make sure is a whole number.
if(id == k-1){
numOfIterations = N - (numOfIterations * id);
}
long startingPoint = (id * numOfIterations);
long endPoint = (id + 1) * numOfIterations;
for(long i = startingPoint; i < endPoint; i +=2){
if(isPrime(i)){
primeCount ++;
}
}
//terminate calling thread.
pthread_exit(NULL);
}
int main(int argc, char** args) {
//get the num of threads from command line
k = atoi(args[1]);
//make sure is working
printf("Number of threads is: %d\n",k );
struct timespec start,end;
//start clock
clock_gettime(CLOCK_REALTIME,&start);
//create an array of threads to run
pthread_t* threads = malloc(k * sizeof(pthread_t));
for(int i = 0; i < k; i++){
pthread_create(&threads[i],NULL,getPrime,(void*)(long)i);
}
//wait for each thread to finish
int retval;
for(int i=0; i < k; i++){
int * result = NULL;
retval = pthread_join(threads[i],(void**)(&result));
}
//get the time time_spent
clock_gettime(CLOCK_REALTIME,&end);
double time_spent = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec)/1000000000.0f;
printf("Time taken: %f seconds\n", time_spent);
printf("%d primes found.\n", primeCount);
}
The current output I am getting (using 2 threads):
Number of threads is: 2
Time taken: 0.038641 seconds
2 primes found.
The counter primeCount is modified by multiple threads, and therefore must be atomic. To fix this using the standard library (which is now supported by POSIX as well), you should #include <stdatomic.h>, declare primeCount as an atomic_int, and increment it with an atomic_fetch_add() or atomic_fetch_add_explicit().
Better yet, if you don’t care about the result until the end, each thread can store its own count in a separate variable, and the main thread can add all the counts together once the threads finish. You will need to create, in the main thread, an atomic counter per thread (so that updates don’t clobber other data in the same cache line), pass each thread a pointer to its output parameter, and then return the partial tally to the main thread through that pointer.
This looks like an exercise that you want to solve yourself, so I won't write the code for you, but the approach would be to declare an array of counters, like the array of thread IDs, and pass &counters[i] as the arg parameter of pthread_create(), similarly to how you pass &threads[i]. Each thread needs its own counter. At the end of the thread procedure, you would write something like atomic_store_explicit( (atomic_int*)arg, localTally, memory_order_relaxed );. This should be completely wait-free on all modern architectures.
You might also decide that it’s not worth going to that trouble to avoid a single atomic update per thread, declare primeCount as an atomic_int, and then atomic_fetch_add_explicit( &primeCount, localTally, memory_order_relaxed ); once before the thread procedure terminates.
I ran an example OpenMP program to test the behavior of threadprivate, but the result is unexpected and nondeterministic. (The program is run on Ubuntu 14.04 in VMware.)
The source code is shown below:
#include <stdio.h>
int counter = 0;
#pragma omp threadprivate (counter)
int inc_counter()
{
counter++;
return counter;
}
void main()
{
#pragma omp parallel sections copyin(counter)
{
#pragma omp section
{
int count1;
for(int iter = 0; iter < 100; ++iter)
count1 = inc_counter();
printf("count1 = %1d\n", count1);
}
#pragma omp section
{
int count2;
for(int iter=0; iter<200; iter++)
count2 = inc_counter();
printf("count2 = %d\n", count2);
}
}
printf("counter = %d\n", counter);
}
The output of the program varies from run to run. The correct output, I thought, should be:
count1 = 100
count2 = 200
counter = 0/100/200?
What's wrong with it?
threadprivate differs from private in that a new local copy doesn't have to be allocated for each region; the variable only has to be unique per thread. One of your threads (the master) uses the global counter defined by int counter = 0;. Therefore that thread changes this value to 100, to 200, or leaves it unchanged (0), depending on which sections that single thread happened to execute.
As you highlighted, it may seem weird that the program gives results like (count1, count2, counter) = (100, 300, 300) or (100, 300, 0).
(100,300,300)
The master thread executes both sections. You can check this by launching your code with a single thread: OMP_NUM_THREADS=1 ./ex_omp
(100,300,0)
Some other thread executes both sections while the master is idle. You can check this by introducing a third section (alongside your two):
#pragma omp section
{
sleep(1); // #include <unistd.h>
printf("hope it is 0 (master) -> %d\n", omp_get_thread_num());
}
If you have 2 threads and the master starts executing this section, then with high probability the other thread executes your two other sections, and you get (100, 300, 0) as expected. Launch it, for example, as OMP_NUM_THREADS=2 ./ex_omp.
If count2 = 300 still seems wrong, notice that counter is not private to a section; it is private to a thread, and one thread can execute both sections.
I have a situation where I need to repeat a specific iteration of a loop multiple times. So, in that specific iteration, I decrement the loop index by one so that the next increment of the index makes no difference.
This approach, which is the one I have to implement, works for multithreaded OpenMP code. However, it does not work for OpenACC (for both the multicore and tesla targets). I get the following error:
Floating point exception (core dumped)
Here is the code for both cases:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
omp_set_num_threads(6);
#pragma omp parallel for
for(i=0;i<100;i++) {
if(i == x) {
printf("%d\n", i);
i--;
count--;
if(count == 0)
x = 10000;
}
}
int gpu_count = 0;
count = 5;
x = 52;
#pragma acc parallel loop independent
for(i=0;i<1000000;i++) {
if(i == x) {
#pragma acc atomic
gpu_count++;
i--;
count--;
if(count == 0)
x = 2000000;
}
}
printf("gpu_count: %d\n", gpu_count);
return 0;
}
For OpenMP, I get the correct output:
52
52
52
52
52
But for OpenACC, I get the above-mentioned error.
If I comment out the i--; line, the code executes correctly and outputs the number of repeated iterations (which is 1).
Note: I am using PGI 16.5 with Geforce GTX 970 and CUDA 7.5.
I compile with the PGI compiler as follows:
pgcc -mp -acc -ta=multicore -g f1.c
So my question is: why do I see this behavior? Am I not allowed to change the loop index variable in OpenACC?
Your OpenMP version is in error. You're relying on a static schedule where the chunk size is larger than "count". If you increase the number of OMP threads so that the chunk size is smaller than count, or if you change the schedule to interleave the chunks (i.e. "schedule(static,1)"), then you'll get wrong answers. There are also race conditions on "x" and "count".
Note that OpenACC scheduling is more like OpenMP's "static,1", so that vectors can access contiguous blocks of memory across a worker (aka a CUDA warp). So your algorithm won't work here either.
Also, by using the "independent" clause (which is implied when using "parallel loop"), you are asserting to the compiler that this loop contains no dependencies, or that you will handle them via the "atomic" directive. However, changing the loop index variable inside the body of the loop creates a loop dependency, since the value of the loop index depends on whether the previous iteration changed its value.
Edit: Below is an example which is a parallelizable version of your code.
% cat test2.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
int mycnt;
#pragma omp parallel for schedule(static,1) private(mycnt)
for(i=0;i<100;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
printf("%d\n", i);
mycnt--;
}
}
}
#ifdef _OPENACC
int gpu_count = 0;
#pragma acc parallel loop reduction(+:gpu_count)
for(i=0;i<1000000;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
gpu_count++;
mycnt--;
}
}
}
printf("gpu_count: %d\n", gpu_count);
#endif
return 0;
}
% pgcc -fast -mp -acc test2.c -Minfo=mp,acc
main:
13, Parallel region activated
Parallel loop activated with static cyclic schedule
24, Barrier
Parallel region terminated
25, Accelerator kernel generated
Generating Tesla code
25, Generating reduction(+:gpu_count)
26, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
29, #pragma acc loop seq
29, Loop carried scalar dependence for gpu_count at line 30
% a.out
52
52
52
52
52
gpu_count: 5
I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. The program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use per-thread variables to get around that.
In your code, j is a shared induction variable. You can't use a shared induction variable efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that avoids induction variables (for example, wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
int i, j = 0;           // assumed declared as in the question
unsigned int total = 0;
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could approximate in C using realloc, but I'm not going to do that for you.
Edit:
Here is a special solution using wheel factorization which does not use shared induction variables:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without the critical section implied by the ordered clause. This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread-synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j, in your case). That could be a mutex, a semaphore, or an event object, depending on the operating system you're working on. Whatever your development environment, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on Windows, the environment provides some special synchronization mechanisms (e.g. InterlockedIncrement, or critical sections via InitializeCriticalSection, etc.). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. All of these synchronization mechanisms stem from the concept of the semaphore, invented by Edsger W. Dijkstra in the early 1960s.
Here are some basic examples:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the latest value first, and you end up overwriting data with stale results.
OpenMP has an option to make the threads in a loop each accumulate a partial sum that is combined when they finish. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is accumulate the result in "sum"; this is from some old code of mine that performs a summation.
The other option you have is the dirty one: somehow make the threads wait and take their turns using a call to the OS. This is easier than it looks, but it largely serializes the work. Something like this:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas"); // the printf is the OS call that (unreliably) serializes the threads
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend that you read through the OpenMP options in full.