OpenMP multiple writing on dynamic array C - c

Using OpenMP, I'm trying to parallelize the creation of a kind of dictionary, defined as follows.
typedef struct Symbol {
int usage;
char character;
} Symbol;
typedef struct SymbolDictionary {
int charsNr;
Symbol *symbols;
} SymbolDictionary;
I wrote the following code.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <omp.h>
static const int n = 10;
int main(int argc, char* argv[]) {
int thread_count = strtol(argv[1], NULL, 10);
omp_set_dynamic(0);
omp_set_num_threads(thread_count);
SymbolDictionary **symbolsDict = calloc(omp_get_max_threads(), sizeof(SymbolDictionary*));
SymbolDictionary *dict = NULL;
int count = 0;
#pragma omp parallel for firstprivate(dict, count) shared(symbolsDict)
for (int i = 0; i < n; i++) {
if (count == 0) {
dict = calloc(1, sizeof(SymbolDictionary));
dict->charsNr = 0;
dict->symbols = calloc(n, sizeof(Symbol));
#pragma omp critical
symbolsDict[omp_get_thread_num()] = dict;
}
dict->symbols[count].usage = i;
dict->symbols[count].character = 'a' + i;
++dict->charsNr;
++count;
}
if (omp_get_max_threads() > 1) {
// merge the dictionaries
}
for (int j = 0; j < symbolsDict[0]->charsNr; j++)
printf("symbolsDict[0][%d].character: %c\nsymbolsDict[0][%d].usage: %d\n",
j,
symbolsDict[0]->symbols[j].character,
j,
symbolsDict[0]->symbols[j].usage);
for (int i = 0; i < omp_get_max_threads(); i++)
free(symbolsDict[i]->symbols);
free(symbolsDict);
return 0;
}
The code compiles and runs, but I'm not sure how the omp block works or whether I implemented it correctly. In particular, I have to attach dict to symbolsDict at the beginning of the loop, because I don't know when a thread will complete its work. By doing that, different threads will probably write to symbolsDict at the same time, but to different memory locations. Although the threads use different slots and dict should be different for every thread, I'm not sure this is a good way to do it.
I tested the code with different numbers of threads and with dictionaries of different sizes.
I didn't have any problems, but maybe that was just chance.
I have only read the theory in the documentation, so I would like to know whether I implemented the code correctly. If not, what is incorrect and why?

different threads will write inside symbolsDict at the same time but in different memory. Although the threads will use different access points, dict should be different for every thread, I'm not sure this is a good way to do that.
It isn't a good way but it is safe. A cleaner way would be this:
SymbolDictionary **symbolsDict = calloc(
omp_get_max_threads(), sizeof(SymbolDictionary*));
#pragma omp parallel
{
SymbolDictionary *dict = calloc(1, sizeof(SymbolDictionary));
int count = 0;
dict->charsNr = 0;
dict->symbols = calloc(n, sizeof(Symbol));
symbolsDict[omp_get_thread_num()] = dict;
# pragma omp for nowait
for(int i = 0; i < n; i++) {
dict->symbols[count].usage = i;
dict->symbols[count].character = 'a' + i;
++dict->charsNr;
++count;
}
}
Note that the inner pragma is omp for, not omp parallel for, so it uses the enclosing parallel region to distribute its work. The nowait is a performance improvement that avoids a thread barrier at the end of the loop: the loop is the last part of the parallel region, and threads wait for each other at the end of the region anyway.
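The question leaves the merge step as a comment. A minimal sketch of one way to do it, assuming the same Symbol and SymbolDictionary types: the helper names merge_dictionaries and free_dictionary are hypothetical, not part of the original code. The merge runs serially after the parallel region and skips slots that are still NULL, which can happen if fewer threads run than omp_get_max_threads() reported.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Symbol {
    int usage;
    char character;
} Symbol;

typedef struct SymbolDictionary {
    int charsNr;
    Symbol *symbols;
} SymbolDictionary;

/* Hypothetical helper: concatenates the per-thread dictionaries into a new
 * one, in slot order. NULL slots are skipped. Returns NULL if an allocation
 * fails; the input dictionaries are left untouched either way. */
SymbolDictionary *merge_dictionaries(SymbolDictionary **dicts, int nslots)
{
    int total = 0;
    for (int t = 0; t < nslots; t++)
        if (dicts[t] != NULL)
            total += dicts[t]->charsNr;

    SymbolDictionary *merged = calloc(1, sizeof *merged);
    if (merged == NULL)
        return NULL;

    /* Allocate at least one element so an empty result is still valid. */
    merged->symbols = calloc(total > 0 ? (size_t)total : 1, sizeof(Symbol));
    if (merged->symbols == NULL) {
        free(merged);
        return NULL;
    }

    for (int t = 0; t < nslots; t++) {
        if (dicts[t] == NULL || dicts[t]->charsNr == 0)
            continue;
        memcpy(merged->symbols + merged->charsNr, dicts[t]->symbols,
               (size_t)dicts[t]->charsNr * sizeof(Symbol));
        merged->charsNr += dicts[t]->charsNr;
    }
    return merged;
}

/* Hypothetical helper: frees one dictionary, including the struct itself. */
void free_dictionary(SymbolDictionary *dict)
{
    if (dict == NULL)
        return;
    free(dict->symbols);
    free(dict);
}
```

After the parallel region this would be called as merge_dictionaries(symbolsDict, omp_get_max_threads()). Note that the cleanup loop in the question frees only the symbols arrays; a helper like free_dictionary also releases each struct and tolerates NULL slots.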

Related

Concatenate private arrays after OpenMP

I am using OpenMP to parallelize the code below.
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
for(int i=0;i<n;i++){
if(M[n*element+i]==1 && element!=i){
neighbors[pos] = i;
pos++;
}
}
Below is the parallel version, which works properly. However, for large n it is slower because of the critical section around the shared variables neighbors and pos.
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
#pragma omp parallel for num_threads(8)
for(int i=0;i<n;i++){
if(M[n*element+i]==1 && element!=i){
#pragma omp critical
{
neighbors[pos] = i;
pos++;
}
}
}
As a result, I decided to use local variables for neighbors and pos and concatenate the private variables to a global one after the calculations. However, I have some troubles with these concatenations.
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
#pragma omp parallel num_threads(8)
{
int *temp_neighbors = malloc(num_of_neigh * sizeof(int));
int temp_pos = 0;
#pragma omp for
for(int i=0;i<n;i++){
if(M[n*element+i]==1 && element!=i){
temp_neighbors[temp_pos] = i;
temp_pos++;
}
}
// I want here to concatenate the local variables temp_neighbors to the global one neighbors.
}
I tried the code below in order to achieve the concatenation, but neighbors only receives the last value of temp_neighbors, while the remaining elements are 0.
#pragma omp critical
{
memcpy(neighbors+pos, temp_neighbors, num_of_neigh * sizeof(neighbors));
pos++;
}
So, the question is:
How can I concatenate private variables (and especially arrays) to a global one?
I searched a lot, but didn't find any proper answer. Thanks in advance and sorry for the big question.
The critical section should look like the following.
#pragma omp critical
{
memcpy(neighbors + pos, temp_neighbors, temp_pos * sizeof(int));
pos += temp_pos;
}
memcpy copies temp_pos elements (temp_pos * sizeof(int) bytes) to the first unused position in neighbors (neighbors + pos). Then pos is advanced by temp_pos so that it is again the index of the next unused position.
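Putting the pieces together, the whole pattern might look like the sketch below. The function name collect_neighbors and the error convention (returning -1 when a thread cannot allocate its buffer) are my own assumptions, not part of the question. The caller is assumed to pass a neighbors array that is large enough for all matches.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: gathers the neighbours of `element` from the n x n
 * matrix M into `neighbors` and returns how many were written, or -1 if a
 * thread could not allocate its private buffer. */
int collect_neighbors(const int *M, int n, int element, int *neighbors)
{
    int pos = 0;    /* shared: next free slot in neighbors */
    int failed = 0; /* shared: set if any allocation failed */

    #pragma omp parallel
    {
        /* Private buffer; n entries is a safe upper bound for one thread. */
        int *temp_neighbors = malloc((size_t)(n > 0 ? n : 1) * sizeof(int));
        int temp_pos = 0;

        if (temp_neighbors == NULL) {
            #pragma omp atomic write
            failed = 1;
        }

        /* Every thread must reach the loop, so the NULL check is inside it. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            if (temp_neighbors != NULL && M[n * element + i] == 1 && element != i) {
                temp_neighbors[temp_pos] = i;
                temp_pos++;
            }
        }

        /* One thread at a time appends its block and advances pos. */
        #pragma omp critical
        {
            if (temp_pos > 0) {
                memcpy(neighbors + pos, temp_neighbors,
                       (size_t)temp_pos * sizeof(int));
                pos += temp_pos;
            }
        }
        free(temp_neighbors);
    }
    return failed ? -1 : pos;
}
```

The critical section is entered once per thread instead of once per match, which is where the speedup over the first parallel version should come from. The order of the blocks in neighbors depends on which thread reaches the critical section first, so the result is not necessarily sorted.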

How to return each thread's output into an array using OpenMP?

I would like to write a multi-threaded program where each thread outputs an array with an unknown number of elements.
For example, select all numbers < 10 from an int array and put them into a new array.
Pseudo code (8 threads):
int *hugeList = malloc(10000000);
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
long *subList[8];//to fill each thread's result
#pragma omp parallel
for (long i = 0; i < 1000000; ++i)
{
long n = 0;
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
subList[threadNo][n] = hugeList[i];
n++;
}
}
Array "subList" should collect the elements of "hugeList" that satisfy the condition (< 10), in order and grouped by thread number.
How should I write the code? A better approach using OpenMP is also welcome.
There are several problems in your code.
1/ The omp pragma should be parallel for if you want the for loop to be parallelized. Otherwise, the whole loop is executed by every thread.
2/ The code is inconsistent with the comment
//do something to fill "subList" properly
hugeList[i] = subList[threadNo][n];
3/ How do you know the number of elements in your sublists? It must be returned to the main thread. You could use an array, but beware of false sharing. Better to use a local variable and write it out at the end of the parallel region.
4/ subList is not allocated. The difficulty is that you do not know the number of threads. You can ask OpenMP for the maximum number of threads (omp_get_max_threads) and do dynamic allocation. If you want some static allocation, maybe the best option is to allocate one large table and compute the actual address in every thread.
5/ OpenMP code should also work when compiled without OpenMP support. Use #ifdef _OPENMP for that.
Here is an (untested) way your code can be written
#define HUGE 10000000
int *hugeList = (int *) malloc(HUGE * sizeof(int)); // room for HUGE ints, not HUGE bytes
#ifdef _OPENMP
int thread_nbr=omp_get_max_threads();
#else
int thread_nbr=1; // to ensure proper behavior in a sequential context
#endif
struct thread_results { // to hold per thread results
int nbr; // nbr of generated results
int *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe. rand_r with a per-thread seed would be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
int *subList=(int *)malloc(HUGE*sizeof(int)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results* threadres=(struct thread_results *)malloc(thread_nbr*sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num() ; // hold thread id
int thread_nbr = omp_get_num_threads() ; // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res=threadres+thread_id;
res->nbr=0;
// compute address in subList table
res->results=subList+(HUGE/thread_nbr)*thread_id;
int * res_ptr=res->results; // local pointer. Each thread points to independent part of subList table
int n=0; // number of results. We want one per thread to only have local updates.
#pragma omp for
for (long i = 0; i < 1000000; ++i)
{
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
res_ptr[n]=hugeList[i];
n++;
}
}
res->nbr=n;
}
Updated complete code based on @Alain Merigot's answer
I tested the following code. It is reproducible (with and without the #pragma directives).
However, only the leading elements of subList are correct, while the rest are empty.
(filename.c)
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <math.h>
#define HUGE 10000000
#define DELAY 1000 //depends on your CPU power
//use global variables to store desired results, otherwise can't be obtain outside "pragma"
int n = 0;// number of results. We want one per thread to only have local updates.
double *subList;// table to hold thread results
int main()
{
double *hugeList = (double *)malloc(HUGE);
#ifdef _OPENMP
int thread_nbr = omp_get_max_threads();
#else
int thread_nbr = 1; // to ensure proper behavior in a sequential context
#endif
struct thread_results
{ // to hold per thread results
int nbr; // nbr of generated results
double *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe. drand48 should be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = sin(i); //fixed array content to test reproducibility
}
subList = (double *)malloc(HUGE * sizeof(double)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results *threadres = (struct thread_results *)malloc(thread_nbr * sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num(); // hold thread id
int thread_nbr = omp_get_num_threads(); // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res = threadres + thread_id;
res->nbr = 0;
// compute address in subList table
res->results = subList + (HUGE / thread_nbr) * thread_id;
double *res_ptr = res->results; // local pointer. Each thread points to independent part of subList table
#pragma omp for reduction(+ : n)
for (long i = 0; i < 1000000; ++i)
{
for (int i = 0; i < DELAY; ++i){}//do nothing, just waste time
if (hugeList[i] < 0)
{
//do something to fill "subList" properly
res_ptr[n] = hugeList[i];
n++;
}
}
res->nbr = n;
}
for (int i = 0; i < 10; ++i)
{
printf("sublist %d: %lf\n", i, subList[i]);//show some elements of subList to check reproducibility
}
printf("n = %d\n", n);
}
Linux compile: gcc -o filename filename.c -fopenmp -lm
I hope there can be more discussion of the mechanism of this code.
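On the mechanism, as far as I can tell two things explain the output. First, each thread writes into its own segment of subList, starting at subList + (HUGE / thread_nbr) * thread_id, so only the first thread's results sit at the front and the gaps between segments are never written. Second, n is a reduction variable: inside the loop every thread has a private copy starting at 0, but after the loop n holds the sum over all threads, so res->nbr = n records the total rather than that thread's count. The per-thread count needs to be an ordinary variable declared inside the parallel region, as in the answer. With correct per-thread counts, the segments can be packed to the front afterwards. A minimal sketch, where compact_segments is a hypothetical helper and the struct mirrors thread_results from the code above:

```c
#include <string.h>

typedef struct thread_results {
    int nbr;          /* number of results this thread produced */
    double *results;  /* start of this thread's segment in sub_list */
} thread_results;

/* Hypothetical helper: moves every thread's segment down so that all
 * results are contiguous at the start of sub_list. Assumes res[t].results
 * points into sub_list and that the segments appear in increasing order.
 * Returns the total number of results. */
long compact_segments(double *sub_list, const thread_results *res, int nthreads)
{
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        if (res[t].nbr > 0) {
            /* memmove because source and destination may overlap. */
            memmove(sub_list + total, res[t].results,
                    (size_t)res[t].nbr * sizeof(double));
        }
        total += res[t].nbr;
    }
    return total;
}
```

This would be called once, after the parallel region, with the number of threads that actually ran (for example a value saved from omp_get_num_threads() inside the region). Entries of threadres that belong to threads that never ran are uninitialized and must not be passed in.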

OpenMP Parallelize code inside a for loop

I want to parallelize tasks inside a for loop using OpenMP. However, I do not want to use #pragma omp parallel for as the result of the (i+1)th iteration depends on the output of the (i)th iteration. I have tried to spawn the threads inside the code, but the time of creating and destroying them every time is very high. An abstract description of my code is:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
for (int i=0; i<1000; i++)
{
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
b_new = fun(b_old);
b_old = b_new;
c_new = fun(c_old);
c_old = c_new;
d_new = fun(d_old);
d_old = d_new;
}
How can I efficiently use threads to calculate the new values of a_new, b_new, c_new, d_new in parallel in each iteration ?
Just don't parallelize the code inside the for loop - move the parallel region to the outside. This reduces the thread creation and worksharing overhead. Then you can easily apply OpenMP sections:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
#pragma omp parallel sections
{
#pragma omp section
for (int i=0; i<1000; i++) {
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
b_new = fun(b_old);
b_old = b_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
c_new = fun(c_old);
c_old = c_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
d_new = fun(d_old);
d_old = d_new;
}
}
There is also another simplification:
int value[4] = {1, 1, 1, 1}; // initial values of a_old, b_old, c_old, d_old
#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
for (int i=0; i<1000; i++) {
value[abcd] = fun(value[abcd]);
}
}
In either case, you might want to consider adding padding between the values to avoid false sharing if fun executes rather quickly.
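To illustrate the padding idea, here is a minimal sketch. The 64-byte cache line, the PaddedInt type, the iterate_chains function and the stand-in body of fun are all assumptions made for the example, not taken from the question.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes; check your CPU */

/* Consecutive array elements are CACHE_LINE bytes apart, so two values
 * never share a cache line. */
typedef struct PaddedInt {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
} PaddedInt;

/* Stand-in for fun() from the question: any pure function of its argument. */
static int fun(int x)
{
    return (x * 3 + 1) % 1000;
}

/* Runs `steps` dependent iterations on each of `count` independent chains. */
void iterate_chains(PaddedInt *values, int count, int steps)
{
    #pragma omp parallel for
    for (int c = 0; c < count; c++) {
        int v = values[c].value;  /* work on a thread-local copy */
        for (int i = 0; i < steps; i++)
            v = fun(v);
        values[c].value = v;      /* write back once at the end */
    }
}
```

Working on a local copy and writing back once already removes most of the cache-line traffic, so the padding mainly matters if the loop has to update values[c] directly in every iteration.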
This is pretty straightforward. As @kbr mentioned in the comments, the calculations of a, b, c and d are independent, so you can run each in a separate thread and pass the corresponding value as a parameter. The sample code looks like this.
#include<stdio.h>
#include <pthread.h>
void *thread_func(void *arg)
{
int *i = arg; // pthread start routines take and return void *
for (int j=0; j<1000; j++)
{
//Instead of incrementing, you can call whichever function you want here.
(*i)++;
}
return NULL;
}
int main()
{
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
pthread_t thread[4];
pthread_create(&thread[0],0,thread_func,&a_old);
pthread_create(&thread[1],0,thread_func,&b_old);
pthread_create(&thread[2],0,thread_func,&c_old);
pthread_create(&thread[3],0,thread_func,&d_old);
pthread_join(thread[0],NULL); // pthread_join takes the pthread_t itself, not its address
pthread_join(thread[1],NULL);
pthread_join(thread[2],NULL);
pthread_join(thread[3],NULL);
printf("a_old %d",a_old);
printf("b_old %d",b_old);
printf("c_old %d",c_old);
printf("d_old %d",d_old);
}

Array iteration in multiple threads

I've got the following example, let's say I want for each thread to count from 0 to 9.
void* iterate(void* arg) {
int i = 0;
while(i<10) {
i++;
}
pthread_exit(0);
}
int main() {
int j = 0;
pthread_t tid[100];
while(j<100) {
pthread_create(&tid[j],NULL,iterate,NULL);
pthread_join(tid[j],NULL);
}
}
variable i - is in a critical section, it will be overwritten multiple times and therefore threads will fail to count.
int* i=(int*)calloc(1,sizeof(int));
doesn't solve the problem either. I don't want to use mutex. What is the most common solution for this problem?
As other users are commenting, there are several problems in your example:
Variable i is not shared (to be shared it would have to be a global variable, for instance), nor is it in a critical section (it is local to each thread). To have a critical section you would use locks or transactional memory.
You don't need to create and destroy threads every iteration. Just create a number of threads at the beginning and wait for them to finish (join).
pthread_exit() is not necessary, just return from the thread function (with a value).
A counter is a bad example for threads. It requires atomic operations to avoid overwriting the value written by other threads. Actually, a multithreaded counter is a typical example of why atomic accesses are necessary (see this tutorial, for example).
I recommend starting with some tutorials, like this or this.
I also recommend frameworks like OpenMP, which simplify writing multithreaded programs.
EDIT: example of a shared counter and 4 threads.
#include <stdio.h>
#include <pthread.h>
#define NUM_THREADS 4
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;
void* iterate(void* arg) {
int i = 0;
while(i++ < 10) {
// enter critical section
pthread_mutex_lock(&mutex);
++counter;
pthread_mutex_unlock(&mutex);
}
return NULL;
}
int main() {
int j;
pthread_t tid[NUM_THREADS];
for(j = 0; j < NUM_THREADS; ++j)
pthread_create(&tid[j],NULL,iterate,NULL);
// let the threads do their magic
for(j = 0; j < NUM_THREADS; ++j)
pthread_join(tid[j],NULL);
printf("%d", counter);
return 0;
}
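Since the question asks for a solution without a mutex, the same shared counter can also be written with C11 atomics. This is a sketch under the assumption that the compiler supports <stdatomic.h> and that the program is linked with pthread support (for example -pthread); the run_counter function and the INCREMENTS constant are names I made up for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS 10

static atomic_int counter = 0;

static void *iterate(void *arg)
{
    (void)arg; /* unused */
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&counter, 1); /* indivisible read-modify-write */
    return NULL;
}

/* Hypothetical driver: starts the threads, waits for them, and returns
 * the final counter value. */
int run_counter(void)
{
    pthread_t tid[NUM_THREADS];
    int started = 0;

    atomic_store(&counter, 0);
    for (int j = 0; j < NUM_THREADS; j++) {
        if (pthread_create(&tid[j], NULL, iterate, NULL) != 0)
            break; /* stop creating threads on failure */
        started++;
    }
    /* Join only the threads that were actually created. */
    for (int j = 0; j < started; j++)
        pthread_join(tid[j], NULL);

    return atomic_load(&counter);
}
```

If all four threads start, run_counter() returns 40. The atomic version avoids lock and unlock calls, but every increment still contends for the same memory location, so a per-thread local count that is added once at the end usually scales better.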

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread variables to get around that.
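Another direct fix, shown here only as a sketch, is to make the read-and-increment of j a single atomic step with OpenMP's atomic capture (available since OpenMP 3.1). The function name store_multiples is hypothetical. The caller is assumed to provide room for at least max entries.

```c
/* Hypothetical helper: stores the multiples of 3 or 5 below `max` in nums
 * and returns how many were stored. The order of the stored values depends
 * on thread timing. */
int store_multiples(unsigned int *nums, int max)
{
    int j = 0; /* shared index of the next free slot */

    #pragma omp parallel for
    for (int i = 1; i < max; i++) {
        if (i % 5 == 0 || i % 3 == 0) {
            int slot;
            /* Read the old index and increment it as one indivisible step. */
            #pragma omp atomic capture
            slot = j++;
            nums[slot] = (unsigned int)i;
        }
    }
    return j;
}
```

Each thread gets a unique slot, so no value is overwritten and the sum over nums is the same on every run. The atomic update is still a point of contention, so this is correct but not necessarily fast.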
In your code j is a shared induction variable. You can't use a shared induction variable efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that avoids induction variables (for example using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables like this
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could implement in C using realloc, but I'm not going to do that for you.
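For readers who want the realloc variant, a minimal growable array might look like the sketch below. The IntVec type and the intvec_push and intvec_free functions are hypothetical names, and doubling from an initial capacity of 16 is just one common growth policy.

```c
#include <stdlib.h>

typedef struct IntVec {
    int *data;   /* heap buffer, NULL while empty */
    size_t len;  /* number of elements in use */
    size_t cap;  /* number of elements allocated */
} IntVec;

/* Appends value, doubling the capacity when the buffer is full.
 * Returns 0 on success, -1 if realloc fails (the vector is unchanged). */
int intvec_push(IntVec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 16;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}

/* Releases the buffer and resets the vector to the empty state. */
void intvec_free(IntVec *v)
{
    free(v->data);
    v->data = NULL;
    v->len = 0;
    v->cap = 0;
}
```

Inside the parallel region each thread would declare its own IntVec local = {0}; in place of NUMS_local[MAX], push matches into it, copy local.data (local.len elements) into the shared array in the ordered block, and then call intvec_free.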
Edit:
Here is a special solution that uses wheel factorization and needs no shared induction variable
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k]; // block i covers the numbers 15*i .. 15*i+14
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (from the ordered clause). This does memcpy in parallel and also makes better use of memory at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads; i++) prefix[i+1] += prefix[i]; // running sum; prefix has nthreads+1 entries
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j in your case). It could be a mutex, a semaphore or an event object, depending on the operating system you're working on. Whatever your development environment is, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on Windows, the environment provides some special synchronization mechanisms (e.g. InterlockedIncrement or InitializeCriticalSection). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore synchronization objects. All of these mechanisms stem from the concept of the semaphore, which was invented by Edsger W. Dijkstra at the beginning of the 1960s.
Here are some basic examples:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that threads don't necessarily execute in order, so a thread may write using a stale value of j and overwrite data written by another thread.
OpenMP has a clause that makes the threads in a loop sum their partial results when they finish. You have to write something like this to use it.
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of computation */
gettimeofday(&fin,NULL);
All you have to do is accumulate the result in "sum"; this is from some old code of mine that computes a sum.
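Applied to the question, where only the total is needed in the end, the reduction could look like this sketch (the function name sum_multiples is my own):

```c
/* Hypothetical helper: sums the multiples of 3 or 5 below `max`. Each
 * thread accumulates into a private copy of total, and OpenMP adds the
 * copies together at the end of the loop. */
unsigned int sum_multiples(int max)
{
    unsigned int total = 0;

    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i < max; i++) {
        if (i % 5 == 0 || i % 3 == 0)
            total += (unsigned int)i;
    }
    return total;
}
```

sum_multiples(1000) gives 233168 regardless of the number of threads. Note that total is explicitly initialized to 0 here, and that no shared array or index is needed.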
The other option is the dirty one: make the threads wait and fall into order by making a call to the OS, such as the printf below. This is easier than it looks, but it only changes the timing; the race on j is still there, so the result is not guaranteed.
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend that you read up on the OpenMP options in full.
