How to synchronize 3 nested loops in OpenMP?

I am writing a program that matches one block (a group of 4 double numbers which are within a certain absolute value) with another.
Essentially, I will call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of the block happens: create_Block(rrr[k], i);).
It is OK to ignore the function details, as they work well in the serial version. The only focus here is the OpenMP directives.
int main(void) {
readKey("keys.txt");
double** jz = readMatrix("data.txt");
int j = 0;
int i = 0;
int k = 0;
#pragma omp parallel for firstprivate(i) shared(Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (i = 0; i < 50; i++) {
printf("THIS IS COLUMN %d\n", i);
double*c = readCol(jz, i, 4400);
#pragma omp parallel for firstprivate(j) shared(i,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (j=0; j < 4400; j++) {
// printf("This is fixed row %d from column %d !!!!!!!!!!\n",j,i);
int* one_collection = collection(c, j, 4400);
// MODIFY THE DYMANIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
//GET THE 2D-ARRAY OF COMBINATION
int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
#pragma omp parallel for firstprivate(k) shared(i,j,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
create_Block(rrr[k], i); //ACTUAL CREATION OF BLOCK !!!!!!!
printf("This is block %d \n", NUM_OF_BLOCK);
add_To_Block_Collection();
}
free(rrr);
}
free(one_collection);
}
//OpenMP for j
free(c);
}
// OpenMP for i
collision();
}
Here is the parallel version result: non-deterministic
Whereas the serial result has constant 400 blocks.
Big_Block, NUM_OF_BLOCK and SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused this problem?

Related

Search of max value and index in a vector

I'm trying to parallelize this piece of code that searches for the maximum in a column.
The problem is that the parallelized version runs slower than the serial one.
Probably the search for the pivot (the maximum in a column) is slower due to the synchronization on the maximum value and the index, right?
int i,j,t,k;
// Decrease the dimension of a factor 1 and iterate each time
for (i=0, j=0; i < rwA, j < cwA; i++, j++) {
int i_max = i; // max index set as i
double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
if (matrixA[i_max][j] == 0) {
j++; //Check if there is a pivot in the column, if not pass to the next column
}
else {
//Swap the rows, of A, L and P
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
lupFactorization(matrixA,L,i,j,rwA);
}
}
void swapRows(double **matrixA, int i, int j, int i_max) {
double temp_val = matrixA[i][j];
matrixA[i][j] = matrixA[i_max][j];
matrixA[i_max][j] = temp_val;
}
I do not want different code; I only want to know why this happens. On a 1000x1000 matrix the serial version takes 4.1 s and the parallelized version 4.28 s.
The same thing (the overhead is very small, but it is there) happens with the row swap, which in theory can be done in parallel without problems. Why does it happen?
There are at least two things wrong with your parallelization. The first is this reduction:
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
The reduction gives you the largest index over all threads, but that index does not necessarily belong to the maximum value. For instance, consider the following array:
[8, 7, 6, 5, 4, 3, 2, 1]
If you parallelize with two threads, the first thread will have max=8 and index=0, and the second thread will have max=4 and index=4. After the reduction the max will be 8 but the index will be 4, which is obviously wrong.
OpenMP has built-in reduction operators that each consider a single target value; in your case, however, you want to reduce two values together, the max and its array index. Since OpenMP 4.0 you can define your own reduction (a user-defined reduction).
You can have a look at a full example implementing such logic here
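As a rough illustration of such a user-defined reduction, here is a minimal sketch. The names max_loc_t, maxloc and find_pivot are made up for this example and are not from the question; the column is passed as a plain array to keep it short. The combiner keeps the value and its own index together, so the two can no longer drift apart.

```c
#include <math.h>

/* Hypothetical helper type: a magnitude together with the row it came from. */
typedef struct {
    double value;
    int index;
} max_loc_t;

/* Combine two partial results: keep the larger magnitude and its own index.
   Ties go to the smaller index so the result matches a serial scan. */
#pragma omp declare reduction(maxloc : max_loc_t :                          \
        omp_out = (omp_in.value > omp_out.value ||                          \
                   (omp_in.value == omp_out.value &&                        \
                    omp_in.index < omp_out.index)) ? omp_in : omp_out)      \
        initializer(omp_priv = { -1.0, -1 })

/* Returns the largest |col[t]| for t in [start, n) and its position. */
max_loc_t find_pivot(const double *col, int start, int n)
{
    max_loc_t best = { -1.0, -1 };
    #pragma omp parallel for reduction(maxloc : best)
    for (int t = start; t < n; t++) {
        double v = fabs(col[t]);   /* compare and store the same quantity */
        if (v > best.value) {
            best.value = v;
            best.index = t;
        }
    }
    return best;
}
```

For the array above, find_pivot(a, 0, 8) yields value 8 and index 0 regardless of how the iterations are split between threads.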
The other issue is this part:
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
You are swapping those elements in parallel, which leads to an inconsistent state.
You need to solve those issues before analyzing why your code is not getting speedups.
First correctness, then efficiency. But don't expect much speedup with the current implementation: the amount of computation performed in parallel is not enough to justify the overhead of the parallelism.

OpenMP Parallelize code inside a for loop

I want to parallelize tasks inside a for loop using OpenMP. However, I do not want to use #pragma omp parallel for as the result of the (i+1)th iteration depends on the output of the (i)th iteration. I have tried to spawn the threads inside the code, but the time of creating and destroying them every time is very high. An abstract description of my code is:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
for (int i=0; i<1000; i++)
{
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
b_new = fun(b_old);
b_old = b_new;
c_new = fun(c_old);
c_old = c_new;
d_new = fun(d_old);
d_old = d_new;
}
How can I efficiently use threads to calculate the new values of a_new, b_new, c_new, d_new in parallel in each iteration ?
Just don't parallelize the code inside the for loop - move the parallel region to the outside. This reduces the thread creation and worksharing overhead. Then you can easily apply OpenMP sections:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
#pragma omp parallel sections
{
#pragma omp section
for (int i=0; i<1000; i++) {
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
b_new = fun(b_old);
b_old = b_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
c_new = fun(c_old);
c_old = c_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
d_new = fun(d_old);
d_old = d_new;
}
}
There is also another simplification:
int value[4] = {1, 1, 1, 1}; // same starting values as a_old..d_old
#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
for (int i=0; i<1000; i++) {
value[abcd] = fun(value[abcd]);
}
}
In either case, you might want to consider adding padding between the values to avoid false sharing if fun executes rather quickly.
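A minimal sketch of what such padding could look like, assuming a 64-byte cache line (this size is an assumption; check your hardware). The type padded_int, the function iterate_all and the body of fun are placeholders invented for this example.

```c
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each slot spans a full (assumed) cache line, so the value fields of
   two different slots never land on the same line. */
typedef struct {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
} padded_int;

/* Hypothetical stand-in for fun(): depends only on its argument. */
static int fun(int x)
{
    return (x * 3 + 1) % 1000003;
}

void iterate_all(padded_int slots[4], int steps)
{
    #pragma omp parallel for
    for (int abcd = 0; abcd < 4; abcd++) {
        int v = slots[abcd].value;   /* work on a private copy */
        for (int i = 0; i < steps; i++)
            v = fun(v);
        slots[abcd].value = v;       /* single write back at the end */
    }
}
```

Working on the local copy v already removes most of the shared writes; the padding covers the remaining ones.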
This is pretty straightforward. As @kbr mentioned in the comments, the calculations for a, b, c and d are independent, so you can run them in separate threads and pass the corresponding value as a parameter. Sample code looks like this:
#include <stdio.h>
#include <pthread.h>
// pthread start routines must take and return void *
void *thread_func(void *arg)
{
int *i = arg;
for (int j=0; j<1000; j++)
{
//Instead of incrementing, you can call whichever function you want here.
(*i)++;
}
return NULL;
}
int main()
{
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
pthread_t thread[4];
pthread_create(&thread[0],0,thread_func,&a_old);
pthread_create(&thread[1],0,thread_func,&b_old);
pthread_create(&thread[2],0,thread_func,&c_old);
pthread_create(&thread[3],0,thread_func,&d_old);
// pthread_join takes the pthread_t by value, not a pointer to it
pthread_join(thread[0],NULL);
pthread_join(thread[1],NULL);
pthread_join(thread[2],NULL);
pthread_join(thread[3],NULL);
printf("a_old %d\n",a_old);
printf("b_old %d\n",b_old);
printf("c_old %d\n",c_old);
printf("d_old %d\n",d_old);
}

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread variables to get around that.
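If only the total is needed, one way to get a per-thread variable is an OpenMP reduction. The sketch below is an illustration under that assumption and drops the NUMS array entirely; the function name sum_multiples is made up for this example.

```c
/* Sum of all multiples of 3 or 5 in [1, limit). */
unsigned int sum_multiples(int limit)
{
    unsigned int total = 0;
    /* Each thread accumulates into its own private copy of total;
       OpenMP adds the copies together when the loop ends. */
    #pragma omp parallel for reduction(+ : total)
    for (int i = 1; i < limit; i++) {
        if (i % 3 == 0 || i % 5 == 0)
            total += i;
    }
    return total;
}
```

For limit = 1000 this returns 233168 on every run, because no thread ever touches another thread's accumulator.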
In your code j is a shared induction variable. You can't rely on using shared induction variables efficiently with multiple threads (using atomic on every iteration is not efficient).
You could find a special solution that does not use induction variables (for example, wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could implement in C using realloc, for example, but I'm not going to do that for you.
Edit:
Here is a special solution that uses wheel factorization and no shared induction variables:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k]; // each turn of the wheel covers 15 consecutive integers
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (from the ordered clause). This does memcpy in parallel and also makes better use of memory at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads; i++) prefix[i+1] += prefix[i]; // prefix has nthreads+1 entries, so stop at prefix[nthreads]
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j in your case). That would be a mutex, a semaphore or an event object, depending on the operating system you're working on. Whatever your development environment is, the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on Windows, the environment provides some special synchronization mechanisms (e.g. InterlockedIncrement or InitializeCriticalSection). If you're working on a Unix/Linux-based operating system, you can use mutex or semaphore kernel synchronization objects. All of those synchronization mechanisms stem from the concept of the semaphore, which was invented by Edsger W. Dijkstra in the early 1960s.
Here are some basic examples:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
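Since the question itself uses OpenMP, the same "read and increment as one step" idea can also be written portably with an atomic capture (available since OpenMP 3.1). This is only a sketch: the function name collect_multiples is invented here, and the order of the entries in the array depends on thread timing.

```c
#define MAX 1000

/* Stores the multiples of 3 or 5 in [1, MAX) into nums, in no particular
   order, and returns how many were stored. */
int collect_multiples(unsigned int nums[MAX])
{
    int j = 0;
    #pragma omp parallel for
    for (int i = 1; i < MAX; i++) {
        if (i % 3 == 0 || i % 5 == 0) {
            int slot;
            /* Read the old value of j and increment it as one indivisible step. */
            #pragma omp atomic capture
            slot = j++;
            nums[slot] = i;   /* slot is unique to this iteration */
        }
    }
    return j;
}
```

Every hit still pays for one atomic operation, so this fixes the race but is not necessarily fast.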
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the current value, and you end up overwriting data.
OpenMP has an option that makes the threads in a loop combine their partial sums when they finish. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of computation */
gettimeofday(&fin,NULL);
All you have to do is write the result into "sum"; this is from old code of mine that computes a sum.
The other option is the dirty one: somehow make the threads wait and take turns by using a call to the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend that you read up fully on the OpenMP options.

What to heed, when reading an array from multiple threads?

I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values (functions serial, parallel);
the second takes values from an input array and writes the processed values to an output array (functions serial_a, parallel_a).
My questions are:
Why do the results of the first test differ, i.e. the results of serial and parallel?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this should be emulated by sin(a[i]) / cos(a[n-i]+1.0).
Are there some easy rules for determining which variables to declare as private, shared and reduction?
In both cases int i is declared outside the pragma; however, the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared when you use parallel. Private and shared variables should also be declared when you use parallel. But when you do a reduction you should not declare the variable that is being reduced as shared. The reduction will take care of this.
To know what to declare private or shared, ask yourself which variables are being written to. If a variable is not written to, you normally want it to be shared. In your case the variable x does not change, so you should declare it shared; declaring it private, as your code does, gives each thread its own uninitialized copy of x. The variable i, however, does change, so normally you should declare it private. To fix your function you could do:
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iterator of a parallel for loop private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do:
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
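Put together, the double loop with an explicitly private inner iterator might look like the sketch below. The function fill_sum and its row-major layout are invented purely for illustration.

```c
/* Fills an n x m row-major matrix so that out[i*m + j] == i + j. */
void fill_sum(int *out, int n, int m)
{
    int i, j;   /* declared outside the loops, as in the example above */
    /* i is the parallel loop variable and becomes private automatically;
       j must be listed explicitly, or all threads would share one j. */
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            out[i * m + j] = i + j;
        }
    }
}
```

If your compiler accepts C99, declaring the iterators inside the for statements (for (int j = 0; ...)) makes them private without any clause, because variables declared inside the parallel region belong to the thread that declares them.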

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using openMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for(j=0; j < (len2+1); j++)
score[0][j] = 0;
for(i=0; i < (len1+1); i++)
score[i][0] = 0;
// Calculating scores
for(i=1; i < (len1+1); i++) {
for(j=1; j < (len2+1) ;j++) {
if (seq1[i-1] == seq2[j-1]) {
score[i][j] = score[i-1][j-1] + 1;
}
else {
score[i][j] = max(score[i-1][j], score[i][j-1]);
}
}
}
The critical part is filling up the score matrix, and this is the part I'm mainly trying to parallelize.
One way to do it (which I chose) is filling up the matrix by anti-diagonals, so the left, top and top-left dependencies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
int score[len1 + 1][len2 + 1];
int i, j, k, iam;
char *lcs = NULL;
for(i=0;i<len1+1;i++)
for(j=0;j<len2+1;j++)
score[i][j] = -1;
#pragma omp parallel default(shared) private(iam)
{
iam = omp_get_thread_num();
// Preparing first row and first column with zeros
#pragma omp for
for(j=0; j < (len2+1); j++)
score[0][j] = iam;
#pragma omp for
for(i=0; i < (len1+1); i++)
score[i][0] = iam;
// Calculating scores
for(i=1; i < (len1+1); i++) {
k=i;
#pragma omp for
for(j=1; j <= i; j++) {
if (seq1[k-1] == seq2[j-1]) {
// score[k][j] = score[k-1][j-1] + 1;
score[k][j] = iam;
}
else {
// score[k][j] = max(score[k-1][j], score[k][j-1]);
score[k][j] = iam;
}
#pragma omp atomic
k--;
}
}
}
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to filling up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any ideas?
P.S. I know that accessing the matrix diagonally is very cache-unfriendly and threads could be unbalanced, but I only need it to work for now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the processors will perform the operation one at a time. You are looking for #pragma omp for private(k): the processors will no longer share the same value.
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for directive, k is shared between all threads by default. So, depending on the order in which the threads run, k will be decremented at unknown times (even with the omp atomic). For a fixed j, the value of k might therefore change during the execution of the loop body (between the branches of the if, for example).
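One way around the shared counter is to compute the second index from the loop variable instead of decrementing it (in the question's indexing that would be k = i - j + 1). The sketch below is not the asker's code: it uses its own diagonal index d, a heap-allocated table and a made-up function name lcs_length, and it opens a parallel loop per diagonal for simplicity.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the LCS length of seq1 and seq2 (or -1 if allocation fails),
   filling the score table one anti-diagonal at a time. Cells on the same
   anti-diagonal do not depend on each other. */
int lcs_length(const char *seq1, const char *seq2)
{
    int len1 = (int)strlen(seq1);
    int len2 = (int)strlen(seq2);
    int cols = len2 + 1;
    /* calloc zeroes the table, which also prepares row 0 and column 0. */
    int *score = calloc((size_t)(len1 + 1) * (size_t)cols, sizeof *score);
    if (score == NULL)
        return -1;

    /* d = i + j is constant along an anti-diagonal. */
    for (int d = 2; d <= len1 + len2; d++) {
        int i_lo = (d - len2 > 1) ? d - len2 : 1;
        int i_hi = (d - 1 < len1) ? d - 1 : len1;
        #pragma omp parallel for
        for (int i = i_lo; i <= i_hi; i++) {
            int j = d - i;   /* derived per iteration, never shared */
            if (seq1[i - 1] == seq2[j - 1]) {
                score[i * cols + j] = score[(i - 1) * cols + (j - 1)] + 1;
            } else {
                int up = score[(i - 1) * cols + j];
                int left = score[i * cols + (j - 1)];
                score[i * cols + j] = (up > left) ? up : left;
            }
        }
    }
    int result = score[len1 * cols + len2];
    free(score);
    return result;
}
```

The implicit barrier at the end of each parallel loop guarantees that diagonal d is complete before diagonal d + 1 starts reading it.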

Resources