I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorials, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values (functions serial, parallel);
the second takes values from an input array and writes the processed values to an output array (functions serial_a, parallel_a).
My questions are:
Why do the results of the first test differ, i.e. the results of the serial and the parallel function?
Surprisingly the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads; in the example this is emulated by sin(a[i]) / cos(a[n-i]).
Are there some easy rules for determining which variables to declare as private, shared, and reduction?
In both cases int i is declared outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared where you use parallel. Private and shared variables should also be declared on the parallel directive. But when you do a reduction you should not declare the variable that is being reduced as shared; the reduction clause takes care of this.
To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not written to then normally you want it to be shared. In your case the variable x does not change, so you should declare it shared; declaring it private, as your code does, leaves each thread's copy of x uninitialized inside the region, which is why your first test gives different results. The variable i, however, does change, so normally you should declare it private. So to fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iterator of a parallel for region private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
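For concreteness, here is a minimal sketch of what the corrected parallel function could look like with that single pragma (it assumes the same includes as your file; the num_threads(6) clause from your version is left out, but could be added back):
double parallel(int x)
{
    double sum = 0;
    int i;
    /* x is read-only and stays shared by default; i is the loop iterator,
       which OpenMP makes private automatically; sum is combined through the reduction */
    #pragma omp parallel for reduction(+: sum)
    for(i = 0; i<x; i++){
        sum += sin(x*i) / cos(x*i+1.0);
    }
    return sum;
}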
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
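For illustration, a sketch of the two usual ways to handle such a nested loop (n, m and the loop bodies are just placeholders):
// Option 1: declare the inner iterator private explicitly
#pragma omp parallel for private(j)
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        //
    }
}
// Option 2 (C99): declare both iterators inside the loops,
// so each thread automatically gets its own copies
#pragma omp parallel for
for(int i=0; i<n; i++) {
    for(int j=0; j<m; j++) {
        //
    }
}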
Related
I have a program in C that uses OpenMP, which can be seen below; the program is used to compute pi given a number of steps. However, I am new to OpenMP, so my knowledge is limited.
I'm attempting to implement a barrier for this program, but I believe one is already implicit, so I'm not sure if I even need to implement it.
Thank you!
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
static long num_steps = 100000000;
double step;
int main()
{
int i;
double start_time, run_time, pi, sum[NUM_THREADS];
omp_set_num_threads(NUM_THREADS);
step = 1.0 / (double)num_steps;
start_time = omp_get_wtime();
#pragma omp parallel
{
int i, id, currentThread;
double x;
id = omp_get_thread_num();
currentThread = omp_get_num_threads();
for (i = id, sum[id] = 0.0; i < num_steps; i = i + currentThread)
{
x = (i + 0.5) * step;
sum[id] = sum[id] + 4.0 / (1.0 + x * x);
}
}
run_time = omp_get_wtime() - start_time;
//we then get the value of pie
for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
{
pi = pi + sum[i] * step;
}
printf("\n pi with %ld steps is %lf \n ", num_steps, pi);
printf("run time = %6.6f seconds\n", run_time);
}
In your case there is no need for an explicit barrier; there is an implicit barrier at the end of the parallel region.
Your code, however, has a performance issue. Different threads update adjacent elements of the sum array, which can cause false sharing:
When multiple threads access same cache line and at least one of them
writes to it, it causes costly invalidation misses and upgrades.
To avoid it you have to make sure that each element of the sum array is located on a different cache line, but there is a simpler solution: use OpenMP's reduction clause. Please check this example suggested by @JeromeRichard. Using reduction, your code should be something like this:
double sum=0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < num_steps; i++)
{
const double x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
Note also that you should use your variables in their minimum required scope.
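Putting both points together, a minimal sketch of the whole program using the reduction clause and keeping variables in their smallest scope might look like this (the thread count is left at the default here; it is an illustration, not a drop-in replacement):
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000000;
int main()
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    double start_time = omp_get_wtime();
    // each thread accumulates into its own private copy of sum;
    // OpenMP combines the copies when the loop finishes
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++)
    {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    double run_time = omp_get_wtime() - start_time;
    double pi = sum * step;
    printf("pi with %ld steps is %lf\n", num_steps, pi);
    printf("run time = %6.6f seconds\n", run_time);
    return 0;
}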
I am trying to understand how OMP treats different for loop declarations. I have:
int main()
{
int i, A[10000]={...};
double ave = 0.0;
#pragma omp parallel for reduction(+:ave)
for(i=0;i<10000;i++){
ave += A[i];
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
where {...} are the numbers from 1 to 10000. This code prints the correct value. If instead of #pragma omp parallel for reduction(+:ave) I use #pragma omp parallel for private(ave), the result of the printf is 0.0000. I think I understand what reduction(oper:list) does, but I was wondering whether it can be replaced by private, and how.
So yes, you can do reductions without the reduction clause. But that has a few downsides that you have to understand:
You have to do things by hand, which is more error-prone:
declare local variables to store local accumulations;
initialize them correctly;
accumulate into them;
do the final reduction in the initial variable using a critical construct.
This is harder to understand and maintain
This is potentially less effective...
Anyway, here is an example of that using your code:
int main() {
int i, A[10000]={...};
double ave = 0.0;
double localAve;
#pragma omp parallel private( i, localAve )
{
localAve = 0;
#pragma omp for
for( i = 0; i < 10000; i++ ) {
localAve += A[i];
}
#pragma omp critical
ave += localAve;
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
This is a classical method for doing reductions by hand, but notice that the variable that would have carried the reduction clause (ave) is not declared private here. What becomes private is a local substitute for it (localAve), while the global variable must remain shared.
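As a side note on the design choice: when the final step is a single scalar update like this one, the critical construct can also be replaced by an atomic update, which is typically cheaper for this pattern. A sketch of just that last accumulation (everything else stays as above):
    // after accumulating into the private localAve inside the parallel region
    #pragma omp atomic
    ave += localAve;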
I have been trying to parallelize the computation of the sum of a series, distributing a certain number of terms to the processors using block allocation.
In this program, I am generating an arithmetic series and want to pass the array as a shared variable in the pragma omp parallel directive,
but an error occurs at the line #pragma omp parallel num_threads(comm_sz, number, BLOCK_LOW, BLOCK_HIGH, a[n], first, difference, global_sum1), giving the error below:
expected ')' before '[' token
I am new to OpenMP in C. I have written the code below and am facing the error above. I searched on Google but was unable to find a solution.
Kindly help me with how to declare an array as a shared variable in the pragma omp parallel directive. I am attaching the code below.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main (int argc, char *argv[])
{
int rank, comm_sz;
int number, i, first, difference, global_sum1, global_sum, nprocs, step, local_sum1, local_n;
int* a;
int BLOCK_LOW, BLOCK_HIGH;
double t0, t1;
comm_sz = atoi(argv[1]);
first = atoi(argv[2]);
difference = atoi(argv[3]);
number = atoi(argv[4]);
omp_set_num_threads (comm_sz);
rank = omp_get_thread_num();
a = (int*) malloc (n*sizeof(int));
printf("comm_sz=%d, first=%d, difference=%d, number of terms=%d\n",comm_sz, first, difference, number);
for(i=1; i <= number; i++){
a[i-1] = first + (i-1)*difference;
printf("a[%d]=%d\n",i-1,a[i]);
}
for(i=0; i < number; i++){
printf("a[%d]=%d\n",i,a[i]);}
t0 = omp_get_wtime();
#pragma omp parallel num_threads(comm_sz, number, BLOCK_LOW, BLOCK_HIGH, a[n], first, difference, global_sum1)
{
BLOCK_LOW = (rank * number)/comm_sz;
BLOCK_HIGH = ((rank+1) * number)/comm_sz;
#pragma omp parallel while private(i, local_sum1)
//int local_sum1 = 0;
i=BLOCK_LOW;
while( i < BLOCK_HIGH )
{
printf("%d, %d\n",BLOCK_LOW,BLOCK_HIGH);
local_sum1 = local_sum1 + a[i];
i++;
}
//global_sum1 = global_sum1 + local_sum1;
#pragma omp while reduction(+:global_sum1)
i=0;
for (i < comm_sz) {
global_sum1 = global_sum1 + local_sum1;
i++;
}
}
step = 2*first + (n-1)*difference;
sum = 0.5*n*step;
printf("sum is %d\n", global_sum );
t1 = omp_get_wtime();
printf("Estimate of pi: %7.5f\n", global_sum1);
printf("Time: %7.2f\n", t1-t0);
}
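For what it's worth, here is a hedged sketch of a clause list the compiler will accept: num_threads takes only the thread count, and shared/private take plain variable names, so the array is named as the pointer a rather than a[n] (the names mirror the question; this only shows the clause syntax, it is not a fix for the rest of the program):
#pragma omp parallel num_threads(comm_sz) shared(number, a, first, difference, global_sum1) private(BLOCK_LOW, BLOCK_HIGH)
{
    int rank = omp_get_thread_num();   /* the thread id must be queried inside the region */
    BLOCK_LOW = (rank * number) / comm_sz;
    BLOCK_HIGH = ((rank + 1) * number) / comm_sz;
    /* ... per-thread work on a[BLOCK_LOW .. BLOCK_HIGH-1] ... */
}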
I've just started with OpenMP and I need help.
I have a program and I need to parallelize it. This is what I have:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define N1 3000000
#define it 5
struct timeval t0, t1;
int i, itera_kop;
int A[N1], B[N1];
void Exe_Denbora(char * pTestu, struct timeval *pt0, struct timeval *pt1)
{
double tej;
tej = (pt1->tv_sec - pt0->tv_sec) + (pt1->tv_usec - pt0->tv_usec) / 1e6;
printf("%s = %10.3f ms (%d hari)\n",pTestu, tej*1000, omp_get_max_threads());
}
void sum(char * pTestu, int *b, int n)
{
double bat=0;
int i;
for (i=0; i<n; i++) bat+=b[i];
printf ("sum: %.1f\n",bat);
}
main ()
{
for (itera_kop=1;itera_kop<it;itera_kop++)
{
for(i=0; i<N1; i++)
{
A[i] = 1;
B[i] = 3;
}
gettimeofday(&t0, 0);
#pragma omp parallel for private(i)
for(i=2; i<N1; i++)
{
A[i] = 35 / (7/B[i-1] + 2/A[i]);
B[i] = B[i] / (A[i-1]+2) + 3 / B[i];
}
gettimeofday(&t1, 0);
Exe_Denbora("T1",&t0,&t1);
printf ("\n");
}
printf("\n\n");
sum("A",A,N1);
sum("B",B,N1);
}
If I execute the code without using #pragma omp parallel for I get:
A sum: 9000005.5
B sum: 3000005.5
But if I try to parallelize the code I get:
A sum: 9000284.0
B sum: 3000036.0
using 32 threads.
I would like to know why I can't parallelize the code that way
As you are likely aware, your problem is in this for loop. You have a dependency between the two lines in the loop.
for(i=2; i<N1; i++)
{
A[i] = 35 / (7/B[i-1] + 2/A[i]);
B[i] = B[i] / (A[i-1]+2) + 3 / B[i];
}
We cannot know the order in which any given thread reaches either of those two lines. Therefore, as an example, when the second line executes, the value in B[i] will be different depending on whether A[i-1] has already been changed by another thread. The same can be said of A[i]'s dependency on the value of B[i-1]. A short and clear explanation of dependencies can be found at the following link; I would recommend you take a look if this still is not clear. https://scs.senecac.on.ca/~gpu621/pages/content/omp_2.html
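To make the dependency concrete, here is a tiny self-contained sketch with the same two statements on small arrays (the values are only illustrative):
#include <stdio.h>
int main(void)
{
    int A[5], B[5], i;
    for (i = 0; i < 5; i++) { A[i] = 1; B[i] = 3; }
    for (i = 2; i < 5; i++) {
        A[i] = 35 / (7/B[i-1] + 2/A[i]);     // for i > 2, B[i-1] was just written by iteration i-1
        B[i] = B[i] / (A[i-1]+2) + 3 / B[i]; // for i > 2, A[i-1] was just written by iteration i-1
    }
    for (i = 0; i < 5; i++) printf("A[%d]=%d B[%d]=%d\n", i, A[i], i, B[i]);
    return 0;
}
Because iteration i reads values that iteration i-1 has just written, the iterations cannot be distributed over threads without changing the results.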
I am writing a program that will match up one block (a group of 4 double numbers which are within a certain absolute value) with another.
Essentially, I will call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of a block happens: create_Block(rrr[k], i);).
It is okay to ignore all the function details, as they work well in the serial version. The only focus here is the OpenMP directives.
int main(void) {
readKey("keys.txt");
double** jz = readMatrix("data.txt");
int j = 0;
int i = 0;
int k = 0;
#pragma omp parallel for firstprivate(i) shared(Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (i = 0; i < 50; i++) {
printf("THIS IS COLUMN %d\n", i);
double*c = readCol(jz, i, 4400);
#pragma omp parallel for firstprivate(j) shared(i,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (j=0; j < 4400; j++) {
// printf("This is fixed row %d from column %d !!!!!!!!!!\n",j,i);
int* one_collection = collection(c, j, 4400);
// MODIFY THE DYMANIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
//GET THE 2D-ARRAY OF COMBINATION
int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
#pragma omp parallel for firstprivate(k) shared(i,j,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
create_Block(rrr[k], i); //ACTUAL CREATION OF BLOCK !!!!!!!
printf("This is block %d \n", NUM_OF_BLOCK);
add_To_Block_Collection();
}
free(rrr);
}
free(one_collection);
}
//OpenMP for j
free(c);
}
// OpenMP for i
collision();
}
Here is the parallel version's result: it is non-deterministic, whereas the serial version always produces 400 blocks.
Big_Block, NUM_OF_BLOCK, SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused such a problem?
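One general illustration of where the non-determinism can creep in: if several threads increment a shared counter such as NUM_OF_BLOCK without protection, updates can be lost or interleaved. A minimal, hedged sketch of a protected increment (assuming NUM_OF_BLOCK is a plain int counter, as in the question):
int my_block;                    /* private copy of the slot this thread claims */
#pragma omp atomic capture
my_block = NUM_OF_BLOCK++;       /* atomically reserve one slot in the shared counter */
/* the thread can now safely use my_block as its own block index */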