I have implemented parallel merge sort in C using OpenMP. The parallel version runs in 3.9 seconds, which is actually slower than the sequential version of the same code (3.6 seconds). I am trying to optimise the code as far as possible but can't get any speedup. Can you please help out with this? Thanks.
void partition(int arr[],int arr1[],int low,int high,int thread_count)
{
    int tid,mid;
    #pragma omp if
    if(low<high)
    {
        if(thread_count==1)
        {
            mid=(low+high)/2;
            partition(arr,arr1,low,mid,thread_count);
            partition(arr,arr1,mid+1,high,thread_count);
            sort(arr,arr1,low,mid,high);
        }
        else
        {
            #pragma omp parallel num_threads(thread_count)
            {
                mid=(low+high)/2;
                #pragma omp parallel sections
                {
                    #pragma omp section
                    {
                        partition(arr,arr1,low,mid,thread_count/2);
                    }
                    #pragma omp section
                    {
                        partition(arr,arr1,mid+1,high,thread_count/2);
                    }
                }
            }
            sort(arr,arr1,low,mid,high);
        }
    }
}
As was correctly noted, there are several mistakes in your code that prevent its correct execution, so I would first suggest reviewing those errors.
Anyhow, considering only how OpenMP performance scales with the number of threads, an implementation based on task directives might be a better fit, as it overcomes the limitation already pointed out by a previous answer:
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause
You can find a sketch of such an implementation below:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

void getTime(double *t) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    *t = tv.tv_sec + (tv.tv_usec * 1e-6);
}

int compare( const void * pa, const void * pb ) {
    const int a = *((const int*) pa);
    const int b = *((const int*) pb);
    return (a-b);
}

void merge(int * array, int * workspace, int low, int mid, int high) {
    int i = low;
    int j = mid + 1;
    int l = low;
    while( (l <= mid) && (j <= high) ) {
        if( array[l] <= array[j] ) {
            workspace[i] = array[l];
            l++;
        } else {
            workspace[i] = array[j];
            j++;
        }
        i++;
    }
    if (l > mid) {
        for(int k=j; k <= high; k++) {
            workspace[i]=array[k];
            i++;
        }
    } else {
        for(int k=l; k <= mid; k++) {
            workspace[i]=array[k];
            i++;
        }
    }
    for(int k=low; k <= high; k++) {
        array[k] = workspace[k];
    }
}

void mergesort_impl(int array[],int workspace[],int low,int high) {
    const int threshold = 1000000;
    if( high - low > threshold ) {
        int mid = (low+high)/2;
        /* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task
#endif
        mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
        mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
        /* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
        merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
    } else if (high - low > 0) {
        /* Coarsen the base case */
        qsort(&array[low],high-low+1,sizeof(int),compare);
    }
}

void mergesort(int array[],int workspace[],int low,int high) {
#ifdef _OPENMP
#pragma omp parallel
#endif
    {
#ifdef _OPENMP
#pragma omp single nowait
#endif
        mergesort_impl(array,workspace,low,high);
    }
}
const size_t largest = 100000000;
const size_t length = 10000000;

int main(int argc, char *argv[]) {
    int * array = NULL;
    int * workspace = NULL;
    double start,end;
    printf("Largest random number generated: %d \n",RAND_MAX);
    printf("Largest random number after truncation: %zu \n",largest);
    printf("Array size: %zu \n",length);
    /* Allocate and initialize random vector */
    array = (int*) malloc(length*sizeof(int));
    workspace = (int*) malloc(length*sizeof(int));
    for( size_t ii = 0; ii < length; ii++)
        array[ii] = rand()%largest;
    /* Sort */
    getTime(&start);
    mergesort(array,workspace,0,length-1);
    getTime(&end);
    printf("Elapsed time sorting: %g sec.\n", end-start);
    /* Check result */
    for( size_t ii = 1; ii < length; ii++) {
        if( array[ii] < array[ii-1] )
            printf("Error:\n%zu %d\n%zu %d\n",ii-1,array[ii-1],ii,array[ii]);
    }
    free(array);
    free(workspace);
    return 0;
}
Notice that if you are after performance you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls (that is what the threshold constant above does). Other than that, I would suggest profiling your code so you get a good hint of which parts are really worth optimizing.
It took some figuring out, which is a bit embarrassing, since once you see it, the answer is so simple.
As it stands in the question, the program doesn't work correctly: on some runs it randomly duplicates some numbers and loses others. This is purely a parallelism error, and it doesn't arise when the program runs with thread_count == 1.
The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.
The fix is to replace
#pragma omp parallel sections
with
#pragma omp sections
After this fix the program gives correct answers, and on a two-core system, sorting a million numbers, I get the following timings:
One thread:
time taken: 0.378794
Two threads:
time taken: 0.203178
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause, so change num_threads(thread_count) -> num_threads(2)
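Putting both changes together, the else branch of partition would look something like this sketch (based on the question's code; I have also hoisted the mid computation out of the parallel region so the threads don't redundantly write the shared mid):
else
{
    mid = (low + high) / 2;              /* compute once, before the region */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections             /* plain sections: no nested region */
        {
            #pragma omp section
            partition(arr, arr1, low, mid, thread_count/2);
            #pragma omp section
            partition(arr, arr1, mid+1, high, thread_count/2);
        }
    }
    sort(arr, arr1, low, mid, high);
}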
But because at least the two implementations I tried do not spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.
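(For completeness: most runtimes can be told to give extra threads to nested regions, though whether that actually helps is another matter. A minimal sketch of opting in, using standard OpenMP runtime calls:)
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Allow two levels of active parallelism; on pre-5.0 compilers
       omp_set_nested(1) is the equivalent (now deprecated) call. */
    omp_set_max_active_levels(2);
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    return 0;
}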
Related
I am learning OpenMP and found the following code to solve the N-Queens problem. I want to make a faster solution with parallel code, but I don't know how to do it. Every time I try something, it only gets worse and slower.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define BOARD_SIZE 15

int board[20],solutions;

int main() {
    double start_time, end_time, result_time;
    void queen(int row,int n);
    start_time = omp_get_wtime();
    queen(1, BOARD_SIZE);
    end_time = omp_get_wtime();
    result_time = end_time - start_time;
    printf("Time = %.3f seconds\n", result_time);
    printf("Number of solutions: %d\n", solutions);
    return 0;
}

/*function to check conflicts
  If there is no conflict for the desired position it returns 1, otherwise 0*/
int place(int row, int column) {
    for(int i=1;i<=row-1;++i) {
        //checking column and diagonal conflicts
        if(board[i]==column)
            return 0;
        else
            if(abs(board[i]-column)==abs(i-row))
                return 0;
    }
    return 1; //no conflicts
}

//function to check for proper positioning of queen
void queen(int row,int n) {
    for(int column=1;column<=n;++column) {
        if(place(row,column)) {
            board[row]=column; //no conflicts so place queen
            if(row==n) //a complete solution
                solutions++;
            else //try queen with next position
                queen(row+1,n);
        }
    }
}
What can I do to optimize this code?
To parallelize your code with OpenMP I did the following:
I added #pragma omp parallel and #pragma omp single before the first call of the queen function.
To avoid a data race I added #pragma omp atomic before solutions++; (note that I also tried using a reduction, but in that case the tasks have to be synchronized, which makes it the slower option).
To make the recursive calls run in parallel I had to create a (private) copy of the board (and pass its pointer to the queen function). To create tasks I used #pragma omp task if(row<4). Note that if(row<4) is used to limit the number of tasks created.
The measured speedup is about 6x on 8 cores.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define BOARD_SIZE 15

int solutions;

int main() {
    double start_time, end_time, result_time;
    int board[BOARD_SIZE+1];
    void queen(int row,int n, int*);
    start_time = omp_get_wtime();
    #pragma omp parallel
    #pragma omp single
    queen(1, BOARD_SIZE, board);
    end_time = omp_get_wtime();
    result_time = end_time - start_time;
    printf("Time = %.3f seconds\n", result_time);
    printf("Number of solutions: %d\n", solutions);
    return 0;
}

/*function to check conflicts
  If there is no conflict for the desired position it returns 1, otherwise 0*/
int place(int row, int column, int* board) {
    for(int i=1;i<=row-1;++i) {
        //checking column and diagonal conflicts
        if(board[i]==column)
            return 0;
        else
            if(abs(board[i]-column)==abs(i-row))
                return 0;
    }
    return 1; //no conflicts
}

//function to check for proper positioning of queen
void queen(int row,int n, int* board) {
    for(int column=1;column<=n;++column) {
        if(place(row,column, board)) {
            if(row==n) //a complete solution
            {
                #pragma omp atomic
                solutions++;
            }
            else //try queen with next position
            {
                int local_board[BOARD_SIZE+1];
                for(int i=1;i<row;i++) local_board[i]=board[i]; //copy board to local_board
                local_board[row]=column; //no conflicts so place queen
                #pragma omp task if(row<4)
                queen(row+1,n, local_board);
            }
        }
    }
}
Please indicate if you need more explanation.
I see no omp directives in your code.
As Jerome suggested, the way to do this is with tasks.
I coded this a while ago. Download my textbook and see the chapter with OMP examples (or the HTML version). The solution uses taskloop and omp cancel.
Unfortunately, you get no speedup: the sequential code finds the first solution very fast, and going parallel doesn't make it faster.
Oh, sorry, this uses C++. I'm sure the translation to C is simple enough. But C++ is a better language these days.
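For a rough idea of the pattern in C (my sketch, not the book's code: try_column is a hypothetical helper that completes a board starting from first-row column c, and cancellation must be enabled at run time with OMP_CANCELLATION=true):
int n = 15, found = 0;
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (int c = 1; c <= n; c++) {
    if (try_column(c)) {                       /* hypothetical search helper */
        #pragma omp atomic write
        found = 1;
        #pragma omp cancel taskgroup           /* taskloop tasks run in an implicit taskgroup */
    }
    #pragma omp cancellation point taskgroup   /* lets long-running tasks notice the cancel */
}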
I have been trying to parallelize the computation of the sum of a series, distributing blocks of terms to the processors (block allocation).
In this program I generate an arithmetic series and want to pass the array as a shared variable in the pragma parallel directive,
but the line #pragma omp parallel num_threads(comm_sz, number, BLOCK_LOW, BLOCK_HIGH, a[n], first, difference, global_sum1) produces the error below:
expected ')' before '[' token
I am new to OpenMP in C. I have written the code below and am facing the error above. I researched on Google but was unable to find a solution.
Kindly help me with how to declare an array as a shared variable in the parallel pragma. I am attaching the code below.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char *argv[])
{
    int rank, comm_sz;
    int number, i, first, difference, global_sum1, global_sum, nprocs, step, local_sum1, local_n;
    int* a;
    int BLOCK_LOW, BLOCK_HIGH;
    double t0, t1;

    comm_sz = atoi(argv[1]);
    first = atoi(argv[2]);
    difference = atoi(argv[3]);
    number = atoi(argv[4]);
    omp_set_num_threads (comm_sz);
    rank = omp_get_thread_num();
    a = (int*) malloc (n*sizeof(int));
    printf("comm_sz=%d, first=%d, difference=%d, number of terms=%d\n",comm_sz, first, difference, number);
    for(i=1; i <= number; i++){
        a[i-1] = first + (i-1)*difference;
        printf("a[%d]=%d\n",i-1,a[i]);
    }
    for(i=0; i < number; i++){
        printf("a[%d]=%d\n",i,a[i]);
    }
    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(comm_sz, number, BLOCK_LOW, BLOCK_HIGH, a[n], first, difference, global_sum1)
    {
        BLOCK_LOW = (rank * number)/comm_sz;
        BLOCK_HIGH = ((rank+1) * number)/comm_sz;
        #pragma omp parallel while private(i, local_sum1)
        //int local_sum1 = 0;
        i=BLOCK_LOW;
        while( i < BLOCK_HIGH )
        {
            printf("%d, %d\n",BLOCK_LOW,BLOCK_HIGH);
            local_sum1 = local_sum1 + a[i];
            i++;
        }
        //global_sum1 = global_sum1 + local_sum1;
        #pragma omp while reduction(+:global_sum1)
        i=0;
        for (i < comm_sz) {
            global_sum1 = global_sum1 + local_sum1;
            i++;
        }
    }
    step = 2*first + (n-1)*difference;
    sum = 0.5*n*step;
    printf("sum is %d\n", global_sum );
    t1 = omp_get_wtime();
    printf("Estimate of pi: %7.5f\n", global_sum1);
    printf("Time: %7.2f\n", t1-t0);
}
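For reference, the shared clause (like num_threads) takes plain variable names, never element expressions such as a[n], and the block sum itself is most naturally written as a reduction. A minimal corrected sketch of what the code appears to be attempting (my reconstruction, not a posted answer):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int comm_sz    = atoi(argv[1]);
    int first      = atoi(argv[2]);
    int difference = atoi(argv[3]);
    int number     = atoi(argv[4]);
    int *a = malloc(number * sizeof(int));
    int global_sum = 0;

    for (int i = 0; i < number; i++)
        a[i] = first + i*difference;            /* generate the series */

    /* shared() lists plain variable names; pointers like a are shared
       by default anyway, and the sum is expressed as a reduction. */
    #pragma omp parallel num_threads(comm_sz) shared(a, number) reduction(+:global_sum)
    {
        int rank = omp_get_thread_num();        /* must be called inside the region */
        int block_low  = (rank * number) / comm_sz;
        int block_high = ((rank + 1) * number) / comm_sz;
        for (int i = block_low; i < block_high; i++)
            global_sum += a[i];
    }

    printf("sum is %d\n", global_sum);
    free(a);
    return 0;
}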
I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>

#define NUM_THREADS 10
#define MAX 1000

//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
    omp_set_num_threads(10);//set number of threads to be used in the parallel loop
    unsigned int NUMS[1000] = { 0 };
    int j = 0;
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();//get thread ID
        int i;
        for(i = ID + 1;i < MAX; i+= NUM_THREADS)
        {
            if( i % 5 == 0 || i % 3 == 0)
            {
                NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
            }
        }
    }
    int i = 0;
    unsigned int total;
    for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
    printf("Total : %d\n", total);
    return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread variables to get around that.
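(In OpenMP terms, the smallest change that makes the shared increment itself safe is an atomic capture applied to the NUMS[j++] = i; line above, though, as the next answer points out, paying for an atomic on every iteration is not efficient:)
int idx;
#pragma omp atomic capture
idx = j++;        /* fetch the old value of j and increment it in one atomic step */
NUMS[idx] = i;    /* each thread now writes to a distinct slot */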
In your code j is a shared induction variable. You can't rely on shared induction variables being efficient with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that avoids induction variables (for example wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
#pragma omp parallel
{
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    #pragma omp for schedule(static) ordered
    for(i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        {
            memcpy(&NUMS[j], NUMS_local, sizeof *NUMS * k);
            j += k;
        }
    }
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could emulate in C using, for example, realloc, but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared induction variables, based on wheel factorization:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
    for(int k=0; k<7; k++) {
        NUMS[7*i + k] = 15*i + wheel[k];
        total += NUMS[7*i + k];
    }
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
    if(i%5==0 || i%3==0) {
        NUMS[j++] = i;
        total += i;
    }
}
Edit: It's possible to do this without a critical section (i.e. without the ordered clause). This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
    int i;
    int nthreads = omp_get_num_threads();
    int ithread  = omp_get_thread_num();
    #pragma omp single
    {
        prefix = malloc(sizeof *prefix * (nthreads+1));
        prefix[0] = 0;
    }
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    prefix[ithread+1] = k;
    #pragma omp barrier
    #pragma omp single
    {
        for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
        NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
        j = prefix[nthreads];
    }
    memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS * k);
}
free(prefix);
This is a typical thread-synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j in your case). That would be a mutex, a semaphore or an event object, depending on the operating system you're working on. But whatever your development environment, the fundamental flow to provide atomicity looks like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on Windows, the environment provides some special synchronization mechanisms (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all those synchronization mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra in the early 1960s.
Here are some basic examples:
Linux
#include <pthread.h>

pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;

int main(int argc, char* argv[])
{
    // ...
    pthread_mutex_lock(&g_mutexObject);
    ++j; // incrementing j atomically
    pthread_mutex_unlock(&g_mutexObject);
    // ...
    pthread_mutex_destroy(&g_mutexObject);
    // ...
    exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>

CRITICAL_SECTION g_csObject;

int main(void)
{
    // ...
    InitializeCriticalSection(&g_csObject);
    // ...
    EnterCriticalSection(&g_csObject);
    ++j; // incrementing j atomically
    LeaveCriticalSection(&g_csObject);
    // ...
    DeleteCriticalSection(&g_csObject);
    // ...
    exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>

LONG volatile g_j; // our little j must be volatile in here now

int main(void)
{
    // ...
    InterlockedIncrement(&g_j); // incrementing j atomically
    // ...
    exit(EXIT_SUCCESS);
}
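Since the question is about OpenMP, it is worth adding that the same atomicity is available portably inside OpenMP itself, without reaching for OS primitives; a minimal sketch:
#include <stdio.h>
#include <omp.h>

int main(void) {
    int j = 0;
    #pragma omp parallel num_threads(10)
    {
        #pragma omp atomic   /* OpenMP's portable equivalent of the locks above */
        ++j;
    }
    printf("j = %d\n", j);   /* always prints 10 */
    return 0;
}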
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the latest value, and you end up overwriting data.
OpenMP has a built-in way for the threads of a loop to each compute a partial sum and combine the results when they finish: the reduction clause. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
    sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is read the result from "sum"; this is from some old code of mine that computes a summation.
The other option is the dirty one: somehow make the threads wait and fall into order using a call into the OS. This is easier than it looks. The following would be such a solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
    printf("asdasdasdasdasdasdasdas");
    if( i % 5 == 0 || i % 3 == 0)
    {
        NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
    }
}
But I recommend you read through the OpenMP options fully.
I'm trying to do a parallel for inside a while, something like this:
while(!End){
    for(...;...;...) // the parallel for
        ...
    // serial code
}
The for loop is the only parallel section of the while loop. If I do this, I have a lot of overhead:
cycles = 0;
while(!End){ // 1000 million iterations approx
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for with 256 iterations approx
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
Each iteration of the for loop is independent of the others.
There are dependencies between the serial code and the parallel code.
So normally one doesn't have to worry too much about putting parallel regions into loops: modern OpenMP implementations are pretty efficient about things like reusing thread teams, and as long as there's lots of work in the loop you're fine. But here, with an outer loop count of ~1e9 and an inner loop count of ~256, and very little work being done per iteration, the overhead is likely comparable to or worse than the amount of work being done, and performance will suffer.
So there will be a noticeable difference between this:
cycles = 0;
while(!End){ // 1000 million iterations approx
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for with 256 iterations approx
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
and this:
cycles = 0;
#pragma omp parallel
while(!End){ // 1000 million iterations approx
    #pragma omp for
    for(i=0;i<N;i++) // the parallel for with 256 iterations approx
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    #pragma omp single
    {
        ++cycles;
    }
}
But really, that scan across the time array every iteration is unfortunately both (a) slow and (b) not enough work to keep multiple cores busy; it's memory intensive. With more than a couple of threads you will actually see worse performance than serial, even without overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why don't you preprocess the time array so you can just check when the next task is ready to update:
#include <stdio.h>
#include <stdlib.h>

struct tasktime_t {
    long int time;
    int task;
};

int stime_compare(const void *a, const void *b) {
    return ((struct tasktime_t *)a)->time - ((struct tasktime_t *)b)->time;
}

int main(int argc, char **argv) {
    const int n=256;
    const long int niters = 100000000l;
    long int time[n];
    int wbusy[n];
    int wfinished[n];

    for (int i=0; i<n; i++) {
        time[i] = rand() % niters;
        wbusy[i] = 1;
        wfinished[i] = 0;
    }

    struct tasktime_t stimes[n];
    for (int i=0; i<n; i++) {
        stimes[i].time = time[i];
        stimes[i].task = i;
    }
    qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);

    long int cycles = 0;
    int next = 0;
    while(cycles < niters){
        while ( (next < n) && (stimes[next].time == cycles) ) {
            int i = stimes[next].task;
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
            next++;
        }
        ++cycles;
    }
    return 0;
}
This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if the serial code constantly updates the time/wbusy/wfinished arrays, you can keep track of the completion times using a priority queue, with each update taking O(log N) time instead of an O(N) scan every iteration.
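To make the priority-queue idea concrete, here is a minimal binary min-heap keyed on completion time (my illustrative sketch, not code from the answer above; the caller is assumed to check heap_size before popping):
struct event_t { long int time; int task; };   /* completion time of one task */

static struct event_t heap[256];
static int heap_size = 0;

/* Insert an event in O(log N): append, then sift up while smaller than parent. */
void heap_push(struct event_t e) {
    int i = heap_size++;
    heap[i] = e;
    while (i > 0 && heap[(i-1)/2].time > heap[i].time) {
        struct event_t tmp = heap[i]; heap[i] = heap[(i-1)/2]; heap[(i-1)/2] = tmp;
        i = (i-1)/2;
    }
}

/* Remove the earliest event in O(log N): move the last element to the root, sift down. */
struct event_t heap_pop(void) {
    struct event_t min = heap[0];
    heap[0] = heap[--heap_size];
    int i = 0;
    for (;;) {
        int l = 2*i+1, r = 2*i+2, s = i;
        if (l < heap_size && heap[l].time < heap[s].time) s = l;
        if (r < heap_size && heap[r].time < heap[s].time) s = r;
        if (s == i) break;
        struct event_t tmp = heap[i]; heap[i] = heap[s]; heap[s] = tmp;
    }
    return min;
}
The main loop would then pop events while heap[0].time == cycles, and the serial code would push a new event whenever it assigns a task a new completion time.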
I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values - functions serial, parallel -,
the second takes values from an input array and writes the processed values to an output array - functions serial_a, parallel_a.
My questions are:
Why do the results of the first test differ, i.e. the results of serial and parallel?
Surprisingly the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this is emulated by sin(a[i]) / cos(a[n-i]).
Are there some easy rules for determining which variables to declare as private, shared and reduction?
In both cases int i is declared outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared on the parallel directive. Private and shared variables can also be declared there. But when you do a reduction you should not declare the variable being reduced as shared; the reduction clause takes care of that.
To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not written to, then normally you want it shared. In your case the variable x does not change, so you should declare it shared. The variable i, however, does change, so normally you would declare it private. So to fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
    #pragma omp for
    for(i = 0; i<x; i++){
        sum += sin(x*i) / cos(x*i+1.0);
    }
}
However, OpenMP automatically makes the iterator of a parallel for region private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
    sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
    y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful about assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        //
    }
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
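Spelled out as a minimal sketch:
int i, j;
#pragma omp parallel for private(j)  /* i is privatized automatically; j must be requested */
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        // ...
    }
}
In C99 and later you can instead declare the inner iterator in the loop itself, for(int j=0; j<m; j++), which makes it private automatically and avoids the clause altogether.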