Parallelize a Function using OpenMP in C

I wrote a program which takes a matrix size and a number of threads as input and then generates a random binary matrix of 0's and 1's. I then need to find clusters of 1's and give each cluster a unique number.
I am getting the output correctly, but I am having a problem parallelizing the function.
My professor asked me to break the matrix rows into "thread_cnt" parts. E.g.: if the thread count is 4 and the matrix size is 8, it breaks into 4 submatrices having 2 rows each.
The code is as follows:
//Inputted Matrix size n and generated a binary matrix rand1[][]
//
begin = omp_get_wtime();
width = n/thread_cnt;
#pragma omp parallel num_threads(thread_cnt) for
for(d=0; d<n; d=d++)
{
    b = d + width;
    Mat(d, b);
    d = (d-1) + width;
}
void Mat(int w, int x)
{
    //printf("\n Entered function\n");
    for(i=w; i<x; i++)
    {
        for(j=0; j<n; j++)
        {
            //printf("\n Entered the loop also\n");
            //printf("i = %d, j = %d\n", i, j);
            if(rand1[i][j]==1)
            {
                rand1[i][j] = q;
                adj(i, j, q);
                q++;
            }
        }
    }
}
void adj(int p, int e, int m) //Function to find adjacent 1's
{
    //printf("\n Entered adj function\n");
    //printf("\n p = %d e = %d m = %d\n", p, e, m);
    if (rand1[p][e+1] == 1)
    {
        //printf("Test1\n");
        rand1[p][e+1] = m;
        adj(p, e+1, m);
    }
    if (rand1[p+1][e] == 1)
    {
        rand1[p+1][e] = m;
        //printf("Test2\n");
        adj(p+1, e, m);
    }
    if (e-1 >= 0 && rand1[p][e-1] == 1) //bounds check must come before the array access
    {
        rand1[p][e-1] = m;
        //printf("Test3\n");
        adj(p, e-1, m);
    }
    if (p-1 >= 0 && rand1[p-1][e] == 1)
    {
        rand1[p-1][e] = m;
        //printf("Test4\n");
        adj(p-1, e, m);
    }
}
The code gives me correct output, but the time increases instead of decreasing when I increase the number of threads: for 1 thread I get 0.000076 and for 2 threads I get 0.000136.
It looks like it's iterating instead of parallelizing.
Can anyone help me out with this?
PS: I need to show both the serial time and the parallel time, and show that parallelization gives a performance increase.

The reason the time increases as the thread count increases is that each thread executes the entire first loop. You are not handing the submatrices to the threads; instead, each thread operates on every submatrix, i.e. on the whole matrix.
To make the threads work on separate parts of the matrix, you should use their unique thread ID, which you can get with this line:
tid = omp_get_thread_num();
Then make a simple mapping: if the thread ID is i, operate on the (i+1)-th submatrix, where 0 <= i <= nthreads-1.
This could be coded as:
Mat(i*width, i*width+width)
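Putting this together, the corrected region could look like the following minimal sketch (it assumes n is divisible by thread_cnt; note that the shared label counter q, and clusters that straddle band boundaries, would still need separate handling, e.g. a critical section around the q update):
width = n / thread_cnt;
#pragma omp parallel num_threads(thread_cnt)
{
    int tid = omp_get_thread_num();    /* unique thread ID: 0 .. thread_cnt-1 */
    Mat(tid*width, tid*width + width); /* each thread labels only its own band of rows */
}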

Related

Why am I getting huge slowdown when parallelising with OpenMP and using static scheduling?

I'm working to parallelise a disease spread model in C using OpenMP but am only seeing a massive (order of magnitude) slowdown. I'll point out at the outset that I am a complete novice with both OpenMP and C.
The code loops over every point in the simulation and checks its status (susceptible, infected, recovered) and for each status, follows an algorithm to determine its status at the next time step.
I'll give the loop for infected points for illustrative purposes. Lpoints is a list of indices for points in the simulation, Nneigh gives the number of neighbours each point has and Lneigh gives the indices of these neighbours.
for (ipoint=0; ipoint<Nland; ipoint++) { //loop over all points
    if (Lpoints_old[ipoint]==I) { //act on infected points
        /* Probability Prec of infected population recovering */
        xi = genrand();
        if (xi<Pvac) { /* This point recovers (I->R) */
            Lpoints[ipoint] = R;
            /* printf("Point %d gained immunity\n",ipoint); */
        }
        else {
            /* Probability of being blockaded by neighbours */
            nsn = 0;
            for (in=0; in<Nneigh[ipoint]; in++) { /*count susceptible neighbours (nsn)*/
                //if (npoint<0) printf("Bad npoint 1: %d in=%d\n",ipoint,in);
                //fflush(stdout);
                npoint = Lneigh[ipoint][in];
                if (Lpoints_old[npoint]==S) nsn++;
            }
            Prob = (double)nsn*Pblo;
            xi = genrand();
            if (xi<Prob) { /* The population point is blockaded (I->R)*/
                Lpoints[ipoint] = R;
            }
            else { /* Still infected */
                Lpoints[ipoint] = I;
            }
        } /*else*/
    } /*infected*/
} /*for*/
I tried to parallelise by adding #pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob) before the for loop. (I tried using default(none), as is generally recommended, but it wouldn't compile.) On the small grid I am using for testing, the original serial code runs in about 5 seconds and the OpenMP version runs in around 50.
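For reference, the parallelised loop as described would look something like this (a sketch assembled from the description above, with the body unchanged):
#pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob)
for (ipoint=0; ipoint<Nland; ipoint++) {
    /* ... body exactly as in the serial loop above ... */
    /* note: if genrand() keeps shared global state, these concurrent calls are unsynchronised */
}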
I have searched for ages online, and every similar problem seems to be the result of false sharing and has been solved by using static scheduling with a chunk size divisible by 8. I tried varying the chunk size to no effect whatsoever, only getting the timings back to the original order when the chunk size surpassed the size of the problem (i.e. back to running linearly on one thread).
The slowdown doesn't seem any better when the problem is scaled more appropriately either, as far as I can tell. I have no idea why this isn't working or what's going wrong. Any help greatly appreciated.

mergesort using 2 thread ID's

Hi, I want to apply merge sort to an array using threads. I need to use two thread IDs in order to sort recursively.
Here is my code:
//this function divides the array recursively and sends it to merge sort
void Recursive_Divition(int a[], int l, int h, int Degree_of_parallelism, int count)
{
    struct thread_data thread_data_array[Degree_of_parallelism];
    int i, len = (h-l+1);
    // Using insertion sort for small sized array
    //a stopping condition: if the array size equals twice the degree of parallelism, send the halves to be sorted
    if (len == 2*Degree_of_parallelism)
    {
        pthread_t *thread = (pthread_t *)malloc(Degree_of_parallelism*sizeof(pthread_t)); //creating a pointer to a thread array
        thread_data_array[count].low = l;    //the arguments for the merge sort
        thread_data_array[count].high = h/2;
        pthread_create(&thread[count], NULL, threaded_merge_sort, (void *)&thread_data_array[count]); //create a thread and send it to the function
        count++;
        thread_data_array[count].low = h/2+1; //the arguments for the other thread
        thread_data_array[count].high = h;
        pthread_create(&thread[count], NULL, threaded_merge_sort, (void *)&thread_data_array[count]);
        pthread_join(thread[count-1], NULL);
        pthread_join(thread[count], NULL);
        free(thread);
    }
    if (len > Degree_of_parallelism) {
        Recursive_Divition(a, l, l+len/2-1, Degree_of_parallelism, count); //for the first half
        Recursive_Divition(a, l+len/2, h, Degree_of_parallelism, count);   //for the second half
        merge(a, l, l+len/2-1, h);
    }
}
And here is the threaded merge sort:
void *threaded_merge_sort(void *param)
{
    printf("Create a thread %u\n", (unsigned int)pthread_self()); //this function calls the merge sort function
    struct thread_data *my_data;
    my_data = (struct thread_data *) param;
    int l = my_data->low;
    int h = my_data->high;
    printf("low is : %d high is : %d\n", l, h);
    mergeSort(array, l, h);
    pthread_exit(NULL);
}
I got the following output. The problem appears to be in the indices, and I don't know how to fix it:
Amount of numbers that sort: 16
Degree of parallelism: 8
Array Before Sort: 1,5,6,33,77,12,90,87,0,10,34,2,741,453,19,132
Create a thread 3158492928
low is : 0 high is : 1
Create a thread 3150100224
low is : 2 high is : 3
Create a thread 3141707520
low is : 4 high is : 3
Create a thread 3133314816
low is : 4 high is : 7
Create a thread 3122734848
low is : 8 high is : 5
Create a thread 3114342144
low is : 6 high is : 11
Create a thread 3105949440
Create a thread 3097556736
low is : 8 high is : 15
low is : 12 high is : 7
Array After Sort: 1,5,6,10,19,33,12,34,0,2,77,87,90,132,453,741
The indexing problem is here:
thread_data_array[count].low = l;    //the arguments for the merge sort
thread_data_array[count].high = h/2;
[...]
thread_data_array[count].low = h/2+1; //the arguments for the other thread
thread_data_array[count].high = h;
That does not correctly split the (sub)array into halves unless l is 0. For example, if h is 16 then your array break is always at 8, even if that's less than l.
Instead, you appear to want
thread_data_array[count].low = l;
thread_data_array[count].high = (l + h) / 2;
and
thread_data_array[count].low = (l + h) / 2 + 1;
thread_data_array[count].high = h;
Also, I suspect you want an else before the second if in Recursive_Divition(), and you may also need to add a final else block.
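In sketch form, the suggested control flow would be something like this (illustrative only; the final else shows one possible base case, a direct serial sort of the small piece):
if (len == 2*Degree_of_parallelism)
{
    /* spawn the two threads as above, splitting at (l + h) / 2 */
}
else if (len > Degree_of_parallelism)
{
    Recursive_Divition(a, l, l+len/2-1, Degree_of_parallelism, count); //first half
    Recursive_Divition(a, l+len/2, h, Degree_of_parallelism, count);   //second half
    merge(a, l, l+len/2-1, h);
}
else
{
    mergeSort(a, l, h); //base case: sort the remaining piece directly
}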

Code Parallelisation using MPI in C

I am performing code parallelization using MPI to evaluate a cost function.
I am dividing a population of 50,000 points among 8 processors.
I am trying to parallelize the following code but am struggling with it:
//mpiWorldSize is the number of processors
//=====================================
for (int k=1; k<mpiWorldSize; k++)
{
    MPI_Send(params[1][mpiWorldRank*over+k], 7, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
// evaluate all the new costs
//=========================
for (int j=1; j<mpiWorldSize; j++)
{
    MPI_Recv(params[1][mpiWorldRank*over+k], 7, MPI_INT, j, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
// memory allocation
//=========================
SecNewCostValues = (float*) malloc(noOfDataPerProcessor/bufferLength);
//loop through the number of data items per processor
for (i = 0; i < over; i++)
{
    if (mpiWorldRank != 0)
    {
        SecNewCostValues[i] = cost(params[1][mpiWorldRank*noOfDataPerPreocessor+i]);
        newCostValues[over] = cost(params[1][i]); //change the i part to rank*nodpp+i
        printf("hello from rank %d: %s\n", mpiWorldRank, procName);
    }
}
I can't send and receive the data from any processor other than 0.
I would appreciate any help. Thanks.
MPI uses the Single Program Multiple Data (SPMD) message-passing programming model; that is, all MPI processes execute the same program, and you need to use conditionals to decide which process will execute which part of the code. The overall structure of your code could be as follows (assuming the master with rank 0 distributes work and the workers receive it):
if (myrank == 0) { // master
    for (int k = 1; k < mpiWorldSize; k++) { // send a chunk to each worker
        MPI_Send(...);
    }
}
else { // worker
    MPI_Recv(...); // receive work
}
Analogously, the master would collect the results. Check out the documentation on the MPI_Scatter() and MPI_Gather() collective communication functions, which seem relevant here.
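For illustration, the same exchange written with those collectives might look like this minimal sketch (the buffer names sendbuf, chunk, local_costs, and all_costs are hypothetical, and each rank is assumed to handle "over" parameter sets of 7 ints):
// rank 0 scatters "over" parameter sets of 7 ints to every rank
MPI_Scatter(sendbuf, over*7, MPI_INT, chunk, over*7, MPI_INT, 0, MPI_COMM_WORLD);
// each rank evaluates the cost of its own chunk
for (int i = 0; i < over; i++)
    local_costs[i] = cost(&chunk[i*7]);
// rank 0 gathers all computed costs
MPI_Gather(local_costs, over, MPI_FLOAT, all_costs, over, MPI_FLOAT, 0, MPI_COMM_WORLD);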

How do I access and print the complete vector distributed among MPI workers?

How do I access a global vector from an individual thread in MPI?
I'm using a library - specifically, an ODE solver library - called CVODE (part of SUNDIALS). The library works with MPI, so that multiple threads are running in parallel. They are all running the same code. Each thread sends the thread "next to" it a piece of data. But I want one of the threads (rank=0) to print out the state of the data at some points.
The library includes functions so that each thread can access their own data (the local vector). But there is no method to access the global vector.
I need to output the values of all of the equations at specific times. To do so, I would need access to the global vector. Does anyone know how to get at all of the data in an MPI vector (using CVODE, if possible)?
For example, here is the code that each thread runs:
for (iout=1, tout=T1; iout <= NOUT; iout++, tout += DTOUT) {
    flag = CVode(cvode_mem, tout, u, &t, CV_NORMAL);
    if (check_flag(&flag, "CVode", 1, my_pe)) break;
    if (my_pe == 0) PrintData(t, u);
}
...
static void PrintData(realtype t, N_Vector u) {
    /* I want to print data from all threads in here */
}
In function f (the function I'm solving), I pass data back and forth using MPI_Send and MPI_Recv. But I can't really do that in PrintData because the other processes have run ahead. Also, I don't want to add messaging overhead. I want to access the global vector in PrintData, and then just print out what's needed. Is it possible?
Edit: While waiting for a better answer, I programmed each thread to pass its data back to the 0th thread. I don't think that adds too much messaging overhead, but I'd still like to hear from you experts whether there's a better method (I'm sure there aren't any worse ones! :D).
Edit 2: Although angainor's solution is surely superior, I stuck with the one I had created. For future reference for anyone who has the same question, here are the basics of how I did it:
/* Is called by all threads */
static void PrintData(realtype t, N_Vector u, UserData data) {
    /* ... declarations and such ... */
    for (n=1; n<=my_length; n++) {
        mass_num = my_base + n;
        z[mass_num - 1] = udata[n-1];
        z[mass_num - 1 + N] = udata[n - 1 + my_length];
    }
    if (my_pe != 0) {
        MPI_Send(&z, 2*N, PVEC_REAL_MPI_TYPE, 0, my_pe, comm);
    } else {
        for (i=1; i<npes; i++) {
            MPI_Recv(&z1, 2*N, PVEC_REAL_MPI_TYPE, i, i, comm, &status);
            for (n=0; n<2*N; n++)
                z[n] = z[n] + z1[n];
        }
        /* ... now I can print it out however I like ... */
    }
    return;
}
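As an aside, the receive-and-add loop above computes an element-wise sum, which is exactly what a reduction does, so it could likely be replaced with a single collective call (a sketch, assuming z is zero-initialized outside each rank's local range and zsum is a result buffer on rank 0):
/* element-wise sum of every rank's z into zsum on rank 0 */
MPI_Reduce(z, zsum, 2*N, PVEC_REAL_MPI_TYPE, MPI_SUM, 0, comm);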
When using MPI, the individual threads do not have access to a 'global' vector. They are not threads; they are processes that can run on different physical computers and therefore cannot have direct access to global data.
To do what you want, you can either send the vector to one of the MPI processes (you did that) and print it there, or print the local worker parts in sequence. Use a function like this:
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
    int i, j;
    int curthr = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    while (curthr != nthr) {
        if (curthr == thrid) {
            printf("thread %i writing\n", thrid);
            for (i=0; i<vec_dim; i++) printf("%d\n", v[i]);
            fflush(stdout);
            curthr++;
            MPI_Bcast(&curthr, 1, MPI_INT, thrid, MPI_COMM_WORLD);
        } else {
            MPI_Bcast(&curthr, 1, MPI_INT, curthr, MPI_COMM_WORLD);
        }
    }
}
All MPI processes should call it at the same time, since there is a barrier and a broadcast inside. Essentially, the procedure makes sure that all the MPI processes print their part of the vector in order, starting from rank 0. The data does not get mixed up, since only one process writes at any given time.
In the example above, the broadcast is used because it gives more flexibility on the order in which the processes print their results - the process that is currently writing can decide who comes next. You could also skip the broadcast and use only a barrier:
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
    int i, j;
    int curthr = 0;
    while (curthr != nthr) {
        if (curthr == thrid) {
            printf("thread %i writing\n", thrid);
            for (i=0; i<vec_dim; i++) printf("%d\n", v[i]);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        curthr++;
    }
}
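A typical call site would look something like this (a sketch; local_n and local_v stand for whatever name and length your local vector part actually has):
int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_write_ivector(rank, nprocs, local_n, local_v); /* every process must make this call */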

C array is displaying garbage data (memory problems?)

I'm making a driver for an 8x8 LED matrix that I'm driving from a computer's parallel port. It's meant to be a clock, inspired by a design I saw on Tokyoflash.
Part of the driver is an array of 3x5 number "sprites" that are drawn to the matrix. A coordinate of the matrix is assigned to a coordinate of the sprite, and so forth, until the entire sprite is drawn on it. This process is repeated for the other digit with an offset. I have verified that I draw the sprites correctly and that the matrix is blank when it is written to. However, when I draw a number on the matrix I get errant 1s at Numpad6 for the left digit and at Numpad1 for the right (example with the left digit not drawn).
I have a week of experience in C and this is baffling me.
Here is the driver in full if you want to compile it yourself. It is nowhere near finished.
//8x8 LED MATRIX DRIVER VER 0.1 APR062009
//CLOCK
//
// 01234567
// 0 BXXXXXXH B: Binary Mode Indicator
// 1 DXXXXXXM D: Decimal Mode Indicator
// 2 NNNNNNNN H: Hour Centric Display
// 3 LLLNNRRR M: Minute Centric Display
// 4 LNLNNRNR X: Secondary Information
// 5 LLLNNRRR L: Left Digit
// 6 LNLNNRNR R: Right digit
// 7 LLLNNRRR N: Not Used
#include <stdio.h>
#include <unistd.h>
//#include <math.h>
#include <time.h>
#include </usr/include/sys/io.h>
#define BASEPORT 0x378
int main()
{
    //Increasing array parameters seems to reduce glitching [best 10 5 3]
    int Dig[10][5][3] = {0}; //ALPHANUMERIC ARRAY [NUMBER (0..9)][Y(0..4)][X(0..2)]
    int Mat[7][7] = {0};     //[ROW][COL], Top L corner = [0][0]
    int Aux1[7] = {0};       //Topmost Row
    int Aux2[7] = {0};       //Second to Topmost Row
    int Clk;  //Clock
    int Wait; //Delay; meant to eventually replace clock in its current state
    int C1;   //Counters
    int C2;
    int C3;
    int L;    //Left Digit
    int R;    //Right Digit
    //break string left undefined atm
    //ioperm (BASEPORT, 3, 1);
    //outb(0, BASEPORT);
    printf("Now running.\n");
    //Set Variables
    //3D DIGIT ARRAY [Num][Row][Col] (INITIALIZED BY INSTRUCTIONS)
    //Dig array is meant to be read only once initialized
    //3D arrays are unintuitive to declare so the numbers are
    //"drawn" instead.
    //Horizontals
    //Some entries in the loop may have the variable in the middle
    //coordinate instead of the 3rd and/or with a +2. This is to
    //incorporate the incomplete columns some numbers have (eg "2") and
    //saves coding additional loops.
    for(C1=0; C1<=2; C1++){
        Dig[0][0][C1]=1; Dig[0][4][C1]=1;
        Dig[2][0][C1]=1; Dig[2][2][C1]=1; Dig[2][4][C1]=1; Dig[2][C1][2]=1; Dig[2][C1+2][0]=1;
        Dig[3][0][C1]=1; Dig[3][2][C1]=1; Dig[3][4][C1]=1;
        Dig[4][2][C1]=1; Dig[4][C1][0]=1;
        Dig[5][0][C1]=1; Dig[5][2][C1]=1; Dig[5][4][C1]=1; Dig[5][C1][0]=1; Dig[5][C1+2][2]=1;
        Dig[6][0][C1]=1; Dig[6][2][C1]=1; Dig[6][4][C1]=1; Dig[6][C1+2][2]=1;
        Dig[7][0][C1]=1;
        Dig[8][0][C1]=1; Dig[8][2][C1]=1; Dig[8][4][C1]=1;
        Dig[9][0][C1]=1; Dig[9][2][C1]=1; Dig[9][4][C1]=1; Dig[9][C1][0]=1;
    }
    //Verticals
    for(C1=0; C1<=4; C1++){
        Dig[0][C1][0]=1; Dig[0][C1][2]=1;
        Dig[1][C1][2]=1;
        Dig[3][C1][2]=1;
        Dig[4][C1][2]=1;
        Dig[6][C1][0]=1;
        Dig[7][C1][2]=1;
        Dig[8][C1][0]=1; Dig[8][C1][2]=1;
        Dig[9][C1][2]=1;
    }
    Clk=10000;
    L=2; //Think about incorporating overflow protection for L,R
    R=4;
    //Print Left Digit to Matrix # (3, 0)
    for(C1=0; C1<=4; C1++){ //For some reason produces column of 1s at numpad 6
        for(C2=0; C2<=2; C2++){
            Mat[C1+3][C2]=Dig[L][C1][C2];
            printf("%d", Dig[L][C1][C2]); //Debug
        }
        printf(" %d %d %d\n", L, C1, C2); //Debug
    }
    //Print Right Digit to Matrix # (3, 5)
    for(C1=0; C1<=4; C1++){ //For some reason produces column of 1s at numpad 1
        for(C2=0; C2<=2; C2++){
            Mat[C1+3][C2+5]=Dig[R][C1][C2];
        }
    }
    //X Test Pattern
    //for(C1=0; C1<=7; C1++){
    //    Mat[C1][C1]=5;
    //    Mat[7-C1][C1]=5;
    //}
    usleep(Clk);
    //while(1){
    //Breakfree [NOT FUNCTIONAL]
    //Break_String=getch(); (Getch is not ANSI, need ncurses)
    //if(Break_String != -1){
    //    if(Break_String = 27){
    //        break;
    //    }
    //}
    //Terminal Display
    //for(C3=0; C3<=9; C3++){ //Debug Digit array [Successful, numbers draw correctly]
    //    for(C2=0; C2<=4; C2++){
    //        for(C1=0; C1<=2; C1++){
    //            printf("%d", Dig[C3][C2][C1]);
    //        }
    //        printf("\n");
    //    }
    //    printf("\n");
    //    usleep(1000000); //Debug
    //}
    usleep(3000000); //Debug
    for(C1=0; C1<=7; C1++){ //Prints to terminal every second, when looping
        for(C2=0; C2<=7; C2++){
            printf("%d", Mat[C1][C2]);
        }
        printf("\n");
    }
    printf("\n");
    //Hardware Display
    for(C1=0; C1<=29; C1++){ //30 Hz
        for(C3=0; C3<=7; C3++){ //COLUMN
            //printf("%d %d \n", C3, C1); //Loop Debug
            usleep(1000);
            //CLOCK GROUND TO GO HERE, OUT STATUS
            //for(C2=0; C2<=7; C2++){ //PX
            //    outb(Mat[C3][C2], BASEPORT);
            //}
        }
        usleep(4*Clk);
    }
    //}
    //ioperm(BASEPORT, 3, 0);
    exit(0);
}
Also, I had to make each of my sprite array bounds one bigger than they should be to make it work. I figure this is all some memory snafu, but I'm nowhere near proficient enough in C to know what to do.
I would greatly appreciate any help.
I need to look through it more but one problem off the bat is that you're driving an 8x8 LED matrix but using a 7x7 matrix to hold the data. Declare your matrix as:
int Mat[8][8];
Bryan, I think you are just missing the fundamental understanding of how array indices work in C.
When you declare
int array[N]
you access the elements in a range of
array[0] ... array[N-1]
which gives you a total of N elements.
For example:
int array[4]
gives you
array[0]
array[1]
array[2]
array[3]
for a total of 4 elements.
When looping over this array, this is the convention that's almost always used:
for(i = 0; i < 4; i++)
I think this issue is causing multiple problems in your code; if you go back over your arrays after understanding this, you'll be able to fix the problems.
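Applied to the driver above, that means the declarations should match the 0..7 loop bounds, for example (a sketch; sizes follow the comments in the original code):
int Mat[8][8]; //indices 0..7 in each dimension, matching the 8x8 loops
int Aux1[8];   //Topmost Row, indices 0..7
int Aux2[8];   //Second to Topmost Row
With these sizes, loops such as for(C1=0; C1<=7; C1++) stay in bounds.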
Bryan, I don't see the problem offhand, but what you're describing sounds like an array indexing issue. (Rule of thumb: any time you think there's something wrong with the computer causing your errors, you're mistaken.)
Two places new C programmers run into trouble with this are getting confused by 0-based indices -- an array of size 7 has indices 0..6 -- and not realizing that arrays are just laid out on top of one blob of memory, so an array that's [10][5][2] is really just one 100-cell piece of memory. If you make an indexing mistake, you can put things in what appear to be random places.
I'd go through the code and check what's where in smaller steps; what happens after one initialization, that sort of thing.
