Code Parallelisation using MPI in C

I am parallelizing code with MPI to evaluate a cost function.
I am dividing a population of 50,000 points among 8 processors.
I am trying to parallelize the following code but am struggling with it:
//mpiWorldSize is the number of processors
//=====================================
for (int k = 1; k < mpiWorldSize; k++)
{
    MPI_Send(params[1][mpiWorldRank*over + k], 7, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
// evaluate all the new costs
//=========================
for (int j = 1; j < mpiWorldSize; j++)
{
    MPI_Recv(params[1][mpiWorldRank*over + j], 7, MPI_INT, j, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
// memory allocation
//=========================
SecNewCostValues = (float *) malloc((noOfDataPerProcessor / bufferLength) * sizeof(float));
// loop through the number of data items per processor
for (i = 0; i < over; i++)
{
    if (mpiWorldRank != 0)
    {
        SecNewCostValues[i] = cost(params[1][mpiWorldRank*noOfDataPerProcessor + i]);
        newCostValues[over] = cost(params[1][i]); // change the i part to rank*nodpp+i
        printf("hello from rank %d: %s\n", mpiWorldRank, procName);
    }
}
I can't send and receive data from any processor other than 0.
I would appreciate any help.
Thanks.

MPI uses the Single Program Multiple Data (SPMD) message-passing programming model: all MPI processes execute the same program, and you need to use conditionals to decide which process executes which part of the code. The overall structure of your code could be as follows (assuming the master with rank 0 distributes work and the workers receive it):
if (myrank == 0) { // master
    for (int k = 1; k < mpiWorldSize; k++) { // send a chunk to each worker
        MPI_Send(...);
    }
}
else { // worker
    MPI_Recv(...); // receive work
}
Analogously, the master would collect the results. Check out the documentation on the MPI_Scatter() and MPI_Gather() collective communication functions, which seem relevant here; a sketch follows.
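For the cost evaluation in the question, the collective version could look roughly like the sketch below. This is a minimal, self-contained illustration, not the poster's actual code: cost() is a stand-in, and it assumes each point is described by 7 ints and that the population divides evenly among the ranks.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* stand-in for the question's cost function (assumption) */
static float cost(const int *p)
{
    float s = 0.f;
    for (int i = 0; i < 7; i++) s += (float)p[i];
    return s;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int totalPoints = 50000;        /* population size from the question */
    const int over = totalPoints / size;  /* points per process; assumes an even split */

    int *allParams = NULL;                /* significant only on the root */
    float *allCosts = NULL;
    if (rank == 0) {
        allParams = malloc((size_t)totalPoints * 7 * sizeof(int));
        allCosts  = malloc((size_t)totalPoints * sizeof(float));
        for (int i = 0; i < totalPoints * 7; i++) allParams[i] = i % 10;
    }

    int *myParams = malloc((size_t)over * 7 * sizeof(int));
    float *myCosts = malloc((size_t)over * sizeof(float));

    /* root hands each rank `over` points of 7 ints each */
    MPI_Scatter(allParams, over * 7, MPI_INT,
                myParams,  over * 7, MPI_INT, 0, MPI_COMM_WORLD);

    /* every rank (including the root) evaluates its own chunk */
    for (int i = 0; i < over; i++)
        myCosts[i] = cost(&myParams[i * 7]);

    /* root collects `over` costs from each rank, in rank order */
    MPI_Gather(myCosts, over, MPI_FLOAT,
               allCosts, over, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("first cost: %f\n", allCosts[0]);

    free(myParams); free(myCosts);
    if (rank == 0) { free(allParams); free(allCosts); }
    MPI_Finalize();
    return 0;
}
Note that with the collectives every rank, including 0, gets a chunk, so no explicit master/worker conditional is needed for the distribution itself.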

Related

MPI: Process helping another one when it finishes its work

Here's my problem:
Given an initial interval [a,b] to parallelize, and supposing that some processes are faster than others, I would like a process i to go "help" another process j when i finishes its chunk of work. Helping here means dividing the remaining chunk of process j (the one still working) equally between j and i. I have the idea of this algorithm but I don't know how to use the MPI communication functions (such as broadcast, allgather, send, recv) to implement it. I would like to use an array "arr" shared by all processes, whose size equals the number of processes. arr[i] = 1 means that process i has finished working; otherwise it is -1. I initialize all its elements to -1; when a process of rank "rank" finishes its work, it sets arr[rank] = 1 and waits for a working process to notice it and send it a new chunk. Here's a "pseudo-code" of what I would like to achieve:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nb_proc);
int i;
int a = 0, b = max; // initial interval [a,b] to parallelize
int j, arr[nb_proc];
for (j = 0; j < nb_proc; j++)
{
    arr[j] = -1; // initially, all processes are working
}
do
{
    i = a + rank;
    do
    {
        if (there's a free process) // scan "arr" for at least one element that equals 1
        {
            // let that process be the process of rank "r":
            arr[r] = -1;
            int mid = (b + i) / 2; // divide the rest of the work
            a = mid + rank - r;
            MPI_Send(a to process r);
            MPI_Send(b to process r);
            b = a - 1;
        }
        /* do the i-th iteration */
        i = i + nb_proc; // stride by the number of processes
    }
    while (i <= b);
    arr[rank] = 1; // finished working; start waiting for new work
}
while (there's at least one process still working; if so, get the new bounds a and b from it);
MPI_Finalize();
return 0;
My main problem concerns how to access the array, and how to be sure it is kept up to date on all processes, so that all processes see the same array at any instant t. I would appreciate your help a lot. Thanks in advance.
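One possible direction (an illustrative sketch, not from the original post): keep each process's status flag in an MPI one-sided communication window, so any process can poll another's flag without the target having to post a receive. All names here are placeholders.
#include <mpi.h>
#include <stdio.h>

/* -1 = working, 1 = finished and waiting for new work */
int main(int argc, char **argv)
{
    int rank, nb_proc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nb_proc);

    int my_status = -1;                  /* this process's slot of "arr" */
    MPI_Win win;
    MPI_Win_create(&my_status, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... do a chunk of the interval ... */

    /* when finished, advertise that we are free */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win);
    my_status = 1;
    MPI_Win_unlock(rank, win);

    /* a still-busy process scans the others for an idle helper */
    for (int r = 0; r < nb_proc; r++) {
        if (r == rank) continue;
        int status_r;
        MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);
        MPI_Get(&status_r, 1, MPI_INT, r, 0, 1, MPI_INT, win);
        MPI_Win_unlock(r, win);
        if (status_r == 1) {
            /* split the remaining [i,b] and MPI_Send() half to rank r */
        }
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
Before actually handing work to rank r, the busy process would also reset r's flag with MPI_Put under an exclusive lock, so that two processes don't pick the same helper.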

Informative "if" statement in "for" loop

Normally when I have a big for loop I print messages to tell me how far along my program is, for example:
for (i = 0; i < large_n; i++) {
    if (i % (large_n / 1000) == 0) {
        printf("We are at %ld \n", i);
    }
    // Do some other stuff
}
I was wondering whether this hurts performance much (a priori), and if so, whether there is a smarter alternative. Thanks in advance.
Maybe you can split the large loop so that the condition is only checked occasionally, but I don't know whether this will really save time; that depends more on your "other stuff".
int T = ...; // number of progress checks; make sure large_n % T == 0
for (int t = 0; t < T; ++t)
{
    for (int i = large_n/T * t; i < large_n/T * (t+1); ++i)
    {
        // other stuff
    }
    printf("We are at %ld \n", large_n/T * (t+1));
}
Regardless of what is in your loop, I wouldn't leave statements like printf in unless they're essential to the application/user, nor would I use what are effectively redundant if statements, for the same reason.
Both of these are examples of trace-level debugging. They're totally valid and in some cases very useful, but generally not in the end application. In this respect, the usual thing to do is to include them in the build only when you actually want the information they provide. In this case, you might do something like this:
#define DEBUG
for (i = 0; i < large_n; i++)
{
#ifdef DEBUG
    if (i % (large_n / 1000) == 0)
    {
        printf("We are at %ld \n", i);
    }
#endif
}
Regarding the performance cost of including these debug outputs all the time, it will totally depend on the system you're running, the efficiency of whatever "printing" statement you're using to output the data, the check/s you're performing and, of course, how often you're trying to perform output.
Your mod test probably doesn't hurt performance, but if you want a very quick test and are happy with power-of-two intervals, consider a bitwise AND test:
if ((i & 0xFF) == 0) {
    /* this gets printed every 256 iterations */
    ...
}
or
if ((i & 0xFFFF) == 0) {
    /* this gets printed every 65536 iterations */
    ...
}
By placing a print statement inside the for loop, you sacrifice some performance. Because the program needs to make a system call to write output to the screen every time the message is printed, it takes CPU time away from the program itself.
You can see the difference in performance between these two loops:
int i;
printf("Start Loop A\n");
for (i = 0; i < 100000; i++) {
    printf("%d ", i);
}
printf("Done with Loop A\n");

printf("Start Loop B\n");
for (i = 0; i < 100000; i++) {
    // Do Nothing
}
printf("Done with Loop B\n");
If the difference isn't noticeable, you can increase 100000 to a larger number (although too large a number would make the first loop take far too long to complete).
To cut down on the number of system calls your program needs to make, you could check a condition first, and only print if that condition is true.
For example, if you were counting up as in my example code, you could only print out every 100th number by using %:
int i;
for (i = 0; i < 100000; i++) {
    if (i % 100 == 0)
        printf("%d", i);
}
That will reduce the number of syscalls from ~100000 to ~1000, which in turn would increase the performance of the loop.
The problem is that an I/O operation like printf takes far more time than the surrounding computation. You can reduce the total time by accumulating the messages and printing them all at the end, as sketched below.
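A minimal, self-contained sketch of that idea (illustrative only; the buffer size and message format are assumptions):
#include <stdio.h>

int main(void)
{
    char buf[1 << 16];            /* collects all progress messages */
    size_t used = 0;
    long large_n = 1000000;

    for (long i = 0; i < large_n; i++) {
        /* do some other stuff */
        if (i % (large_n / 1000) == 0 && used < sizeof(buf) - 64)
            used += (size_t)snprintf(buf + used, sizeof(buf) - used,
                                     "We are at %ld\n", i);
    }
    fwrite(buf, 1, used, stdout); /* one write instead of ~1000 */
    return 0;
}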
Notation:
Tp = total time spent executing the progress statements.
Tn = total time spent doing the other normal stuff.
>> = Much greater than
If performance is your main criterion, you want Tn >> Tp. This strongly suggests profiling the code so that you can pick appropriate values. The routine printf() is considered slow (much slower than %) and is a blocking routine (that is, the thread that calls it may pend waiting for a resource it uses).
Personally, I like to abstract away the progress indicator. It can be a logging mechanism, a printf, a progress box, .... Heck, it may update a structure that is read by another thread/task/process.
id = progressRegister(<some predefined type of progress update mechanism>);
for (i = 0; i < large_n; i++) {
    progressUpdate(id, <string>, i, large_n);
    // Do some other stuff
}
progressUnregister(id);
Yes, there is some overhead in calling the routine 'progressUpdate()' on each iteration, but again, as long as Tn >> Tp, it usually is not that important.
Hope this helps.
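A minimal concrete version of that abstraction might look like the sketch below; the function-pointer design and all names are illustrative, not part of the original answer.
#include <stdio.h>

/* a progress sink: any function taking (current, total) */
typedef void (*progress_fn)(long current, long total);

static void progress_stderr(long current, long total)
{
    fprintf(stderr, "progress: %ld/%ld\n", current, total);
}

/* call the sink only every `stride` iterations to keep Tp small */
static void progressUpdate(progress_fn fn, long stride, long current, long total)
{
    if (fn && stride > 0 && current % stride == 0)
        fn(current, total);
}

int main(void)
{
    long large_n = 1000000;
    for (long i = 0; i < large_n; i++) {
        progressUpdate(progress_stderr, large_n / 10, i, large_n);
        /* do some other stuff */
    }
    return 0;
}
Swapping the sink function replaces stderr output with a log file, a GUI progress box, or a shared structure without touching the loop.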

How do I access and print the complete vector distributed among MPI workers?

How do I access a global vector from an individual thread in MPI?
I'm using a library - specifically, an ODE solver library - called CVODE (part of SUNDIALS). The library works with MPI, so that multiple threads are running in parallel. They are all running the same code. Each thread sends the thread "next to" it a piece of data. But I want one of the threads (rank=0) to print out the state of the data at some points.
The library includes functions so that each thread can access its own data (the local vector), but there is no method to access the global vector.
I need to output the values of all of the equations at specific times. To do so, I would need access to the global vector. Does anyone know how to get at all of the data in an MPI vector (using CVODE, if possible)?
For example, here is the code that each thread runs:
for (iout = 1, tout = T1; iout <= NOUT; iout++, tout += DTOUT) {
    flag = CVode(cvode_mem, tout, u, &t, CV_NORMAL);
    if (check_flag(&flag, "CVode", 1, my_pe)) break;
    if (my_pe == 0) PrintData(t, u);
}
...
static void PrintData(realtype t, N_Vector u) {
    // I want to print data from all threads in here
}
In function f (the function I'm solving), I pass data back and forth using MPI_Send and MPI_Recv. But I can't really do that in PrintData because the other processes have run ahead. Also, I don't want to add messaging overhead. I want to access the global vector in PrintData, and then just print out what's needed. Is it possible?
Edit: While waiting for a better answer, I programmed each thread to pass its data back to the 0th thread. I don't think that adds too much messaging overhead, but I'd still like to hear from you experts if there's a better method (I'm sure there aren't any worse ones! :D ).
Edit 2: Although angainor's solution is surely superior, I stuck with my own. For future reference for anyone who has the same question, here are the basics of how I did it:
/* Is called by all threads */
static void PrintData(realtype t, N_Vector u, UserData data) {
    ... declarations and such ...
    for (n = 1; n <= my_length; n++) {
        mass_num = my_base + n;
        z[mass_num - 1] = udata[n - 1];
        z[mass_num - 1 + N] = udata[n - 1 + my_length];
    }
    if (my_pe != 0) {
        MPI_Send(z, 2*N, PVEC_REAL_MPI_TYPE, 0, my_pe, comm);
    } else {
        for (i = 1; i < npes; i++) {
            MPI_Recv(z1, 2*N, PVEC_REAL_MPI_TYPE, i, i, comm, &status);
            for (n = 0; n < 2*N; n++)
                z[n] = z[n] + z1[n];
        }
        ... now I can print it out however I like ...
    }
    return;
}
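Incidentally, the send/receive-and-sum loop above is exactly what a reduction does in one collective call. A hedged sketch of a drop-in replacement, reusing the names from the snippet (it assumes N is a compile-time constant and that z holds this rank's contribution with zeros elsewhere):
/* every rank calls this collectively; zglobal is meaningful on rank 0 only */
realtype zglobal[2 * N];
MPI_Reduce(z, zglobal, 2 * N, PVEC_REAL_MPI_TYPE, MPI_SUM, 0, comm);
if (my_pe == 0) {
    /* print zglobal however you like */
}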
When using MPI, the individual threads do not have access to a 'global' vector. They are not threads; they are processes that can run on different physical computers, and therefore cannot have direct access to global data.
To do what you want, you can either send the vector to one of the MPI processes (you did that) and print it there, or print the local worker parts in sequence. Use a function like this:
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
    int i;
    int curthr = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    while (curthr != nthr) {
        if (curthr == thrid) {
            printf("thread %i writing\n", thrid);
            for (i = 0; i < vec_dim; i++) printf("%d\n", v[i]);
            fflush(stdout);
            curthr++;
            MPI_Bcast(&curthr, 1, MPI_INT, thrid, MPI_COMM_WORLD);
        } else {
            MPI_Bcast(&curthr, 1, MPI_INT, curthr, MPI_COMM_WORLD);
        }
    }
}
All MPI processes should call it at the same time, since there is a barrier and a broadcast inside. Essentially, the procedure makes sure that all the MPI processes print their vector part in order, starting from rank 0. The data is not garbled, since only one process writes at any given time.
In the example above, the broadcast is used because it gives more flexibility in the order in which the processes print their results: the process that is currently writing can decide who comes next. You could also skip the broadcast and use only a barrier:
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
    int i;
    int curthr = 0;
    while (curthr != nthr) {
        if (curthr == thrid) {
            printf("thread %i writing\n", thrid);
            for (i = 0; i < vec_dim; i++) printf("%d\n", v[i]);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        curthr++;
    }
}
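For example (an illustrative call, not from the original answer), every rank would invoke it collectively on its local piece:
int myrank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
int local[4] = { myrank, myrank + 1, myrank + 2, myrank + 3 };
MPI_write_ivector(myrank, nprocs, 4, local);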

MPI wrapper that imitates OpenMP's for-loop pragma

I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region(chunk_size=100, num_proc=10);
for (int i = 0; i < 1000; i++)
{
    // some computation
}
end_parallel_region();
The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synched and are put idle.
EDITED in response to High Performance Mark.
I have no intention of simulating OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathematical functions.
In these mathematical functions, there often exist for loops like the one below.
for (int i = 0; i < n; i++)
{
    s = s + sin(x[i]);
}
So I first want to be able to distribute sin(x[i]) to the slave processors and, at the end, reduce into a single variable, just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
Thanks.
There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops; what is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade one-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
EDIT
Well, if you've distributed x across the processes, then each process can compute sin(x[i]) for its part, and you can reduce the sum onto one process using MPI_Reduce.
I must be missing something about your requirements, because I just can't see why you would want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains no: there is no such wrapper as you seek, and the rest of my answer is mere commentary.
Yes, you could do this for specific tasks, but you shouldn't.
Consider how you might implement it: the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

typedef struct state_t {
    int globaln;
    int localn;
    int *locals;
    int *offsets;
    double *localin;
    double *localout;
    double (*map)(double);
} state;

state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
    state *s = malloc(sizeof(state));
    s->globaln = n;
    s->map = map;

    /* figure out decomposition */
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s->locals = malloc(size * sizeof(int));
    s->offsets = malloc(size * sizeof(int));
    s->offsets[0] = 0;
    for (int i = 0; i < size; i++) {
        s->locals[i] = (n + i) / size;
        if (i < size - 1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
    }

    /* allocate local arrays */
    s->localn = s->locals[rank];
    s->localin = malloc(s->localn * sizeof(double));
    s->localout = malloc(s->localn * sizeof(double));

    /* distribute */
    MPI_Scatterv(in, s->locals, s->offsets, MPI_DOUBLE,
                 s->localin, s->locals[rank], MPI_DOUBLE,
                 0, MPI_COMM_WORLD);
    return s;
}

double end_parallel_mapandsum(state **s) {
    double localanswer = 0., answer;

    /* sum up local answers */
    for (int i = 0; i < ((*s)->localn); i++) {
        localanswer += ((*s)->localout)[i];
    }

    /* and get the global result; everyone gets the answer */
    MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free((*s)->localin);
    free((*s)->localout);
    free((*s)->locals);
    free((*s)->offsets);
    free(*s);
    return answer;
}

int main(int argc, char **argv) {
    int rank;
    double *inputs = NULL;   /* only allocated on rank 0 */
    double result;
    int n = 100;
    const double pi = 4. * atan(1.);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        inputs = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) {
            inputs[i] = 2. * pi / n * i;
        }
    }

    state *s = begin_parallel_mapandsum(inputs, n, sin);
    for (int i = 0; i < s->localn; i++) {
        s->localout[i] = (s->map)(s->localin[i]);
    }
    result = end_parallel_mapandsum(&s);

    if (rank == 0) {
        printf("Calculated result: %lf\n", result);
        double trueresult = 0.;
        for (int i = 0; i < n; i++) trueresult += sin(inputs[i]);
        printf("True result: %lf\n", trueresult);
    }

    MPI_Finalize();
}
That constant distribute/gather is a terrible communications burden just to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit in any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The scatter/gather approach above will just kill performance. The more idiomatic distributed-memory approach is for the logic of the program to have distributed the data already - that is, your begin_parallel_region and end_parallel_region would have been built into the code above your loop at the very beginning. Then every loop is just
for (int i = 0; i < localn; i++)
{
    s = s + sin(x[i]);
}
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.
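For the running sum in this loop, that exchange is a single reduction call; a sketch (assuming s is this rank's partial sum):
/* combine the partial sums; every rank receives the total */
double stotal;
MPI_Allreduce(&s, &stotal, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);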
Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:
http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html

Bakery Lock when used inside a struct doesn't work

I'm new to multi-threaded programming, and I tried to code the Bakery Lock algorithm in C.
Here is the code:
int number[N]; // N is the number of threads
int choosing[N];

void lock(int id) {
    choosing[id] = 1;
    number[id] = max(number, N) + 1;
    choosing[id] = 0;

    for (int j = 0; j < N; j++)
    {
        if (j == id)
            continue;
        while (1)
            if (choosing[j] == 0)
                break;
        while (1)
        {
            if (number[j] == 0)
                break;
            if (number[j] > number[id]
                || (number[j] == number[id] && j > id))
                break;
        }
    }
}

void unlock(int id) {
    number[id] = 0;
}
Then I ran the following example: 100 threads, each running this code:
for (i = 0; i < 10; ++i) {
    lock(id);
    counter++;
    unlock(id);
}
After all threads have executed, the value of the shared counter is 10 * 100 = 1000, which is the expected value. I executed my program multiple times and the result was always 1000, so the implementation of the lock seems correct. That seemed weird, based on a previous question I had asked, because I didn't use any memory barriers/fences. Was I just lucky?
Then I wanted to create a multi-threaded program that will use many different locks. So I created this (full code can be found here):
typedef struct {
    int number[N];
    int choosing[N];
} LOCK;
and the code changes to:
void lock(LOCK l, int id)
{
    l.choosing[id] = 1;
    l.number[id] = max(l.number, N) + 1;
    l.choosing[id] = 0;
    ...
Now when executing my program, I sometimes get 997, sometimes 998, and sometimes 1000. So the lock algorithm isn't correct.
What am I doing wrong? What can I do to fix it?
Is it perhaps a problem that I'm now reading the arrays number and choosing from a struct, and that's not atomic or something?
Should I use memory fences, and if so, at which points? (I tried using asm("mfence") at various points in my code, but it didn't help.)
With pthreads, the standard states that accessing a variable in one thread while another thread is, or might be, modifying it is undefined behavior. Your code does this all over the place. For example:
while (1)
    if (choosing[j] == 0)
        break;
This code accesses choosing[j] over and over while waiting for another thread to modify it. The compiler is entirely free to modify this code as follows:
int cj = choosing[j];
while (1)
    if (cj == 0)
        break;
Why? Because the standard is clear that another thread may not modify the variable while this thread may be accessing it, so the value can be assumed to stay the same. But clearly, that won't work.
It can also do this:
while (1)
{
    int cj = choosing[j];
    if (cj == 0) break;
    choosing[j] = cj;
}
Same logic. It is perfectly legal for the compiler to write back a variable whether it has been modified or not, so long as it does so at a time when the code could be accessing the variable. (Because, at that time, it's not legal for another thread to modify it, so the value must be the same and the write is harmless. In some cases, the write really is an optimization and real-world code has been broken by such writebacks.)
If you want to write your own synchronization functions, you have to build them with primitive functions that have the appropriate atomicity and memory visibility semantics. You must follow the rules or your code will fail, and fail horribly and unpredictably.
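For example (a sketch of the idea, not a full review of the bakery algorithm), with C11 atomics the shared flags can be declared so that every access is a synchronizing operation, which forbids the transformations shown above:
#include <stdatomic.h>

#define N 100 /* number of threads, as in the question */

/* atomic flags: each load re-reads memory and carries ordering */
atomic_int number[N];
atomic_int choosing[N];

void await_choice(int j)
{
    /* the compiler may not cache this value in a register */
    while (atomic_load(&choosing[j]) != 0)
        ; /* spin until thread j finishes choosing */
}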
