the error of MPI initialization - c

I have a question about parallelism. I have a portion of code which I have applied the concept of parallelizme and this part of code must be repeat N times in a loop, but I can not initialize the MPI in the loop because shows " MPI_Init(89): Cannot call MPI_INIT or MPI_INIT_THREAD more than once" and if I boot before the loop each process it will handle all the loop and it is not that the goal.
for (int i = 0; i <N; i ++)
{
the parallel area
}
I want that for every i in the loop, the K processes execute the parallel area.

The canonical way to distribute calculations in your case is to run the loop in all MPI processes and make each rank except rank 0 skip the serial parts:
// Obtain the rank
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
for (i = 0; i < N; i++)
{
if (rank == 0)
{
// Some serial calculations
}
//
// Parallel part
//
if (rank == 0)
{
// More serial calculations within the loop
}
}
If you need to communicate data produced in rank 0 in the serial calculations, you could either use point-to-point operations or collectives like MPI_Bcast or MPI_Scatter at the beginning of the parallel part. You could then bring the distributed data back to rank 0 for further processing in the second serial part with MPI_Gather or MPI_Reduce at the end of the parallel block.
Another canonical approach is to use the master-worker pattern where one process (the master) distributes work items to a set of worker processes that are simply spinning in a loop of receive work -> process work -> return results.
As to the multiple initialisation of MPI, one could check if it has already been done:
int done_already;
MPI_Initialized(&done_already);
if (!done_already)
MPI_Init(NULL, NULL);
Note that once you finalise MPI by calling MPI_Finalize, it cannot be reinitialised for the duration of the current execution of the program.

Related

MPI get the number of non-finalized processes

I have the following code
int main(void) {
...
int numtasks, rank;
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int k = 2;
if (rank == 3) {
do {
...
} while (number_of_not_finalized_processes >= numtasks - k);
}
...
MPI_Finalize();
}
and want to loop process with rank = 3 until k processes are finalized with MPI_Finalize(). How can I find out, how many processes have been finalized and define number_of_not_finalized_processes ? Or maybe I have any better alternative way to find out how many processes are still alive or passed the certain part of code?
You can solve this with one-sided communication. Create a window where every process before it finalizes:
Acquires a lock on the window on rank 3
Uses MPI_Accumulate to increase the counter
This is a very asynchronous solution. If your processes do something synchronized, say a loop where they dynamically decide to exit the loop, you could for instance at the synchonization point create a subcommunicator of only the active processes.

Multi-threads program architecture C pthread

I'd like to create multi-threads program in C (Linux) with:
Infinite loop with infinite number of tasks
One thread per one task
Limit the total number of threads, so if for instance total threads number is more then MAX_THREADS_NUMBER, do sleep(), until total threads number become less then MAX_THREADS_NUMBER, continue after.
Resume: I need to do infinite number of tasks(one task per one thread) and I'd like to know how to implement it using pthreads in C.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#define MAX_THREADS 50
pthread_t thread[MAX_THREADS];
int counter;
pthread_mutex_t lock;
void* doSomeThing(void *arg)
{
pthread_mutex_lock(&lock);
counter += 1;
printf("Job %d started\n", counter);
pthread_mutex_unlock(&lock);
return NULL;
}
int main(void)
{
int i = 0;
int ret;
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
// Wait all threads to finish
for (i = 0; i < MAX_THREADS; i++) {
pthread_join(thread[i], NULL);
}
pthread_mutex_destroy(&lock);
return 0;
}
How to make this loop infinite?
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
I need something like this:
while (1) {
if (thread_number > MAX_THREADS_NUMBER)
sleep(1);
ret = pthread_create(...);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
Your current program is based on a simple dispatch design: the initial thread creates worker threads, assigning each one a task to perform. Your question is, how you make this work for any number of tasks, any number of worker threads. The answer is, you don't: your chosen design makes such a modification basically impossible.
Even if I were to answer your stated questions, it would not make the program behave the way you'd like. It might work after a fashion, but it'd be like a bicycle with square wheels: not very practical, nor robust -- not even fun after you stop laughing at how silly it looks.
The solution, as I wrote in a comment to the original question, is to change the underlying design: from a simple dispatch to a thread pool approach.
Implementing a thread pool requires two things: First, is to change your viewpoint from starting a thread and having it perform a task, to each thread in the "pool" grabbing a task to perform, and returning to the "pool" after they have performed it. Understanding this is the hard part. The second part, implementing a way for each thread to grab a new task, is simple: this typically centers around a data structure, protected with locks of some sort. The exact data structure does depend on what the actual work to do is, however.
Let's assume you wanted to parallelize the calculation of the Mandelbrot set (or rather, the escape time, or the number of iterations needed before a point can be ruled to be outside the set; the Wikipedia page contains pseudocode for exactly this). This is one of the "embarrassingly parallel" problems; those where the sub-problems (here, each point) can be solved without any dependencies.
Here's how I'd do the core of the thread pool in this case. First, the escape time or iteration count needs to be recorded for each point. Let's say we use an unsigned int for this. We also need the number of points (it is a 2D array), a way to calculate the complex number that corresponds to each point, plus some way to know which points have either been computed, or are being computed. Plus mutually exclusive locking, so that only one thread will modify the data structure at once. So:
typedef struct {
int x_size, y_size;
size_t stride;
double r_0, i_0;
double r_dx, i_dx;
double r_dy, i_dy;
unsigned int *iterations;
sem_t done;
pthread_mutex_t mutex;
int x, y;
} fractal_work;
When an instance of fractal_work is constructed, x_size and y_size are the number of columns and rows in the iterations map. The number of iterations (or escape time) for point x,y is stored in iterations[x+y*stride]. The real part of the complex coordinate for that point is r_0 + x*r_dx + y*r_dy, and imaginary part is i_0 + x*i_dx + y*i_dy (which allows you to scale and rotate the fractal freely).
When a thread grabs the next available point, it first locks the mutex, and copies the x and y values (for itself to work on). Then, it increases x. If x >= x_size, it resets x to zero, and increases y. Finally, it unlocks the mutex, and calculates the escape time for that point.
However, if x == 0 && y >= y_size, the thread posts on the done semaphore and exits, letting the initial thread know that the fractal is complete. (The initial thread just needs to call sem_wait() once for each thread it created.)
The thread worker function is then something like the following:
void *fractal_worker(void *data)
{
fractal_work *const work = (fractal_work *)data;
int x, y;
while (1) {
pthread_mutex_lock(&(work->mutex));
/* No more work to do? */
if (work->x == 0 && work->y >= work->y_size) {
sem_post(&(work->done));
pthread_mutex_unlock(&(work->mutex));
return NULL;
}
/* Grab this task (point), advance to next. */
x = work->x;
y = work->y;
if (++(work->x) >= work->x_size) {
work->x = 0;
++(work->y);
}
pthread_mutex_unlock(&(work->mutex));
/* z.r = work->r_0 + (double)x * work->r_dx + (double)y * work->r_dy;
z.i = work->i_0 + (double)x * work->i_dx + (double)y * work->i_dy;
TODO: implement the fractal iteration,
and count the iterations (say, n)
save the escape time (number of iterations)
in the work->iterations array; e.g.
work->iterations[(size_t)x + work->stride*(size_t)y] = n;
*/
}
}
The program first creates the fractal_work data structure for the worker threads to work on, initializes it, then creates some number of threads giving each thread the address of that fractal_work structure. It can then call fractal_worker() itself too, to "join the thread pool". (This pool automatically "drains", i.e. threads will return/exit, when all points in the fractal are done.)
Finally, the main thread calls sem_wait() on the done semaphore, as many times as it created worker threads, to ensure all the work is done.
The exact fields in the fractal_work structure above do not matter. However, it is at the very core of the thread pool. Typically, there is at least one mutex or rwlock protecting the work details, so that each worker thread gets unique work details, as well as some kind of flag or condition variable or semaphore to let the original thread know that the task is now complete.
In a multithreaded server, there is usually only one instance of the structure (or variables) describing the work queue. It may even contain things like minimum and maximum number of threads, allowing the worker threads to control their own number to dynamically respond to the amount of work available. This sounds magical, but is actually simple to implement: when a thread has completed its work, or is woken up in the pool with no work, and is holding the mutex, it first examines how many queued jobs there are, and what the current number of worker threads is. If there are more than the minimum number of threads, and no work to do, the thread reduces the number of threads, and exits. If there are less than the maximum number of threads, and there is a lot of work to do, the thread first creates a new thread, then grabs the next task to work on. (Yes, any thread can create new threads into the process. They are all on equal footing, too.)
A lot of the code in a practical multithreaded application using one or more thread pools to do work, is some sort of bookkeeping. Thread pool approaches very much concentrates on the data, and the computation needed to be performed on the data. I'm sure there are much better examples of thread pools out there somewhere; the hard part is to think of a good task for the application to perform, as the data structures are so task-dependent, and many computations are so simple that parallelizing them makes no sense (since creating new threads does have a small computational cost, it'd be silly to waste time creating threads when a single thread does the same work in the same or less time).
Many tasks that benefit from parallelization, on the other hand, require information to be shared between workers, and that requires a lot of thinking to implement correctly. (For example, although solutions exist for parallelizing molecular dynamics simulations efficiently, most simulators still calculate and exchange data in separate steps, rather than at the same time. It's just that hard to do right, you see.)
All this means that you cannot expect to be able to write the code, unless you understand the concept. Indeed, truly understanding the concepts are the hard part: writing the code is comparatively easy.
Even in the above example, there are certain tripping points: Does the order of posting the semaphore and releasing the mutex matter? (Well, it depends on what the thread that is waiting for the fractal to complete does -- and indeed, if it is waiting yet.) If it was a condition variable instead of a semaphore, it would be essential that the thread that is interested in the fractal completion is waiting on the condition variable, otherwise it would miss the signal/broadcast. (This is also why I used a semaphore.)

How to call same rank in MPI using C language?

I am trying to learn MPI programing and have written the following program. It adds an entire row of array and outputs the sum. At rank 0 (or process 0), it will call all its slave ranks to do the calculation. I want to do this using only two other slave ranks/process. Whenever I try to invoke same rank twice as shown in the code bellow, my code would just hang in the middle and wouldn't execute. If I don't call the same rank twice, code would work correctly
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int tag2 = 1;
int arr[30] = {0};
MPI_Request request;
MPI_Status status;
printf ("\n--Current Rank: %d\n", world_rank);
int index;
int source = 0;
int dest;
if (world_rank == 0)
{
int i;
printf("* Rank 0 excecuting\n");
index = 0;
dest = 1;
for ( i = 0; i < 30; i++ )
{
arr[ i ] = i + 1;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
index = 0;
dest = 2;
for ( i = 0; i < 30; i++ )
{
arr[ i ] = 0;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
index = 0;
dest = 2; //Problem happens here when I try to call the same destination(or rank 2) twice
//If I change this dest value to 3 and run using: mpirun -np 4 test, this code will work correctly
for ( i = 0; i < 30; i++ )
{
arr[ i ] = 1;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
}
else
{
int sum = 0;
int i;
MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i<30; i++)
{
sum = arr[i]+sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
}
MPI_Finalize();
}
Result when using: mpirun -np 3 test
--Current Rank: 2
Sum is: 0 at rank: 2
--Current Rank: 0
* Rank 0 excecuting
--Current Rank: 1
Sum is: 524800 at rank: 1
//Program hangs here and wouldn't show sum of 30
Please let me know how can I call the same rank twice. For example, if I only had two other slave process that I can call.
Please show with an example if possible
In MPI, each process executes the same code and, as you're doing, you differentiate the different processes primarily through checking rank in if/else statements. The master process with rank 0 is doing 3 sends: a send to process 1, then two sends to process 2. The slave processes each do only one receive, which means that rank 1 receives its first message and rank 2 receives its first message. When you call the third MPI_Send on process 0, there is not and will not be any slave waiting to receive the message after that point, as the slaves have finished executing their else block. The program gets blocked as the master waits to send the final message.
In order to fix that, you have to make sure the slave of rank 2 performs two receives, either by adding a loop for that process only or by repeating for that process only (so, using an if(world_rank == 2) check) the block of code
sum = 0; //resetting sum
MPI_Irecv(&arr[0], 1024, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i<1024; i++)
{
sum = arr[i]+sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
TL;DR: Just a remark that master/slave approach isn't to be favoured and might format programmer's mind durably, leading to poor code when put in production
Although Clarissa is perfectly correct and her answer is very clear, I'd like to add a few general remarks, not about the code itself, but about parallel computing philosophy and good habits.
First a quick preamble: when one wants to parallelise one's code, it can be for two main reasons: making it faster and/or permitting it to handle larger problems by overcoming the limitations (like memory limitation) found on a single machine. But in all cases, performance matters and I will always assume that MPI (or generally speaking parallel) programmers are interested and concerned by performance. So the rest of my post will suppose that you are.
Now the main reason of this post: over the past few days, I've seen a few questions here on SO about MPI and parallelisation, obviously coming from people eager to learn MPI (or OpenMP for that matter). This is great! Parallel programming is great and there'll never be enough parallel programmers. So I'm happy (and I'm sure many SO members are too) to answer questions helping people to learn how to program in parallel. And in the context of learning how to program in parallel, you have to write some simple codes, doing simple things, in order to understand what the API does and how it works. These programs might look stupid from the distance and very ineffective, but that's fine, that's how leaning works. Everybody learned this way.
However, you have to keep in mind that these programs you write are only that: API learning exercises. They're not the real thing and they do not reflect the philosophy of what an actual parallel program is or should be. And what drives my answer here is that I've seen here and in other questions and answers put forward recurrently the status of "master" process and "slaves" ones. And that is wrong, fundamentally wrong! Let me explain why:
As Clarissa perfectly pinpointed, "in MPI, each process executes the same code". The idea is to find a way of making several processes to interact for working together in solving a (possibly larger) problem (hopefully faster). But amongst these processes, none gets any special status, they are all equal. They are given an id to be able to address them, but rank 0 is no better than rank 1 or rank 1025... By artificially deciding that process #0 is the "master" and the others are its "slaves", you break this symmetry and it has consequences:
Now that rank #0 is the master, it commands, right? That's what a master does. So it will be the one getting the information necessary to run the code, will distribute it's share to the workers, will instruct them to do the processing. Then it will wait for the processing to be concluded (possibly getting itself busy in-between but more likely just waiting or poking the workers, since that's what a master does), collect the results, do a bit of reassembling and output it. Job done! What's wrong with that?
Well, the following are wrong:
During the time the master gets the data, slaves are sitting idle. This is sequential and ineffective processing...
Then the distribution of the data and the work to do implies a lot of transfers. This takes time and since it is solely between process #0 and all the others, this might create a lot of congestions on the network in one single link.
While workers do their work, what should the master do? Working as well? If yes, then it might not be readily available to handle requests from the slaves when they come, delaying the whole parallel processing. Waiting for these requests? Then it wastes a lot of computing power by sitting idle... Ultimately, there is no good answer.
Then points 1 and 2 are repeated in reverse order, with the gathering of the results and outputting or results. That's a lot of data transfers and sequential processing, which will badly damage the global scalability, effectiveness and performance.
So I hope you now see why master/slaves approach is (usually, not always but very often) wrong. And the danger I see from the questions and answers I've read of the past days is that you might get you mind formatted in this approach as if it was the "normal" way of thinking in parallel. Well, it is not! Parallel programming is symmetry. It is handling the problem globally, in all places at the same time. You have to think parallel from the start and see your code as a global parallel entity, not just a brunch of processes that need to be instructed on what to do. Each process is it's own master, dealing with it's peers on an equal ground. Each process should (as much as possible) acquire its data by itself (making it a parallel processing); decide what to do based on the number of peers involved in the processing and its id; exchange information with its peers when necessary, should it be locally (point to point communications) or globally (collective communications); and issue its own share of the result (again leading to parallel processing)...
OK, that's a bit extreme a requirement for people just starting to learn parallel programming and I want by no mean to tell you that your learning exercises should be like this. But keep the goal in mind and don't forget that API learning examples are only API learning examples, not reduced models of actual codes. So keep on experimenting with MPI calls to understand what they do and how they work, but try to slowly tend towards symmetrical approach on your examples. That can only be beneficial for you in the long term.
Sorry for this lengthy and somewhat off topic answer, and good luck with your parallel programming.

Changing value of a variable with MPI

#include<stdio.h>
#include<mpi.h>
int a=1;
int *p=&a;
int main(int argc, char **argv)
{
MPI_Init(&argc,&argv);
int rank,size;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
//printf("Address val: %u \n",p);
*p=*p+1;
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
printf("Value of a : %d\n",*p);
return 0;
}
Here, I am trying to execute the program with 3 processes where each tries to increment the value of a by 1, so the value at the end of execution of all processes should be 4. Then why does the value printed as 2 only at the printf statement after MPI_Finalize(). And isnt it that the parallel execution stops at MPI_Finalize() and there should be only one process running after it. Then why do I get the print statement 3 times, one for each process, during execution?
It is a common misunderstanding to think that mpi_init starts up the requested number of processes (or whatever mechanism is used to implement MPI) and that mpi_finalize stops them. It's better to think of mpi_init starting the MPI system on top of a set of operating-system processes. The MPI standard is silent on what MPI actually runs on top of and how the underlying mechanism(s) is/are started. In practice a call to mpiexec (or mpirun) is likely to fire up a requested number of processes, all of which are alive when the program starts. It is also likely that the processes will continue to live after the call to mpi_finalize until the program finishes.
This means that prior to the call to mpi_init, and after the call to mpi_finalize it is likely that there is a number of o/s processes running, each of them executing the same program. This explains why you get the printf statement executed once for each of your processes.
As to why the value of a is set to 2 rather than to 4, well, essentially you are running n copies of the same program (where n is the number of processes) each of which adds 1 to its own version of a. A variable in the memory of one process has no relationship to a variable of the same name in the memory of another process. So each process sets a to 2.
To get any data from one process to another the processes need to engage in message-passing.
EDIT, in response to OP's comment
Just as a variable in the memory of one process has no relationship to a variable of the same name in the memory of another process, a pointer (which is a kind of variable) has no relationship to a pointer of the same name in the memory of another process. Do not be fooled, if the ''same'' pointer has the ''same'' address in multiple processes, those addresses are in different address spaces and are not the same, the pointers don't point to the same place.
An analogy: 1 High Street, Toytown is not the same address as 1 High Street, Legotown; there is a coincidence in names across address spaces.
To get any data (pointer or otherwise) from one process to another the processes need to engage in message-passing. You seem to be clinging to a notion that MPI processes share memory in some way. They don't, let go of that notion.
Since MPI is only giving you the option to communicate between separate processes, you have to do message passing. For your purpose there is something like MPI_Allreduce, which can sum data over the separate processes. Note that this adds the values, so in your case you want to sum the increment, and add the sum later to p:
int inc = 1;
MPI_Allreduce(MPI_IN_PLACE, &inc, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
*p += inc;
In your implementation there is no communication between the spawned threads. Each process has his own int a variable which it increments and prints to the screen. Making the variable global doesn't make it shared between processes and all the pointer gimmicks show me that you don't know what you are doing. I would suggest learning a little more C and Operating Systems before you move on.
Anyway, you have to make the processes communicate. Here's how an example might look like:
#include<stdio.h>
#include<mpi.h>
// this program will count the number of spawned processes in a *very* bad way
int main(int argc, char **argv)
{
int partial = 1;
int sum;
int my_id = 0;
// let's just assume the process with id 0 is root
int root_process = 0;
// spawn processes, etc.
MPI_Init(&argc,&argv);
// every process learns his id
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
// all processes add their 'partial' to the 'sum'
MPI_Reduce(&partial, &sum, 1, MPI_INT, MPI_SUM, root_process, MPI_COMM_WORLD);
// de-init MPI
MPI_Finalize();
// the root process communicates the summation result
if (my_id == root_process)
{
printf("Sum total : %d\n", sum);
}
return 0;
}

MPI with C: Passive RMA synchronization

as I found no answer for my question so far and am on the edge of going crazy about the problem, I just ask the question tormenting my mind ;-)
I'm working on a parallelization of a node-elimination algorithm I already programmed. Target environment is a cluster.
In my parallel program I distinguish on master process (in my case rank 0) and the working slaves (every rank except 0).
My idea is it, that the master is keeping track which slaves are available and send them then work. Therefore and for some other reasons I try to establish a workflow basing on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which for every position in the array representing a rank is either 0 for a working process or 1 for an available process (so if schedule[1]=1 one is available for work).
If a process is done with its work, it puts in the array on the master the 1 signalising its availability. The code I tried for that is as follows:
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // a exclusive window is locked on process 0
printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the line myrank of schedule is put into process 0
printf("Process %d:\t put operation called\n",myrank);
MPI_Win_unlock(0,win); // the window is unlocked
It worked perfectly, especially when the master process was synchronized with a barrier to the end of the lock because then the output of master was made after the put operation.
As a next step I tried to let master check on regular basis whether there are available slaves or not. Therefore I created a while-loop to repeat until every process signalized its availability (I repeat that it is program teaching me the principles, I know that the implementation still doesn't do what I want).
The loop is in a base variant just printing my array schedule and then checking in a function fnz whether there are other working processes than master:
while(j!=1){
printf("Process %d:\t following schedule evaluated:\n",myrank);
for(i=0;i<size;i++)printf("%d\t",schedule[i]);//print the schedule
printf("\n");
j=fnz(schedule);
}
And then the concept blew up. After inverting the process and getting the required information with get from the slaves by the master instead of putting it with put from the slaves to the master I found out my main problem is the acquiring of the lock: the unlock command doesn't succeed because in the case of the put the lock isn't granted at all and in the case of the get the lock is only granted when the slave process is done with its work and waiting in a barrier. In my opinion there has to be a serious error in my thinking. It can't be the idea of passive RMA that the lock can only be achieved when the target process is in a barrier synchronizing the whole communicator. Then I could just go along with standard Send/Recv operations. What I want to achieve is, that process 0 is working all the time in delegating work and being able by RMA of the slaves to identify to whom it can delegate.
Can please someone help me and explain how I can get a break on process 0 to allow the other processes getting locks?
Thank you in advance!
UPDATE:
I'm not sure if you ever worked with a lock and just want to stress out that I'm perfectly able to get an updated copy of a remote memory window. If I get the availability from the slaves the lock is only granted when the slaves are waiting in a barrier. So what I got to work is, that process 0 performs lock-get-unlock while process 1 and 2 are simulating work such that process 2 is remarkably longer occupied than one. what I expect as a result is that process 0 prints a schedule (0,1,0) because process 0 isn't asked at all wether it's working, process 1 is done with working and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since the slaves are both ready for new work. What I get is that the slaves only grant the lock for process 0 when they are waiting in a barrier, so that the first and only output I get at all is the last one I expect, showing me that the lock was granted for each individual process first, when it was done with its work. So if please someone could tell me when a lock can be granted by the target process instead of trying to confuse my knowledge about passive RMA, I would be very grateful
First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation in the target process
For portability and performance reasons the memory for the window should be allocated using MPI_ALLOC_MEM:
int size;
MPI_Comm_rank(MPI_COMM_WORLD, &size);
int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
{
schedule[i] = 0;
}
MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
MPI_COMM_WORLD, &win);
...
MPI_Win_free(win);
MPI_Free_mem(schedule);
Step 2: Memory synchronisation at the target
The MPI standard forbids concurrent access to the same location in the window (§11.3 from the MPI-2.2 specification):
It is erroneous to have concurrent conflicting accesses to the same memory location in a
window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.
Therefore each access to schedule[] in the target has to be protected by a lock (shared since it only reads the memory location):
while (!ready)
{
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
ready = fnz(schedule, oldschedule, size);
MPI_Win_unlock(0, win);
}
Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using non-RDMA capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):
6 . An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.
If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it becomes necessary to update the public window copy, even if the window owner does not execute any related
synchronization call.
This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):
Advice to users. A user can write correct programs by following the following rules:
...
lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are
protected by shared locks, both for local accesses and for RMA accesses.
Step 3: Providing the correct parameters to MPI_PUT at the origin
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window. The correct invocation given that the window at the target was created with disp_unit == sizeof(int) is:
int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
The local value of one is thus transferred into rank * sizeof(int) bytes following the beginning of the window at the target. If disp_unit was set to 1, the correct put would be:
MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
Step 4: Dealing with implementation specifics
The above detailed program works out-of-the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations - rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to start the program like following in order to specifically select the pt2pt component:
$ mpiexec --mca osc pt2pt ...
A fully working C99 sample code follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
static double starttime = -1.0;
int diff = 0;
for (int i = 0; i < size; i++)
diff |= (schedule[i] != oldschedule[i]);
if (diff)
{
int res = 0;
if (starttime < 0.0) starttime = MPI_Wtime();
printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
for (int i = 0; i < size; i++)
{
printf("\t%d", schedule[i]);
res += schedule[i];
oldschedule[i] = schedule[i];
}
printf("\n");
return(res == size-1);
}
return 0;
}
int main (int argc, char **argv)
{
MPI_Win win;
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
int *oldschedule = malloc(size * sizeof(int));
// Use MPI to allocate memory for the target window
int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
{
schedule[i] = 0;
oldschedule[i] = -1;
}
// Create a window. Set the displacement unit to sizeof(int) to simplify
// the addressing at the originator processes
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
MPI_COMM_WORLD, &win);
int ready = 0;
while (!ready)
{
// Without the lock/unlock schedule stays forever filled with 0s
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
ready = fnz(schedule, oldschedule, size);
MPI_Win_unlock(0, win);
}
printf("All workers checked in using RMA\n");
// Release the window
MPI_Win_free(&win);
// Free the allocated memory
MPI_Free_mem(schedule);
free(oldschedule);
printf("Master done\n");
}
else
{
int one = 1;
// Worker processes do not expose memory in the window
MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
// Simulate some work based on the rank
sleep(2*rank);
// Register with the master
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
MPI_Win_unlock(0, win);
printf("Worker %d finished RMA\n", rank);
// Release the window
MPI_Win_free(&win);
printf("Worker %d done\n", rank);
}
MPI_Finalize();
return 0;
}
Sample output with 6 processes:
$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule: 0 0 0 0 0 0
[ 1.995] Schedule: 0 1 0 0 0 0
Worker 1 finished RMA
[ 3.989] Schedule: 0 1 1 0 0 0
Worker 2 finished RMA
[ 5.988] Schedule: 0 1 1 1 0 0
Worker 3 finished RMA
[ 7.995] Schedule: 0 1 1 1 1 0
Worker 4 finished RMA
[ 9.988] Schedule: 0 1 1 1 1 1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done
The answer by Hristo Lliev works perfectly if I use newer versions of the Open-MPI library.
However, on the cluster we are currently using, this is not possible and for the older versions there was deadlock behavior for the final unlock calls, as described by Hhristo. Adding the options --mca osc pt2pt did solve the deadlock in a sense but the MPI_Win_unlock calls still didn't seem to complete until the process owning the accessed variable did its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.
Therefore from a pragmatical point of view, though strictly speaking leaving the topic of passive RMA synchronization (for which I do apologize), I would like to point out a workaround which makes use of external files for those who are stuck with using old versions of the Open-MPI library so they don't have to loose so much time as I did:
You basically create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you don't even have to have a master process only dedicated to the bookkeeping of the slaves: It can also perform a job. Anyway, every process can go look in this file which job is to be done next and possibly determine that everything is done.
The important point is now that this information file is not accessed at the same time by multiple processes, as this might cause work to be duplicated or worse. The equivalent of the locking and unlocking of the window in MPI is here imitated easiest by using a locking file: This file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish by checking with a slight time delay whether the lock file still exists.
The full information can be found here.

Resources