MPI with C: Passive RMA synchronization

As I have found no answer to my question so far and am on the edge of going crazy about the problem, I'll just ask the question tormenting my mind ;-)
I'm working on the parallelization of a node-elimination algorithm I have already programmed. The target environment is a cluster.
In my parallel program I distinguish between a master process (rank 0 in my case) and the working slaves (every rank except 0).
My idea is that the master keeps track of which slaves are available and then sends them work. Therefore, and for some other reasons, I try to establish a workflow based on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which, for every position representing a rank, there is either a 0 for a working process or a 1 for an available process (so if schedule[1]=1, rank 1 is available for work).
If a process is done with its work, it puts a 1 in the array on the master, signalling its availability. The code I tried for that is as follows:
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // an exclusive lock is acquired on the window of process 0
printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the entry myrank of schedule is put into process 0
printf("Process %d:\t put operation called\n",myrank);
MPI_Win_unlock(0,win); // the window is unlocked
It worked perfectly, especially when the master process was synchronized with a barrier at the end of the lock, because then the master's output was produced after the put operation.
As a next step I tried to let the master check on a regular basis whether there are available slaves or not. Therefore I created a while-loop that repeats until every process has signalled its availability (I repeat that this is a program teaching me the principles; I know the implementation still doesn't do what I want).
In its basic variant the loop just prints my array schedule and then checks in a function fnz whether there are working processes other than the master:
while(j!=1){
    printf("Process %d:\t following schedule evaluated:\n",myrank);
    for(i=0;i<size;i++)
        printf("%d\t",schedule[i]); // print the schedule
    printf("\n");
    j=fnz(schedule);
}
And then the concept blew up. After inverting the roles and having the master obtain the required information from the slaves with get, instead of having the slaves put it to the master, I found out that my main problem is acquiring the lock: the unlock command doesn't succeed because in the put case the lock isn't granted at all, and in the get case the lock is only granted once the slave process is done with its work and waiting in a barrier. In my opinion there has to be a serious error in my thinking. It can't be the idea of passive RMA that the lock can only be acquired when the target process is in a barrier synchronizing the whole communicator; then I could just go along with standard Send/Recv operations. What I want to achieve is that process 0 works all the time delegating work, and is able, through RMA accesses by the slaves, to identify to whom it can delegate.
Can someone please help me and explain how I can get process 0 to allow the other processes to obtain locks?
Thank you in advance!
UPDATE:
I'm not sure if you have ever worked with a lock, and I just want to stress that I'm perfectly able to get an updated copy of a remote memory window. If I get the availability from the slaves, the lock is only granted when the slaves are waiting in a barrier. So what I got to work is that process 0 performs lock-get-unlock while processes 1 and 2 are simulating work such that process 2 is occupied remarkably longer than process 1. What I expect as a result is that process 0 prints a schedule (0,1,0), because process 0 isn't asked at all whether it's working, process 1 is done with its work and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since both slaves are ready for new work. What I get instead is that the slaves only grant the lock to process 0 when they are waiting in a barrier, so that the first and only output I get at all is the last one I expect, showing me that the lock was granted for each individual process only once it was done with its work. So if someone could please tell me when a lock can be granted by the target process, instead of trying to confuse my knowledge about passive RMA, I would be very grateful.

First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation in the target process
For portability and performance reasons the memory for the window should be allocated using MPI_ALLOC_MEM:
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
{
    schedule[i] = 0;
}
MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);
...
MPI_Win_free(&win);
MPI_Free_mem(schedule);
Step 2: Memory synchronisation at the target
The MPI standard forbids concurrent access to the same location in the window (§11.3 of the MPI-2.2 specification):
It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.
Therefore each access to schedule[] in the target has to be protected by a lock (shared since it only reads the memory location):
while (!ready)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    ready = fnz(schedule, oldschedule, size);
    MPI_Win_unlock(0, win);
}
Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using non-RDMA capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):
6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.
If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of the private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it become necessary to update the public window copy, even if the window owner does not execute any related synchronization call.
This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):
Advice to users. A user can write correct programs by following the following rules:
...
lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.
Step 3: Providing the correct parameters to MPI_PUT at the origin
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would make every worker transfer its value into the first element of the target window. The correct invocation, given that the window at the target was created with disp_unit == sizeof(int), is:
int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
The local value of one is thus transferred to the location rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit were set to 1, the correct put would be:
MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
Step 4: Dealing with implementation specifics
The above detailed program works out of the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations - rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand, the pt2pt module progresses all operations as expected. Therefore, with Open MPI one has to start the program as follows in order to specifically select the pt2pt component:
$ mpiexec --mca osc pt2pt ...
A fully working C99 sample code follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
        diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
        int res = 0;

        if (starttime < 0.0) starttime = MPI_Wtime();

        printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
        for (int i = 0; i < size; i++)
        {
            printf("\t%d", schedule[i]);
            res += schedule[i];
            oldschedule[i] = schedule[i];
        }
        printf("\n");

        return(res == size-1);
    }
    return 0;
}

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int *oldschedule = malloc(size * sizeof(int));
        // Use MPI to allocate memory for the target window
        int *schedule;
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

        for (int i = 0; i < size; i++)
        {
            schedule[i] = 0;
            oldschedule[i] = -1;
        }

        // Create a window. Set the displacement unit to sizeof(int) to simplify
        // the addressing at the originator processes
        MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        int ready = 0;
        while (!ready)
        {
            // Without the lock/unlock schedule stays forever filled with 0s
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            ready = fnz(schedule, oldschedule, size);
            MPI_Win_unlock(0, win);
        }
        printf("All workers checked in using RMA\n");

        // Release the window
        MPI_Win_free(&win);
        // Free the allocated memory
        MPI_Free_mem(schedule);
        free(oldschedule);

        printf("Master done\n");
    }
    else
    {
        int one = 1;

        // Worker processes do not expose memory in the window
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        // Simulate some work based on the rank
        sleep(2*rank);

        // Register with the master
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);

        printf("Worker %d finished RMA\n", rank);

        // Release the window
        MPI_Win_free(&win);

        printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}
Sample output with 6 processes:
$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule: 0 0 0 0 0 0
[ 1.995] Schedule: 0 1 0 0 0 0
Worker 1 finished RMA
[ 3.989] Schedule: 0 1 1 0 0 0
Worker 2 finished RMA
[ 5.988] Schedule: 0 1 1 1 0 0
Worker 3 finished RMA
[ 7.995] Schedule: 0 1 1 1 1 0
Worker 4 finished RMA
[ 9.988] Schedule: 0 1 1 1 1 1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done

The answer by Hristo Iliev works perfectly if I use newer versions of the Open MPI library.
However, on the cluster we are currently using this is not possible, and with the older versions there was deadlock behaviour for the final unlock calls, as described by Hristo. Adding the option --mca osc pt2pt did solve the deadlock in a sense, but the MPI_Win_unlock calls still didn't seem to complete until the process owning the accessed variable did its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.
Therefore, from a pragmatic point of view, though strictly speaking leaving the topic of passive RMA synchronization (for which I do apologize), I would like to point out a workaround which makes use of external files, for those who are stuck with old versions of the Open MPI library, so they don't have to lose as much time as I did:
You basically create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you don't even have to have a master process dedicated only to the bookkeeping of the slaves: it can also perform a job. Anyway, every process can look in this file to see which job is to be done next and possibly determine that everything is done.
The important point now is that this information file is not accessed at the same time by multiple processes, as this might cause work to be duplicated or worse. The equivalent of the locking and unlocking of the window in MPI is most easily imitated here by using a lock file: this file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish by checking with a slight time delay whether the lock file still exists.
The full information can be found here.
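As a rough illustration (not part of the original workaround, and with invented file names), the lock-file protocol could be sketched like this using O_CREAT | O_EXCL:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Try to acquire the lock by creating the lock file exclusively;
// open() with O_CREAT | O_EXCL fails if the file already exists.
static void acquire_lock(const char *lockname)
{
    int fd;
    while ((fd = open(lockname, O_CREAT | O_EXCL | O_WRONLY, 0644)) == -1)
    {
        if (errno != EEXIST) { perror("open"); break; }
        usleep(10000); // another process holds the lock; wait a bit and retry
    }
    if (fd != -1) close(fd);
}

// Release the lock by removing the lock file
static void release_lock(const char *lockname)
{
    unlink(lockname);
}

// Usage: guard every access to the shared information file
//   acquire_lock("joblist.lock");
//   ... read/update joblist.txt to pick the next job ...
//   release_lock("joblist.lock");
Whether O_EXCL is reliable depends on the file system (it is not guaranteed on some network file systems), so treat this as a sketch of the idea rather than a robust implementation.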

Related

Multi-threads program architecture C pthread

I'd like to create a multi-threaded program in C (Linux) with:
An infinite loop with an infinite number of tasks
One thread per task
A limit on the total number of threads, so if for instance the total number of threads is more than MAX_THREADS_NUMBER, do sleep() until the total number of threads becomes less than MAX_THREADS_NUMBER, then continue.
To summarize: I need to do an infinite number of tasks (one task per thread) and I'd like to know how to implement it using pthreads in C.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_THREADS 50

pthread_t thread[MAX_THREADS];
int counter;
pthread_mutex_t lock;

void* doSomeThing(void *arg)
{
    pthread_mutex_lock(&lock);
    counter += 1;
    printf("Job %d started\n", counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    int i = 0;
    int ret;

    if (pthread_mutex_init(&lock, NULL) != 0)
    {
        printf("\n mutex init failed\n");
        return 1;
    }

    for (i = 0; i < MAX_THREADS; i++) {
        ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
        if (ret != 0)
            printf("\ncan't create thread :[%s]", strerror(ret));
    }

    // Wait for all threads to finish
    for (i = 0; i < MAX_THREADS; i++) {
        pthread_join(thread[i], NULL);
    }

    pthread_mutex_destroy(&lock);
    return 0;
}
How to make this loop infinite?
for (i = 0; i < MAX_THREADS; i++) {
    ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
    if (ret != 0)
        printf("\ncan't create thread :[%s]", strerror(ret));
}
I need something like this:
while (1) {
    if (thread_number > MAX_THREADS_NUMBER)
        sleep(1);
    ret = pthread_create(...);
    if (ret != 0)
        printf("\ncan't create thread :[%s]", strerror(ret));
}
Your current program is based on a simple dispatch design: the initial thread creates worker threads, assigning each one a task to perform. Your question is how to make this work for any number of tasks and any number of worker threads. The answer is, you don't: your chosen design makes such a modification basically impossible.
Even if I were to answer your stated questions, it would not make the program behave the way you'd like. It might work after a fashion, but it'd be like a bicycle with square wheels: not very practical, nor robust -- not even fun after you stop laughing at how silly it looks.
The solution, as I wrote in a comment to the original question, is to change the underlying design: from a simple dispatch to a thread pool approach.
Implementing a thread pool requires two things: First, is to change your viewpoint from starting a thread and having it perform a task, to each thread in the "pool" grabbing a task to perform, and returning to the "pool" after they have performed it. Understanding this is the hard part. The second part, implementing a way for each thread to grab a new task, is simple: this typically centers around a data structure, protected with locks of some sort. The exact data structure does depend on what the actual work to do is, however.
Let's assume you wanted to parallelize the calculation of the Mandelbrot set (or rather, the escape time, or the number of iterations needed before a point can be ruled to be outside the set; the Wikipedia page contains pseudocode for exactly this). This is one of the "embarrassingly parallel" problems; those where the sub-problems (here, each point) can be solved without any dependencies.
Here's how I'd do the core of the thread pool in this case. First, the escape time or iteration count needs to be recorded for each point. Let's say we use an unsigned int for this. We also need the number of points (it is a 2D array), a way to calculate the complex number that corresponds to each point, plus some way to know which points have either been computed, or are being computed. Plus mutually exclusive locking, so that only one thread will modify the data structure at once. So:
typedef struct {
    int             x_size, y_size;
    size_t          stride;
    double          r_0, i_0;
    double          r_dx, i_dx;
    double          r_dy, i_dy;
    unsigned int   *iterations;
    sem_t           done;
    pthread_mutex_t mutex;
    int             x, y;
} fractal_work;
When an instance of fractal_work is constructed, x_size and y_size are the number of columns and rows in the iterations map. The number of iterations (or escape time) for point x,y is stored in iterations[x+y*stride]. The real part of the complex coordinate for that point is r_0 + x*r_dx + y*r_dy, and imaginary part is i_0 + x*i_dx + y*i_dy (which allows you to scale and rotate the fractal freely).
When a thread grabs the next available point, it first locks the mutex, and copies the x and y values (for itself to work on). Then, it increases x. If x >= x_size, it resets x to zero, and increases y. Finally, it unlocks the mutex, and calculates the escape time for that point.
However, if x == 0 && y >= y_size, the thread posts on the done semaphore and exits, letting the initial thread know that the fractal is complete. (The initial thread just needs to call sem_wait() once for each thread it created.)
The thread worker function is then something like the following:
void *fractal_worker(void *data)
{
    fractal_work *const work = (fractal_work *)data;
    int x, y;

    while (1) {
        pthread_mutex_lock(&(work->mutex));

        /* No more work to do? */
        if (work->x == 0 && work->y >= work->y_size) {
            sem_post(&(work->done));
            pthread_mutex_unlock(&(work->mutex));
            return NULL;
        }

        /* Grab this task (point), advance to next. */
        x = work->x;
        y = work->y;
        if (++(work->x) >= work->x_size) {
            work->x = 0;
            ++(work->y);
        }

        pthread_mutex_unlock(&(work->mutex));

        /* z.r = work->r_0 + (double)x * work->r_dx + (double)y * work->r_dy;
           z.i = work->i_0 + (double)x * work->i_dx + (double)y * work->i_dy;

           TODO: implement the fractal iteration,
                 and count the iterations (say, n)

           save the escape time (number of iterations)
           in the work->iterations array; e.g.
               work->iterations[(size_t)x + work->stride*(size_t)y] = n;
        */
    }
}
The program first creates the fractal_work data structure for the worker threads to work on, initializes it, then creates some number of threads giving each thread the address of that fractal_work structure. It can then call fractal_worker() itself too, to "join the thread pool". (This pool automatically "drains", i.e. threads will return/exit, when all points in the fractal are done.)
Finally, the main thread calls sem_wait() on the done semaphore, as many times as it created worker threads, to ensure all the work is done.
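As a rough illustration of that setup (not part of the original answer), here is a minimal sketch of how the initial thread might initialize fractal_work, spawn the pool and wait on the done semaphore. The image size, coordinate mapping and thread count are placeholder values; the fractal_work type and fractal_worker() function are the ones defined above:
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

#define NUM_THREADS 4

int main(void)
{
    fractal_work work = {
        .x_size = 800, .y_size = 600, .stride = 800,
        .r_0 = -2.5, .i_0 = -1.0,          /* placeholder coordinate mapping */
        .r_dx = 3.5 / 800, .i_dx = 0.0,
        .r_dy = 0.0, .i_dy = 2.0 / 600,
        .x = 0, .y = 0
    };
    pthread_t threads[NUM_THREADS];

    work.iterations = calloc((size_t)work.x_size * work.y_size, sizeof *work.iterations);
    sem_init(&work.done, 0, 0);
    pthread_mutex_init(&work.mutex, NULL);

    /* Spawn the pool; every thread runs the same worker function */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, fractal_worker, &work);

    /* Optionally join the pool ourselves: fractal_worker(&work); */

    /* Wait until every worker has posted the 'done' semaphore */
    for (int i = 0; i < NUM_THREADS; i++)
        sem_wait(&work.done);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    /* ... use work.iterations ... */
    free(work.iterations);
    return 0;
}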
The exact fields in the fractal_work structure above do not matter. However, it is at the very core of the thread pool. Typically, there is at least one mutex or rwlock protecting the work details, so that each worker thread gets unique work details, as well as some kind of flag or condition variable or semaphore to let the original thread know that the task is now complete.
In a multithreaded server, there is usually only one instance of the structure (or variables) describing the work queue. It may even contain things like the minimum and maximum number of threads, allowing the worker threads to control their own number so as to respond dynamically to the amount of work available. This sounds magical, but is actually simple to implement: when a thread has completed its work, or is woken up in the pool with no work, and is holding the mutex, it first examines how many queued jobs there are and what the current number of worker threads is. If there are more than the minimum number of threads and no work to do, the thread reduces the number of threads and exits. If there are fewer than the maximum number of threads and there is a lot of work to do, the thread first creates a new thread, then grabs the next task to work on. (Yes, any thread can create new threads into the process. They are all on an equal footing, too.)
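A hedged sketch of that self-sizing bookkeeping, using an invented work_queue structure (none of these names come from the answer above), might look like this for the check a worker performs while holding the mutex:
#include <pthread.h>

/* Hypothetical bookkeeping structure for a self-sizing pool */
typedef struct {
    pthread_mutex_t mutex;
    int pending_jobs;   /* jobs waiting in the queue        */
    int num_threads;    /* current number of worker threads */
    int min_threads;    /* never shrink below this          */
    int max_threads;    /* never grow beyond this           */
} work_queue;

/* Called by a worker that already holds queue->mutex and has just finished
   a job (or was woken up with nothing to do). Returns -1 if the calling
   thread should unlock the mutex and exit, 0 if it should grab the next job. */
static int adjust_pool_size(work_queue *queue, void *(*worker)(void *))
{
    pthread_t tid;

    if (queue->pending_jobs == 0 && queue->num_threads > queue->min_threads) {
        queue->num_threads--;            /* no work: shrink the pool and exit */
        return -1;
    }
    if (queue->pending_jobs > queue->num_threads &&
        queue->num_threads < queue->max_threads) {
        if (pthread_create(&tid, NULL, worker, queue) == 0) {
            pthread_detach(tid);
            queue->num_threads++;        /* lots of work: grow the pool */
        }
    }
    return 0;                            /* caller grabs the next job */
}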
A lot of the code in a practical multithreaded application using one or more thread pools to do work is some sort of bookkeeping. The thread pool approach very much concentrates on the data, and on the computation that needs to be performed on the data. I'm sure there are much better examples of thread pools out there somewhere; the hard part is to think of a good task for the application to perform, as the data structures are so task-dependent, and many computations are so simple that parallelizing them makes no sense (since creating new threads does have a small computational cost, it'd be silly to waste time creating threads when a single thread does the same work in the same or less time).
Many tasks that benefit from parallelization, on the other hand, require information to be shared between workers, and that requires a lot of thinking to implement correctly. (For example, although solutions exist for parallelizing molecular dynamics simulations efficiently, most simulators still calculate and exchange data in separate steps, rather than at the same time. It's just that hard to do right, you see.)
All this means that you cannot expect to be able to write the code unless you understand the concept. Indeed, truly understanding the concepts is the hard part: writing the code is comparatively easy.
Even in the above example, there are certain tripping points: Does the order of posting the semaphore and releasing the mutex matter? (Well, it depends on what the thread that is waiting for the fractal to complete does -- and indeed, if it is waiting yet.) If it was a condition variable instead of a semaphore, it would be essential that the thread that is interested in the fractal completion is waiting on the condition variable, otherwise it would miss the signal/broadcast. (This is also why I used a semaphore.)

CHIP8 in C - How to correctly handle the delay timer?

TL;DR I need to emulate a timer in C that allows concurrent writes and reads, whilst preserving constant decrements at 60 Hz (not exactly, but approximately accurate). It will be part of a Linux CHIP8 emulator. Using a thread-based approach with shared memory and semaphores raises some accuracy problems, as well as race conditions depending on how the main thread uses the timer.
Which is the best way to devise and implement such a timer?
I am writing a Linux CHIP8 interpreter in C, module by module, in order to dive into the world of emulation.
I want my implementation to be as accurate as possible with the specifications. In that matter, timers have proven to be the most difficult modules for me.
Take for instance the delay timer. In the specifications, it is a "special" register, initially set to 0. There are specific opcodes that set a value into the register and get the value from it.
If a value different from zero is entered into the register, it will automatically start decrementing itself, at a frequency of 60 Hz, stopping once zero is reached.
My idea regarding its implementation consists of the following:
The use of an ancillary thread that does the decrementing automatically, at a frequency of nearly 60 Hz by using nanosleep(). I use fork() to create the thread for the time being.
The use of shared memory via mmap() in order to allocate the timer register and store its value on it. This approach allows both the ancillary and the main thread to read from and write to the register.
The use of a semaphore to synchronise the access for both threads. I use sem_open() to create it, and sem_wait() and sem_post() to lock and unlock the shared resource, respectively.
The following code snippet illustrates the concept:
void *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
/* Error checking here */

sem_t *mutex = sem_open("/timersem", O_CREAT, 0644, 1);
/* Error checking and unlinking */

int *val = (int *) p;
*val = 120; // 2-second delay

pid_t pid = fork();

if (pid == 0) {
    // Child process
    while (*val > 0) {       // Possible race condition
        sem_wait(mutex);     // Possible loss of frequency depending on main thread code
        --(*val);            // Safe access
        sem_post(mutex);
        /* The nanosleep() goes here */
    }
} else if (pid > 0) {
    // Parent process
    if (*val == 10) {        // Possible race condition
        sem_wait(mutex);
        *val = 50;           // Safe access
        sem_post(mutex);
    }
}
A potential problem I see with such an implementation lies in the third point. If a program happens to update the timer register once it has reached a value different from zero, then the ancillary thread must not wait for the main thread to unlock the resource, or else the 60 Hz rate will not be met. This implies both threads may freely update and/or read the register (constant writes in the case of the ancillary thread), which obviously introduces race conditions.
Once I have explained what I am doing and what I try to achieve, my question is this:
Which is the best way to devise and emulate a timer that allows concurrent writes and reads, whilst preserving an acceptable fixed frequency?
Don't use threads and synchronization primitives (semaphores, shared memory, etc) for this. In fact, I'd go as far as to say: don't use threads for anything unless you explicitly need multi-processor concurrency. Synchronization is difficult to get right, and even more difficult to debug when you get it wrong.
Instead, figure out a way to implement this in a single thread. I'd recommend one of two approaches:
Keep track of the time the last value was written to the timer register. When reading from the register, calculate how long ago it was written, and subtract an appropriate value from the result (a sketch of this approach follows after this list).
Keep track of how many instructions are being executed overall, and subtract 1 from the timer register every N instructions, where N is a large number such that N instructions is about 1/60 second.
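A minimal single-threaded sketch of the first approach (the delay_timer type, the function names and the use of clock_gettime() are my own choices for illustration, not part of the answer):
#include <stdint.h>
#include <time.h>

typedef struct {
    uint8_t         start_value;   /* value last written by the program */
    struct timespec start_time;    /* when that value was written       */
} delay_timer;

/* Called when the program sets the delay timer (opcode FX15) */
static void timer_write(delay_timer *t, uint8_t value)
{
    t->start_value = value;
    clock_gettime(CLOCK_MONOTONIC, &t->start_time);
}

/* Called when the program reads the delay timer (opcode FX07) */
static uint8_t timer_read(const delay_timer *t)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    /* Elapsed wall-clock time since the last write, converted to 60 Hz ticks */
    double elapsed = (now.tv_sec - t->start_time.tv_sec)
                   + (now.tv_nsec - t->start_time.tv_nsec) / 1e9;
    long ticks = (long)(elapsed * 60.0);

    return (ticks >= t->start_value) ? 0 : (uint8_t)(t->start_value - ticks);
}
No second thread and no synchronization are needed: the register value is simply derived from the timestamp on demand.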

How to call same rank in MPI using C language?

I am trying to learn MPI programming and have written the following program. It adds an entire row of an array and outputs the sum. At rank 0 (or process 0), it will call all its slave ranks to do the calculation. I want to do this using only two other slave ranks/processes. Whenever I try to invoke the same rank twice, as shown in the code below, my code just hangs in the middle and doesn't execute. If I don't call the same rank twice, the code works correctly.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int tag2 = 1;
int arr[30] = {0};
MPI_Request request;
MPI_Status status;
printf ("\n--Current Rank: %d\n", world_rank);
int index;
int source = 0;
int dest;
if (world_rank == 0)
{
int i;
printf("* Rank 0 excecuting\n");
index = 0;
dest = 1;
for ( i = 0; i < 30; i++ )
{
arr[ i ] = i + 1;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
index = 0;
dest = 2;
for ( i = 0; i < 30; i++ )
{
arr[ i ] = 0;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
index = 0;
dest = 2; //Problem happens here when I try to call the same destination(or rank 2) twice
//If I change this dest value to 3 and run using: mpirun -np 4 test, this code will work correctly
for ( i = 0; i < 30; i++ )
{
arr[ i ] = 1;
}
MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
}
else
{
int sum = 0;
int i;
MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i<30; i++)
{
sum = arr[i]+sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
}
MPI_Finalize();
}
Result when using: mpirun -np 3 test
--Current Rank: 2
Sum is: 0 at rank: 2
--Current Rank: 0
* Rank 0 excecuting
--Current Rank: 1
Sum is: 524800 at rank: 1
//Program hangs here and wouldn't show sum of 30
Please let me know how I can call the same rank twice. For example, if I only had two other slave processes that I can call.
Please show with an example if possible.
In MPI, each process executes the same code and, as you're doing, you differentiate the different processes primarily through checking rank in if/else statements. The master process with rank 0 is doing 3 sends: a send to process 1, then two sends to process 2. The slave processes each do only one receive, which means that rank 1 receives its first message and rank 2 receives its first message. When you call the third MPI_Send on process 0, there is not and will not be any slave waiting to receive the message after that point, as the slaves have finished executing their else block. The program gets blocked as the master waits to send the final message.
In order to fix that, you have to make sure the slave with rank 2 performs two receives, either by adding a loop for that process only, or by repeating for that process only (i.e., inside an if(world_rank == 2) check) the following block of code (a sketch of the resulting else branch follows after the block):
sum = 0; // resetting sum
MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i<30; i++)
{
    sum = arr[i]+sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
TL;DR: Just a remark that the master/slave approach isn't to be favoured and might durably shape the programmer's mind, leading to poor code when put in production
Although Clarissa is perfectly correct and her answer is very clear, I'd like to add a few general remarks, not about the code itself, but about parallel computing philosophy and good habits.
First a quick preamble: when one wants to parallelise one's code, it can be for two main reasons: making it faster and/or permitting it to handle larger problems by overcoming the limitations (like memory limitation) found on a single machine. But in all cases, performance matters and I will always assume that MPI (or generally speaking parallel) programmers are interested and concerned by performance. So the rest of my post will suppose that you are.
Now the main reason of this post: over the past few days, I've seen a few questions here on SO about MPI and parallelisation, obviously coming from people eager to learn MPI (or OpenMP for that matter). This is great! Parallel programming is great and there'll never be enough parallel programmers. So I'm happy (and I'm sure many SO members are too) to answer questions helping people to learn how to program in parallel. And in the context of learning how to program in parallel, you have to write some simple codes, doing simple things, in order to understand what the API does and how it works. These programs might look stupid from a distance and very ineffective, but that's fine, that's how learning works. Everybody learned this way.
However, you have to keep in mind that these programs you write are only that: API learning exercises. They're not the real thing and they do not reflect the philosophy of what an actual parallel program is or should be. And what drives my answer here is that I've seen here and in other questions and answers put forward recurrently the status of "master" process and "slaves" ones. And that is wrong, fundamentally wrong! Let me explain why:
As Clarissa perfectly pinpointed, "in MPI, each process executes the same code". The idea is to find a way of making several processes to interact for working together in solving a (possibly larger) problem (hopefully faster). But amongst these processes, none gets any special status, they are all equal. They are given an id to be able to address them, but rank 0 is no better than rank 1 or rank 1025... By artificially deciding that process #0 is the "master" and the others are its "slaves", you break this symmetry and it has consequences:
Now that rank #0 is the master, it commands, right? That's what a master does. So it will be the one getting the information necessary to run the code, will distribute its share to the workers, will instruct them to do the processing. Then it will wait for the processing to be concluded (possibly getting itself busy in-between, but more likely just waiting or poking the workers, since that's what a master does), collect the results, do a bit of reassembling and output them. Job done! What's wrong with that?
Well, the following are wrong:
During the time the master gets the data, slaves are sitting idle. This is sequential and ineffective processing...
Then the distribution of the data and the work to do implies a lot of transfers. This takes time and since it is solely between process #0 and all the others, this might create a lot of congestions on the network in one single link.
While workers do their work, what should the master do? Working as well? If yes, then it might not be readily available to handle requests from the slaves when they come, delaying the whole parallel processing. Waiting for these requests? Then it wastes a lot of computing power by sitting idle... Ultimately, there is no good answer.
Then points 1 and 2 are repeated in reverse order, with the gathering of the results and the outputting of the results. That's a lot of data transfers and sequential processing, which will badly damage the global scalability, effectiveness and performance.
So I hope you now see why the master/slave approach is (usually, not always but very often) wrong. And the danger I see from the questions and answers I've read over the past days is that you might get your mind formatted in this approach as if it were the "normal" way of thinking in parallel. Well, it is not! Parallel programming is symmetry. It is handling the problem globally, in all places at the same time. You have to think parallel from the start and see your code as a global parallel entity, not just a bunch of processes that need to be instructed on what to do. Each process is its own master, dealing with its peers on an equal footing. Each process should (as much as possible) acquire its data by itself (making it parallel processing); decide what to do based on the number of peers involved in the processing and its id; exchange information with its peers when necessary, be it locally (point-to-point communications) or globally (collective communications); and issue its own share of the result (again leading to parallel processing)...
OK, that's a bit of an extreme requirement for people just starting to learn parallel programming, and I by no means want to tell you that your learning exercises should be like this. But keep the goal in mind and don't forget that API learning examples are only API learning examples, not reduced models of actual codes. So keep on experimenting with MPI calls to understand what they do and how they work, but try to slowly tend towards a symmetrical approach in your examples. That can only be beneficial for you in the long term.
Sorry for this lengthy and somewhat off topic answer, and good luck with your parallel programming.

4 Process 4 way synchronization using semaphores (In a C Programming, UNIX environment)

I have a question about synchronizing 4 processes in a UNIX environment. It is very important that no process runs their main functionality without first waiting for the others to "be on the same page", so to speak.
Specifically, they should all not go into their loops without first synchronizing with each other. How do I synchronize 4 processes in a 4 way situation, so that none of them get into their first while loop without first waiting for the others? Note that this is mainly a logic problem, not a coding problem.
To keep things consistent between environments let's just say we have a pseudocode semaphore library with the operations semaphore_create(int systemID), semaphore_open(int semaID), semaphore_wait(int semaID), and semaphore_signal(int semaID).
Here is my attempt and subsequent thoughts:
Process1.c:
int main() {
    //Synchronization area (relevant stuff):
    int sem1 = semaphore_create(123456); //123456 is an arbitrary ID for the semaphore.
    int sem2 = semaphore_create(78901);  //78901 is an arbitrary ID for the semaphore.

    semaphore_signal(sem1);
    semaphore_wait(sem2);

    while(true) {
        //...do main functionality of process, etc (not really relevant)...
    }
}
Process2.c:
int main() {
    //Synchronization area (relevant stuff):
    int sem1 = semaphore_open(123456);
    int sem2 = semaphore_open(78901);

    semaphore_signal(sem1);
    semaphore_wait(sem2);

    while(true) {
        //...do main functionality of process etc...
    }
}
Process3.c:
int main() {
    //Synchronization area (relevant stuff):
    int sem1 = semaphore_open(123456);
    int sem2 = semaphore_open(78901);

    semaphore_signal(sem1);
    semaphore_wait(sem2);

    while(true) {
        //...do main functionality of process etc...
    }
}
Process4.c:
int main() {
    //Synchronization area (relevant stuff):
    int sem1 = semaphore_open(123456);
    int sem2 = semaphore_open(78901);

    semaphore_signal(sem2);
    semaphore_signal(sem2);
    semaphore_signal(sem2);

    semaphore_wait(sem1);
    semaphore_wait(sem1);
    semaphore_wait(sem1);

    while(true) {
        //...do main functionality of process etc...
    }
}
We run Process1 first, and it creates all of the semaphores in system memory that are used in the other processes (the other processes simply call semaphore_open to gain access to those semaphores). Then, all 4 processes have a signal operation and then a wait. The signal operation causes process1, process2, and process3 to increment the value of sem1 by 1, so its resulting maximum value is 3 (depending on what order the operating system decides to run these processes in). Process1, 2, and 3 are all then waiting on sem2, and process4 is waiting on sem1 as well. Process4 then signals sem2 3 times to bring its value back up to 0, and waits on sem1 3 times. Since sem1 was at a maximum of 3 from the signalling in the other processes (depending on what order they ran in, again), this brings its value back down to 0, and process4 continues running. Thus, all processes will be synchronized.
So yea, not super confident on my answer. I feel that it depends heavily on what order the processes ran in, which is the whole point of synchronization -- that it shouldn't matter what order they run in, they all synchronize correctly. Also, I am doing a lot of work in Process4. Maybe it would be better to solve this using more than 2 semaphores? Wouldn't this also allow for more flexibility within the loops in each process, if I want to do further synchronization?
My question: Please explain why the above logic will or will not work, and/or a solution on how to solve this problem of 4 way synchronization. I'd imagine this is a very common thing to have to think about depending on the industry (eg. banking and synching up bank accounts). I know it is not very difficult, but I have never worked with semaphores before, so I'm kind of confused on how they work.
The precise semantics of your model semaphore library are not clear enough to answer your question definitively. However, if the difference between semaphore_create() and semaphore_open() is that the latter requires the specified semaphore to already exist, whereas the former requires it to not exist, then yes, the whole thing will fall down if process1 does not manage to create the needed semaphores before any of the other processes attempt to open them. (Probably it falls down in different ways if other semantics hold.)
That sort of issue can be avoided in a threading scenario because with threads there is necessarily an initial single-threaded segment wherein the synchronization structures can be initialized. There is also shared memory by which the various threads can communicate with one another. The answer @Dark referred to depends on those characteristics.
The essential problem with a barrier for multiple independent processes -- or for threads that cannot communicate via shared memory and that are not initially synchronized -- is that you cannot know which process needs to erect the barrier. It follows that each one needs to be prepared to do so. That can work in your model library if semaphore_create() can indicate to the caller which result was achieved, one of
semaphore successfully created
semaphore already exists
(or error)
In that case, all participating processes (whose number you must know) can execute the same procedure, maybe something like this:
void process_barrier(int process_count) {
    sem_t *sem1, *sem2, *sem3;
    int result = semaphore_create(123456, &sem1);
    int counter;

    switch (result) {
        case SEM_SUCCESS:
            /* I am the controlling process */
            /* Finish setting up the barrier */
            semaphore_create(78901, &sem2);
            semaphore_create(23432, &sem3);

            /* let (n - 1) other processes enter the barrier... */
            for (counter = 1; counter < process_count; counter += 1) {
                semaphore_signal(sem1);
            }
            /* ... and wait for those (n - 1) processes to do so */
            for (counter = 1; counter < process_count; counter += 1) {
                semaphore_wait(sem2);
            }
            /* let all the (n - 1) waiting processes loose */
            for (counter = 1; counter < process_count; counter += 1) {
                semaphore_signal(sem3);
            }
            /* and I get to continue, too */
            break;
        case SEM_EXISTS_ERROR:
            /* I am NOT the controlling process */
            semaphore_open(123456, &sem1);
            /* wait, if necessary, for the barrier to be initialized */
            semaphore_wait(sem1);
            semaphore_open(78901, &sem2);
            semaphore_open(23432, &sem3);
            /* signal the controlling process that I have reached the barrier */
            semaphore_signal(sem2);
            /* wait for the controlling process to allow me to continue */
            semaphore_wait(sem3);
            break;
    }
}
Obviously, I have taken some minor liberties with your library interface, and I have omitted error checks except where they bear directly on the barrier's operation.
The three semaphores involved in that example serve distinct, well-defined purposes. sem1 guards the initialization of the synchronization constructs and allows the processes to choose which among them takes responsibility for controlling the barrier. sem2 serves to count how many processes have reached the barrier. sem3 blocks the non-controlling processes that have reached the barrier until the controlling process releases them all.

What will happen when root in MPI_Bcast is random

Usually when we call MPI_Bcast, the root must be decided. But now I have an application that does not care who broadcasts the message. This means the node that will broadcast the message is random, but once the message is broadcast, a global variable must be consistent. My understanding is that since MPI_Bcast is a collective function, all nodes need to call it but the order may differ. So whoever arrives at MPI_Bcast first broadcasts the message to the others. I ran the following code with 3 nodes; I thought that if node 1 (rank==1) arrives at MPI_Bcast first, it will send its local_count value (1) to the other nodes, and then all nodes update global_count with the same local_count, so one of my expected results is (the output order does not matter)
node 0, global count is 1
node 1, global count is 1
node 2, global count is 1
But the actual result is always (the output order does not matter):
node 1, global count is 1
node 0, global count is 0
node 2, global count is 2
This result is exactly the same as with the code without MPI_Bcast. So is there anything wrong with my understanding of MPI_Bcast, or with my code? Thanks.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_count, global_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    global_count = 0;
    local_count = rank;

    MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
    global_count += local_count;

    printf("node %d, global count is: %d\n", rank, global_count);

    MPI_Finalize();
}
The code is a simplified case. In my application, there are some computations before MPI_Bcast and I don't know who will finish the computation first. Whenever a node reaches the MPI_Bcast point, it needs to broadcast its own computation result (a local variable) and all nodes update the global variable. So all nodes need to broadcast a message but we don't know the order. How can I implement this idea?
Normally what you have written would lead to deadlock. In a typical MPI_Bcast case the process with rank root sends its data to all other processes in the communicator. You need to specify the same root rank in those receiving processes so they know whom to "listen" to. This is an oversimplified description, since usually a hierarchical broadcast is used in order to reduce the total operation time, but with three processes this hierarchical implementation reduces to a very simple linear one. In your case the process with rank 0 will try to send the same message to both processes 1 and 2. At the same time process 1 will not receive that message but will instead try to send its own to processes 0 and 2. Process 2 will also be trying to send to 0 and 1. In the end every process will be sending messages that no other process is willing to receive. This is an almost sure recipe for disaster.
Why doesn't your program hang? Because the messages being sent are very small, only one MPI_INT element, and the number of processes is also small, so those sends are all being buffered internally by the MPI library in every process - they never reach their destinations, but nevertheless the calls made internally by MPI_Bcast do not block and your code gets execution control back although the operation is still in progress. This is undefined behaviour, since the MPI library is not required to buffer anything by the standard - some implementations might buffer, others might not.
If you are trying to compute the sum of all local_count variables in all processes, then just use MPI_Allreduce. Replace this:
global_count = 0;
local_count = rank;
MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
global_count += local_count;
with this:
local_count = rank;
MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
The correct usage of MPI_Bcast is that all processes call the function with the same root. Even if you don't care about who the broadcaster is, all the processes must call the function with the same root rank to listen to the broadcaster. In your code, each process is calling MPI_Bcast with its own rank, all different from each other.
I haven't looked at the standard document, but it is likely that you trigger undefined behavior by calling MPI_Bcast with different roots.
From the Open MPI docs:
"MPI_Bcast broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of group using the same arguments for comm, root. On return, the contents of root's communication buffer has been copied to all processes."
Calling it from ranks != root might cause some problems.
A safer way is to hard-code your own broadcast function and call that.
It's basically a for loop and an MPI_Send command, and shouldn't be too difficult to implement yourself (see the sketch below).
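A rough sketch of such a hand-rolled linear broadcast (the function name and tag are invented; this is only an illustration of the idea):
#include <mpi.h>

// Linear "broadcast": the chosen root sends to everyone else,
// all other ranks receive from that root.
void my_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    const int tag = 999; // arbitrary tag
    int rank, size;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int i = 0; i < size; i++)
            if (i != root)
                MPI_Send(buf, count, type, i, tag, comm);
    } else {
        MPI_Recv(buf, count, type, root, tag, comm, MPI_STATUS_IGNORE);
    }
}
Note that, just like MPI_Bcast, this still requires every rank to agree on who the root is; for the original goal of combining values regardless of which process finishes first, the MPI_Allreduce approach shown above remains the cleaner solution.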
