Wait until slaves have called MPI_Finalize - C

I have a problem with the following codes:
Master:
#include <iostream>
using namespace std;
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define PB1 1
#define PB2 1
int main (int argc, char *argv[])
{
    int np[2] = { 2, 1 }, errcodes[2];
    MPI_Comm parentcomm, intercomm;
    char *cmds[2] = { "./slave", "./slave" };
    MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Init(NULL, NULL);
#if PB1
    for(int i = 0 ; i<2 ; i++)
    {
        MPI_Info_create(&infos[i]);
        char hostname[] = "localhost";
        MPI_Info_set(infos[i], "host", hostname);
    }
#endif
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, np, infos, 0, MPI_COMM_WORLD, &intercomm, errcodes);
    printf("c Creation of the workers finished\n");
#if PB2
    sleep(1);
#endif
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, np, infos, 0, MPI_COMM_WORLD, &intercomm, errcodes);
    printf("c Creation of the workers finished\n");
    MPI_Finalize();
    return 0;
}
Slave:
#include "mpi.h"
#include <stdio.h>
using namespace std;
int main( int argc, char *argv[])
{
    int rank;
    MPI_Init(0, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank = %d\n", rank);
    MPI_Finalize();
    return 0;
}
I do not know why, when I run mpirun -np 1 ./master, my program stops with the following message when I set PB1 and PB2 to 1 (it works well when I set one of them to 0):
There are not enough slots available in the system to satisfy the 2 slots that were requested by the application:
./slave
Either request fewer slots for your application, or make more slots available for use.
For instance, when I set PB2 to 0, the program works well. Thus, I suppose that it is because MPI_Finalize does not finish its job ...
I googled, but I did not find any answer to my problem. I tried various things, such as calling MPI_Comm_disconnect and adding a barrier, but nothing worked.
I work on Ubuntu 15.10 and use OpenMPI version 1.10.2.

MPI_Finalize on the first set of slaves will not complete until MPI_Finalize is called on the master: MPI_Finalize is collective over all connected processes. You can work around that by manually disconnecting the first batch of slaves from the intercommunicator before calling MPI_Finalize. That way the slaves actually finish and exit, freeing the "slots" for the new batch of slaves. Unfortunately, I don't see a standardized way to really ensure the slaves are finished in the sense that their slots are freed, because that would be implementation-defined. The fact that OpenMPI freezes in MPI_Comm_spawn_multiple instead of returning an error is unfortunate, and one might consider that a bug. Anyway, here is a draft of what you could do:
Within the master, each time it is done with its slaves:
MPI_Barrier(intercomm); // Make sure master and slaves are somewhat synchronized
MPI_Comm_disconnect(&intercomm);
sleep(1); // This is the ugly, unreliable way to give the slaves some time to shut down
The slave:
MPI_Comm parent;
MPI_Comm_get_parent(&parent); // you should have that already
MPI_Comm_disconnect(&parent);
MPI_Finalize();
However, you still need to make sure OpenMPI knows how many slots should be reserved for the whole application (universe_size). You can do that for example with a hostfile:
localhost slots=4
And then mpirun -np 1 ./master.
Now this is not pretty, and I would argue that your approach of dynamically spawning MPI workers isn't really what MPI is meant for. It may be supported by the standard, but that doesn't help you if implementations are struggling. However, there is not enough information on how you intend to communicate with the external processes to provide a cleaner, more idiomatic solution.
One last remark: do check the return codes of MPI functions, especially MPI_Comm_spawn_multiple.
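As a hedged sketch of such a check (MPI_ERRORS_RETURN is needed because the default error handler aborts before you can inspect anything; note also that np = {2, 1} spawns three processes in total, so errcodes needs room for three entries):
/* Sketch: surface spawn errors instead of aborting on them. */
int rc, errcodes[3]; /* np[0] + np[1] = 3 spawned processes */
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
rc = MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, np, infos, 0,
                             MPI_COMM_WORLD, &intercomm, errcodes);
if (rc != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(rc, msg, &len);
    fprintf(stderr, "spawn failed: %s\n", msg);
}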

Related

Linux: FIFO scheduler isn't working as expected

I am currently trying to work with Linux FIFO schedulers.
I want to run two processes: process A and process B with the same priority in a FIFO way.
To do this, I have made a shell script in which I run process A first, followed by process B. In FIFO scheduling, process B should start its execution only after the completion of process A, i.e., there should be no overlap between the executions of these processes. But this isn't what is happening.
I am actually observing that both processes run in an overlapping fashion, i.e., the print statements from both processes appear interleaved.
Here is the code of the shell script.
gcc -o process_a process_a.c
gcc -o process_b process_b.c
sudo taskset --cpu-list 0 chrt -f 50 ./process_a &
sleep 0.1
sudo taskset --cpu-list 0 chrt -f 50 ./process_b &
sleep 30
exit
To make sure that both processes run on the same CPU, I have used the taskset command. I am also using the chrt command to set the scheduler.
Here is the code for process_a.c
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <sched.h>
int main(int argc, char *argv[])
{
    printf("Process A begins!\n");
    fflush(stdout);
    long long int i=0, m = 1e8;
    while(i<2e10){
        if(i%m == 0){
            printf("Process A running\n");
            fflush(stdout);
        }
        i++;
    }
    printf("Process A ended \n");
    fflush(stdout);
    return 0;
}
Here is the code for process_b.c
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <sched.h>
int main(int argc, char *argv[])
{
    printf("Process B begins!\n");
    fflush(stdout);
    long long int i=0, m = 1e8;
    while(i<2e10){
        if(i%m == 0){
            printf("Process B running\n");
            fflush(stdout);
        }
        i++;
    }
    printf("Process B ended \n");
    fflush(stdout);
    return 0;
}
Please help me understand why this is happening.
Many thanks in advance.
The sched(7) manual page says this:
A SCHED_FIFO thread runs until either it is blocked by an I/O request, it is preempted by a higher priority thread, or it calls sched_yield(2).
In this case, you're performing an I/O request to the terminal via printf, which calls write under the hood. Your file descriptor is in blocking mode, so at least some blocking likely occurs here: I/O is generally slow, especially to terminals, and while one process is blocked, the other gets to run.
If you wanted a better example of one process preventing the other from running, you'd want to do something that never blocks, such as writing into a shared memory segment instead.
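A minimal sketch of that idea, assuming a made-up segment name "/fifo_demo" and marker layout; the loop records progress without ever issuing a blocking terminal write (link with -lrt on older glibc):
/* Sketch: log progress into shared memory instead of the terminal,
 * so the SCHED_FIFO task never blocks on an I/O request. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
int main(void)
{
    int fd = shm_open("/fifo_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    long long i = 0, m = 100000000LL;    /* 1e8 */
    while (i < 20000000000LL) {          /* 2e10 */
        if (i % m == 0)
            buf[i / m] = 'A';            /* progress marker, no system call */
        i++;
    }
    munmap(buf, 4096);
    return 0;
}
A normal-priority process can then dump the buffer afterwards to see which process ran when.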

Binary semaphore to maintain concurrency

I was trying to implement a multi-threaded program using binary semaphore. Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>      /* for O_CREAT */
#include <semaphore.h>
int g = 0;
sem_t *semaphore;
void *myThreadFun(void *vargp)
{
    int myid = (int)vargp;
    static int s = 0;
    sem_wait(semaphore);
    ++s; ++g;
    printf("Thread ID: %d, Static: %d, Global: %d\n", myid, s, g);
    fflush(stdout);
    sem_post(semaphore);
    pthread_exit(0);
}
int main()
{
    int i;
    pthread_t tid;
    if ((semaphore = sem_open("/semaphore", O_CREAT, 0644, 3)) == SEM_FAILED) {
        printf("semaphore initialization failed\n");
    }
    for (i = 0; i < 3; i++) {
        pthread_create(&tid, NULL, myThreadFun, (void *)i);
    }
    pthread_exit(NULL);
    return 0;
}
Now, when I opened the semaphore, I made the count 3. I was expecting that this wouldn't work, and I would get a race condition, because each thread is now capable of decrementing the count.
Is there something wrong with the implementation? Also, if I make the count 0 during sem_open, wouldn't that initiate a deadlock condition, because all the threads should be blocked on sem_wait?
Now, when I opened the semaphore, I made the count 3. I was expecting that this wouldn't work, and I would get a race condition, because each thread is now capable of decrementing the count.
And how do you judge that there isn't any race? Observing output consistent with what you could rely upon in the absence of a data race in no way proves that there isn't a data race. It merely fails to provide any evidence of one.
However, you seem to be suggesting that there would be a data race inherent in more than one thread concurrently performing a sem_wait() on a semaphore whose value is initially greater than 1 (otherwise which counter are you talking about?). But that's utter nonsense. You're talking about a semaphore. It's a synchronization object. Such objects and the functions that manipulate them are the basis for thread synchronization. They themselves are either completely thread safe or terminally buggy.
Now, you are correct that you open the semaphore with an initial count sufficient to avoid any of your threads blocking in sem_wait(), and that therefore they can all run concurrently in the whole body of myThreadFun(). You have not established, however, that they in fact do run concurrently. There are several reasons why they might not. If they do run concurrently, then the incrementing of the shared variables s and g is indeed of concern, but again, even if you see no signs of a data race, that doesn't mean there isn't one.
Everything else aside, the fact that your threads all call sem_wait(), sem_post(), and printf() induces some synchronization in the form of memory barriers, which would reduce the likelihood of observing anomalous effects on s and g. sem_wait() and sem_post() must contain memory barriers in order to function correctly, regardless of the semaphore's current count. printf() calls are required to use locking to protect the state of the stream from corruption in multi-threaded programs, and it is reasonable to suppose that this will require a memory barrier.
Is there something wrong with the implementation?
Yes. It is not properly synchronized. Initialize the semaphore with count 1 so that the modifications of s and g occur only while exactly one thread has the semaphore locked.
Also, if I make the count 0 during sem_open, wouldn't that initiate a deadlock condition, because all the threads should be blocked on sem_wait?
If the semaphore has count 0 before any of the additional threads are started, then yes. It is therefore inappropriate to open the semaphore with count 0, unless you also subsequently post to it before starting the threads. But you are using a named semaphore. These persist until removed, and you never remove it. The count you specify to sem_open() has no effect unless a new semaphore needs to be created; when an existing semaphore is opened, its count is unchanged.
Also, do have the main thread join all the others before it terminates. It's not inherently wrong not to do so, but in most cases it's required for the semantics you want.
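Put together, a minimal sketch of those corrections (count 1, join before exit; unlinking first is one way to avoid inheriting a stale count, and <stdint.h> is assumed for intptr_t):
/* Sketch: count 1 for mutual exclusion, join all threads, unlink the name. */
pthread_t tids[3];
sem_unlink("/semaphore");           /* discard any stale instance first */
semaphore = sem_open("/semaphore", O_CREAT, 0644, 1);
for (int i = 0; i < 3; i++)
    pthread_create(&tids[i], NULL, myThreadFun, (void *)(intptr_t)i);
for (int i = 0; i < 3; i++)
    pthread_join(tids[i], NULL);
sem_close(semaphore);
sem_unlink("/semaphore");           /* remove it so the next run starts fresh */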
I'll get to the code that proves you had a race condition in a bit. I'll add a couple of different ways to trigger it so you can see how this works. I'm doing this on Linux and passing -std=gnu99 as a param to gcc, i.e.
gcc -Wall -pedantic -std=gnu99 semaphore.c -o semtex -lpthread
Note: in your original example (assuming Linux), one fatal mistake you made was not removing the semaphore. If you run the following command you might see some of these sitting around on your machine:
ls -la /dev/shm/sem.*
You need to make sure that there are no old semaphores sitting on the filesystem before running your program, or you end up picking up the last settings from the old semaphore. You need to use sem_unlink to clean up.
To run it, please use the following:
rm /dev/shm/sem.semtex;./semtex
I'm deliberately making sure that the semaphore is not there before running, because if you hit a deadlock it can be left around, and that causes all sorts of problems when testing.
Now, when I opened the semaphore, I made the count 3. I was expecting
that this wouldn't work, and I would get a race condition, because each
thread is now capable of decrementing the count.
You had a race condition, but sometimes C is so damned fast that things can appear to work, because your program can get everything it needs done in the time the OS has allocated to the thread, i.e. the OS didn't preempt it at an important point.
This is one of those cases: the race condition is there, you just need to squint a little to see it. In the following code you can tweak some parameters to see a deadlock, correct usage, and undefined behavior.
#define INITIAL_SEMAPHORE_VALUE CORRECT
The value INITIAL_SEMAPHORE_VALUE can take three values...
#define DEADLOCK 0
#define CORRECT 1
#define INCORRECT 2
I hope they're self-explanatory. You can also use two methods to cause the race condition to blow up the program.
#define METHOD SLEEP
Set METHOD to SPIN and you can play with SPIN_COUNT to find out how many times the loop can run before you actually see a problem; this is C, it can get a lot done before it gets preempted. The code has most of the rest of the information you need in it.
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEADLOCK 0  // DEADLOCK
#define CORRECT 1   // CORRECT
#define INCORRECT 2 // INCORRECT
/*
 * Change the following values to observe what happens.
 */
#define INITIAL_SEMAPHORE_VALUE CORRECT
// The next value provides two different ways to trigger the problem, one
// using a tight loop and the other using a system call.
#define SPIN 1
#define SLEEP 2
#define METHOD SLEEP
#if (METHOD == SPIN)
/* You need to increase the SPIN_COUNT to a value that's big enough that the
 * kernel preempts the thread to see it fail. The value set here worked for me
 * in a VM but might not work for you, tweak it. */
#define SPIN_COUNT 1000000
#else
/* The reason we can use such a small time for USLEEP is because we're making
 * the kernel preempt the thread by using a system call. */
#define USLEEP_TIME 1
#endif
#define TOT_THREADS 10

static int g = 0;
static int ret = 1729;
sem_t *semaphore;

void *myThreadFun(void *vargp) {
    int myid = (int)(intptr_t)vargp;
    int w = 0;
    static int s = 0;
    if ((w = sem_wait(semaphore)) != 0) {
        fprintf(stderr, "Error: %s\n", strerror(errno));
        abort();
    }
    /* This is the interesting part... Between updating `s` and `g` we add
     * a delay using one of two methods. */
    s++;
#if (METHOD == SPIN)
    int spin = 0;
    while (spin < SPIN_COUNT) {
        spin++;
    }
#else
    usleep(USLEEP_TIME);
#endif
    g++;
    if (s != g) {
        fprintf(stderr, "Fatal Error: s != g in thread: %d, s: %d, g: %d\n", myid, s, g);
        abort();
    }
    printf("Thread ID: %d, Static: %d, Global: %d\n", myid, s, g);
    // It's a false sense of security if you think the assert will fail on a
    // race condition when you get the params to sem_open wrong. It might not
    // be detected.
    assert(s == g);
    if ((w = sem_post(semaphore)) != 0) {
        fprintf(stderr, "Error: %s\n", strerror(errno));
        abort();
    }
    return &ret;
}

int main(void) {
    int i;
    void *status;
    const char *semaphore_name = "semtex";
    pthread_t tids[TOT_THREADS];
    if ((semaphore = sem_open(semaphore_name, O_CREAT, 0644, INITIAL_SEMAPHORE_VALUE)) == SEM_FAILED) {
        fprintf(stderr, "Fatal Error: %s\n", strerror(errno));
        abort();
    }
    for (i = 0; i < TOT_THREADS; i++) {
        pthread_create(&tids[i], NULL, myThreadFun, (void *)(intptr_t)i);
    }
    for (i = 0; i < TOT_THREADS; i++) {
        pthread_join(tids[i], &status);
        assert(*(int *)status == 1729);
    }
    /* The following line was missing from your original code. */
    sem_unlink(semaphore_name);
    pthread_exit(0);
}

Barrier call stuck in Open MPI (C program)

I am practicing barrier synchronization using Open MPI message communication. I have created an array of structs called containers. Each container is linked to its neighbor on the right, and the two elements at both ends are also linked, forming a circle.
In the main() testing client, I run MPI with multiple processes (mpiexec -n 5 ./a.out), and they are supposed to be synchronized by calling the barrier() function; however, my code is stuck at the last process. I am looking for help with the debugging. Please see my code below:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

typedef struct container {
    int labels;
    struct container *linked_to_container;
    int sense;
} container;

container *allcontainers; /* an array for all containers */
int size_containers_array;

int get_next_container_id(int current_container_index, int max_index)
{
    if (max_index - current_container_index >= 1)
    {
        return current_container_index + 1;
    }
    else
        return 0; /* elements at two ends are linked */
}

container *get_container(int index)
{
    return &allcontainers[index];
}

void container_init(int num_containers)
{
    allcontainers = (container *) malloc(num_containers * sizeof(container)); /* is this right to malloc memory on the array of container when the struct size is still unknown? */
    size_containers_array = num_containers;
    int i;
    for (i = 0; i < num_containers; i++)
    {
        container *current_container = get_container(i);
        current_container->labels = 0;
        int next_container_id = get_next_container_id(i, num_containers - 1); /* max index in all_containers[] is num_containers-1 */
        current_container->linked_to_container = get_container(next_container_id);
        current_container->sense = 0;
    }
}

void container_barrier()
{
    int current_container_id, my_sense = 1;
    MPI_Request request[size_containers_array];
    MPI_Status status[size_containers_array];
    MPI_Comm_rank(MPI_COMM_WORLD, &current_container_id);
    int tag = current_container_id; /* must be set after the rank is known */
    container *current_container = get_container(current_container_id);
    int next_container_id = get_next_container_id(current_container_id, size_containers_array - 1);
    /* send asynchronous message to the next container, wait, then do blocking receive */
    MPI_Isend(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, &request[current_container_id]);
    MPI_Wait(&request[current_container_id], &status[current_container_id]);
    MPI_Recv(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void free_containers()
{
    free(allcontainers);
}

int main(int argc, char **argv)
{
    int my_id, num_processes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    container_init(num_processes);
    printf("Hello world from thread %d of %d \n", my_id, num_processes);
    container_barrier();
    printf("passed barrier \n");
    MPI_Finalize();
    free_containers();
    return 0;
}
The problem is the series of calls:
MPI_Isend()
MPI_Wait()
MPI_Recv()
This is a common source of confusion. When you use a "nonblocking" call in MPI, you are essentially telling the MPI library that you want to do some operation (a send) with some data (my_sense). MPI gives you back an MPI_Request object with the guarantee that the operation will be finished by the time a completion function (MPI_Wait, MPI_Test, and friends) completes that MPI_Request.
The problem you have here is that you're calling MPI_Isend and immediately calling MPI_Wait before ever calling MPI_Recv on any rank. This means that all of those send calls get queued up but never actually have anywhere to go because you've never told MPI where to put the data by calling MPI_Recv (which tells MPI that you want to put the data in my_sense).
The reason this works part of the time is that MPI expects that things might not always sync up perfectly. For smaller messages (which you have), MPI reserves some buffer space and will let your send operations complete, with the data stashed in that temporary space for a while until you call MPI_Recv later to tell MPI where to move the data. Eventually though, this won't work anymore. The buffers will be full and you'll need to actually start receiving your messages. For you, this means that you need to switch the order of your operations. Instead of doing a non-blocking send, you should do a non-blocking receive first, then do your blocking send, then wait for your receive to finish:
MPI_Irecv()
MPI_Send()
MPI_Wait()
The other option is to turn both functions into nonblocking functions and use MPI_Waitall instead:
MPI_Isend()
MPI_Irecv()
MPI_Waitall()
This last option is usually the best. The only thing you'll need to be careful about is that you don't overwrite your own data. Right now you're using the same buffer for both the send and receive operations. If both of these are happening at the same time, there are no guarantees about the ordering. Normally this doesn't make a difference; whether you send the message first or receive it doesn't really matter. In this case, however, it does. If you receive data first, you'll end up sending the same data back out again instead of sending the data you had before the receive operation. You can solve this by using a temporary buffer to stage your data and move it to the right place when it's safe.
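Applied to container_barrier(), a sketch of that last option might look like the following; the separate their_sense variable is the temporary staging mentioned above, and the fixed tag of 0 is an assumption:
/* Sketch: post both nonblocking calls, then complete them together. */
int my_sense = 1, their_sense = 0;   /* separate receive buffer */
MPI_Request requests[2];
MPI_Isend(&my_sense, 1, MPI_INT, next_container_id, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(&their_sense, 1, MPI_INT, next_container_id, 0, MPI_COMM_WORLD, &requests[1]);
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
my_sense = their_sense;              /* safe to overwrite only after MPI_Waitall */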

Master-Slave non-blocking listener

Currently I'm trying to create a master-slave program with a "listener" loop in which the master waits for a message from the slaves to make a decision from there. But, despite using non-blocking MPI routines, I am experiencing an error. Do I need to use some blocking routine?
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char** argv)
{
// Variable Declarations
int rank, size;
MPI_Request *requestList,requestNull;
MPI_Status status;
// Start MPI
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if( rank == 0 )
{
// Process Zero
int dataOut=13, pr;
float dataIn = -1;
requestList =(MPI_Request*)malloc((size-1)*sizeof(MPI_Request));
while(1){
dataIn = -1;
// We do NOT need to wait for the MPI_ Isend(s), it is the job of the receiver processes.
for(pr=1;pr<size;pr++)
{
MPI_Irecv(&dataIn,1,MPI_FLOAT,pr,1,MPI_COMM_WORLD,&(requestList[pr-1]));
}
if((dataIn > 1.5)){
printf("From the process: %f\n", dataIn);
break;
}
}
}
else
{
// Receiver Process
float message;
int index;
//MPI_Request request;
MPI_Status status;
while(1){
message = random()/(double)1147483648;
// Send the message back to the process zero
MPI_Isend(&message,1,MPI_FLOAT,0,1,MPI_COMM_WORLD, &requestNull);
if(message > 1.5)
break;
}
}
MPI_Finalize();
return 0;
}
The problem seems to be that you're never waiting for your MPI calls to complete.
The way non-blocking calls work in MPI is that when you issue a non-blocking call (like MPI_IRECV), the last input parameter is an MPI_REQUEST object. When the initiation call (MPI_IRECV) returns, that request object contains the information for the non-blocking call. However, the operation isn't done yet, and you have no guarantee that it is done until you use a completion call (MPI_WAIT, MPI_TEST, and friends) on the request.
In your case, you probably don't need the non-blocking calls at all, since you rely on the information received during the MPI_IRECV call in the very next line. You should probably just replace your non-blocking call with a blocking MPI_RECV call and use MPI_ANY_SOURCE so you don't have to post a separate receive call for each rank in the communicator. Alternatively, you can use MPI_WAITANY to complete your non-blocking calls, but then you'll need to worry about cleaning up all of the extra operations when you finish.
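A hedged sketch of that blocking variant for rank 0's listener loop, keeping the tag of 1 from the original code:
/* Sketch: one blocking receive matches whichever slave reports first. */
float dataIn;
MPI_Status status;
while (1) {
    MPI_Recv(&dataIn, 1, MPI_FLOAT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
    printf("From process %d: %f\n", status.MPI_SOURCE, dataIn);
    if (dataIn > 1.5)
        break;
}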

MPI parallel Programming

I have three user-defined functions in a Monte Carlo simulator program. In main() they are called with the appropriate parameters.
It is a serial program.
How do I convert it into a parallel program?
The steps I have taken so far to turn the serial program into an MPI parallel program are:
#include <conio.h>
#include <stdio.h>
#include "mpi.h"

//Global Variables Declared
#define a=4;
#define b=2;
#define c=4;
#define d=6;

function1(Parameter4, Parameter))
{
    // body of function
}

function2( parameter 1, parameter2)
{
    //body of function
}

int main(int argc, char *argv[])
{
    // Local Variables defined
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    function1(a, b);
    function2(c, d);
    MPI_Finalize();
}
Now my questions are:
Where do I specify
the number of processors (like running it with 2, 4, 6, 8 processors)
the Send and Recv methods
How do I see the graphs of the output using different numbers of processors?
Could anyone try to help me please, as I am new to this language and don't know a lot about it.
MPI is a communications protocol. We can't help you without knowing what platform/library you are working with. If you know what library you are working with, odds are good that there is a sample on the web somewhere showing how to implement Monte Carlo simulations with it.
First of all, your sample code is not valid C code. The #define lines should look like:
#define a 4
The number of processors is typically specified when running the program, which is usually done via
mpiexec -np PROCS MPIPROG
or the like, where PROCS is the number of MPI tasks to start and MPIPROG is the name of the compiled MPI executable. There is also a possibility to spawn MPI tasks from within MPI, but this doesn't work everywhere, so I would not recommend it. The advantage of specifying the number of tasks at runtime is that you can choose how many tasks to use depending on the platform you are working on.
Send and Recv can be used anywhere in the code, after MPI_Init has been called, and before MPI_Finalize has been called. As an example, to send an integer from task 0 to task 1, you would use something like
int number;
if (rank == 0) {
    /* compute the number on proc 0 */
    number = some_complex_function();
    /* send the number to proc 1 */
    MPI_Send(&number, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
} else if (rank == 1) {
    /* receive the number from proc 0 */
    MPI_Recv(&number, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
// now both procs can do something with the number
Note that in this case, task 1 will have to wait until the number is received from task 0, so in a real application you might want to give task 1 some work to do while task 0 computes some_complex_function().
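For example, a sketch of that overlap on task 1, where some_other_work() is a hypothetical placeholder for the useful work:
/* Sketch: post the receive early, do useful work, then complete it. */
int number;
MPI_Request req;
if (rank == 1) {
    MPI_Irecv(&number, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &req);
    some_other_work();                  /* hypothetical useful work */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* number is valid only after this */
}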
