I am trying to learn MPI programming and have written the following program. It adds up an entire row of an array and outputs the sum. At rank 0 (or process 0), it calls on its slave ranks to do the calculation. I want to do this using only two other slave ranks/processes. Whenever I try to invoke the same rank twice, as shown in the code below, my code just hangs in the middle and doesn't finish executing. If I don't call the same rank twice, the code works correctly.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int tag2 = 1;
    int arr[30] = {0};
    MPI_Request request;
    MPI_Status status;

    printf ("\n--Current Rank: %d\n", world_rank);

    int index;
    int source = 0;
    int dest;

    if (world_rank == 0)
    {
        int i;
        printf("* Rank 0 excecuting\n");

        index = 0;
        dest = 1;
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = i + 1;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);

        index = 0;
        dest = 2;
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = 0;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);

        index = 0;
        dest = 2; // Problem happens here when I try to call the same destination (or rank 2) twice
                  // If I change this dest value to 3 and run using: mpirun -np 4 test, this code will work correctly
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = 1;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
    }
    else
    {
        int sum = 0;
        int i;
        MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
        MPI_Wait (&request, &status);
        for(i = 0; i < 30; i++)
        {
            sum = arr[i] + sum;
        }
        printf("\nSum is: %d at rank: %d\n", sum, world_rank);
    }

    MPI_Finalize();
}
Result when using: mpirun -np 3 test
--Current Rank: 2
Sum is: 0 at rank: 2
--Current Rank: 0
* Rank 0 excecuting
--Current Rank: 1
Sum is: 524800 at rank: 1
//Program hangs here and wouldn't show sum of 30
Please let me know how I can call the same rank twice, for example when I only have two other slave processes available.
Please show with an example if possible.
In MPI, each process executes the same code and, as you're doing, you differentiate the different processes primarily through checking rank in if/else statements. The master process with rank 0 is doing 3 sends: a send to process 1, then two sends to process 2. The slave processes each do only one receive, which means that rank 1 receives its first message and rank 2 receives its first message. When you call the third MPI_Send on process 0, there is not and will not be any slave waiting to receive the message after that point, as the slaves have finished executing their else block. The program gets blocked as the master waits to send the final message.
In order to fix that, you have to make sure the slave of rank 2 performs two receives, either by adding a loop for that process only or by repeating, for that process only (i.e. inside an if(world_rank == 2) check), the block of code
sum = 0; // reset sum
MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i < 30; i++)
{
    sum = arr[i] + sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
TL;DR: Just a remark that the master/slave approach isn't to be favoured and may durably shape a programmer's habits, leading to poor code when put in production
Although Clarissa is perfectly correct and her answer is very clear, I'd like to add a few general remarks, not about the code itself, but about parallel computing philosophy and good habits.
First a quick preamble: when one wants to parallelise one's code, it can be for two main reasons: making it faster and/or permitting it to handle larger problems by overcoming the limitations (like memory limits) found on a single machine. But in all cases, performance matters, and I will always assume that MPI (or generally speaking parallel) programmers are interested in and concerned about performance. So the rest of my post will suppose that you are.
Now the main reason for this post: over the past few days, I've seen a few questions here on SO about MPI and parallelisation, obviously coming from people eager to learn MPI (or OpenMP for that matter). This is great! Parallel programming is great and there'll never be enough parallel programmers. So I'm happy (and I'm sure many SO members are too) to answer questions helping people learn how to program in parallel. And in the context of learning how to program in parallel, you have to write some simple codes, doing simple things, in order to understand what the API does and how it works. These programs might look stupid from a distance and very inefficient, but that's fine, that's how learning works. Everybody learned this way.
However, you have to keep in mind that these programs you write are only that: API learning exercises. They're not the real thing and they do not reflect the philosophy of what an actual parallel program is or should be. And what drives my answer here is that I've seen, here and in other questions and answers, the notion of a "master" process and its "slaves" put forward recurrently. And that is wrong, fundamentally wrong! Let me explain why:
As Clarissa perfectly pinpointed, "in MPI, each process executes the same code". The idea is to find a way of making several processes interact and work together to solve a (possibly larger) problem (hopefully faster). But amongst these processes, none gets any special status; they are all equal. They are given an id to be able to address them, but rank 0 is no better than rank 1 or rank 1025... By artificially deciding that process #0 is the "master" and the others are its "slaves", you break this symmetry, and it has consequences:
Now that rank #0 is the master, it commands, right? That's what a master does. So it will be the one getting the information necessary to run the code, it will distribute its share to the workers, and it will instruct them to do the processing. Then it will wait for the processing to be concluded (possibly getting itself busy in-between, but more likely just waiting or poking the workers, since that's what a master does), collect the results, do a bit of reassembling and output them. Job done! What's wrong with that?
Well, the following are wrong:
During the time the master gets the data, the slaves are sitting idle. This is sequential and inefficient processing...
Then the distribution of the data and of the work to do implies a lot of transfers. This takes time, and since it is solely between process #0 and all the others, it might create a lot of congestion on a single network link.
While the workers do their work, what should the master do? Work as well? If yes, it might not be readily available to handle requests from the slaves when they come, delaying the whole parallel processing. Wait for these requests? Then it wastes a lot of computing power by sitting idle... Ultimately, there is no good answer.
Then points 1 and 2 are repeated in reverse order, with the gathering of the results and the outputting of the results. That's a lot of data transfers and sequential processing, which will badly damage the global scalability, effectiveness and performance.
So I hope you now see why the master/slaves approach is (usually, not always, but very often) wrong. And the danger I see from the questions and answers I've read over the past days is that you might get your mind formatted in this approach as if it was the "normal" way of thinking in parallel. Well, it is not! Parallel programming is symmetry. It is handling the problem globally, in all places at the same time. You have to think parallel from the start and see your code as a global parallel entity, not just a bunch of processes that need to be instructed on what to do. Each process is its own master, dealing with its peers on an equal footing. Each process should (as much as possible) acquire its data by itself (making it parallel processing); decide what to do based on the number of peers involved in the processing and its own id; exchange information with its peers when necessary, be it locally (point-to-point communications) or globally (collective communications); and produce its own share of the result (again leading to parallel processing)...
OK, that's a bit extreme a requirement for people just starting to learn parallel programming, and I by no means want to tell you that your learning exercises should be like this. But keep the goal in mind and don't forget that API learning examples are only API learning examples, not reduced models of actual codes. So keep on experimenting with MPI calls to understand what they do and how they work, but try to slowly tend towards a symmetrical approach in your examples. That can only be beneficial for you in the long term.
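To make this concrete, here is a minimal sketch (my illustration, not code from the question) of the row-summing exercise written symmetrically: every rank fills and sums only its own share of the 30 values, and a collective produces the global result on all ranks, with nobody acting as a master:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank generates and sums only its own share of the 30 values
    int local_sum = 0;
    for (int i = rank; i < 30; i += size)
        local_sum += i + 1;

    // Collective reduction: every rank ends up with the global sum
    int global_sum = 0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: global sum = %d\n", rank, global_sum);
    MPI_Finalize();
    return 0;
}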
Sorry for this lengthy and somewhat off topic answer, and good luck with your parallel programming.
Related
I am trying to send data between all processes where I have an array on each process such as
int local_data[] = {0*rank,1*rank,2*rank,3*rank};
I have a corresponding flag array where each value indicates which process the value at the same index should be sent to, for example:
int part[] = {0,1,3,2};
so this means local_data[0] should go to the process with rank 0, local_data[2] should go to the process with rank 3, and so on.
The values in the flag array change from one process to another (all within the range 0 to P-1, where P is the total number of processes available).
Using this, what I am currently doing is:
for(int i=0; i<local_len; i++){
if(part[i] != rank){
MPI_Send(&local_data[i], 1,MPI_INT, part[i], 0, MPI_COMM_WORLD);
MPI_Recv(&temp,1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status );
recvbuf[j] = temp;
j++;
}
else{
recvbuf[j] = local_data[i];
j++;
}
}
where I am only sending and receiving data if part[i] != rank, to avoid sending to and receiving from the same process
recvbuf is the array I receive the values in for each process. It can be longer than the initial local_data length.
I also tried
MPI_Sendrecv(&local_data[i], 1,MPI_INT, part[i], 0, &temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
The program gets stuck both ways.
How do I go about solving this?
Is the All-to-All collective the way to go here?
Your basic problem is that your send call goes to a dynamically determined target, but there is no corresponding logic to determine which processes need to do a receive at all, and if so, from where.
If the logic of your application implies that everyone will send to everyone, then you can use MPI_Alltoall.
If everyone sends to some, but you know that you will receive exactly four messages, then you can combine MPI_Isend for the sends with MPI_Recv from MPI_ANY_SOURCE. Note that you need Isend because, strictly speaking, your code will deadlock. It may work if your MPI has an "eager mode" for small messages.
If the number of sends and the targets are entirely random, then you need something like MPI_Ibarrier to detect that all is over and done.
But I suspect you're leaving out major information here. Why is the length of local_data 4? Is the part array a permutation? Et cetera.
Following #GillesGouaillardet's advice, I used MPI_Alltoallv to solve this problem.
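For reference, here is a minimal sketch of how the counts and displacements for MPI_Alltoallv can be built from the part array (my own illustration using the names from the question: local_data, local_len, part, recvbuf, plus a packed send buffer; it assumes recvbuf is large enough to hold everything received):
int P;
MPI_Comm_size(MPI_COMM_WORLD, &P);

int sendcounts[P], sdispls[P], recvcounts[P], rdispls[P], pos[P];
int sendbuf[local_len];

// Count how many values go to each destination rank
for (int p = 0; p < P; p++) sendcounts[p] = 0;
for (int i = 0; i < local_len; i++) sendcounts[part[i]]++;

// Send displacements: values for the same destination are packed contiguously
sdispls[0] = 0;
for (int p = 1; p < P; p++) sdispls[p] = sdispls[p-1] + sendcounts[p-1];
for (int p = 0; p < P; p++) pos[p] = sdispls[p];
for (int i = 0; i < local_len; i++) sendbuf[pos[part[i]]++] = local_data[i];

// Every rank learns how many values it will receive from every other rank
MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

rdispls[0] = 0;
for (int p = 1; p < P; p++) rdispls[p] = rdispls[p-1] + recvcounts[p-1];

// Exchange the actual values; recvbuf ends up grouped by source rank
MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
              recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);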
Say I have 2 processes, P1 and P2, and both P1 and P2 are printing an array of 1000 data points. As we know, we can't guarantee anything about the order of the output: it may be that P1 prints its data first followed by P2, or vice versa, or the two outputs may get mixed. Now say I want to output the values of P1 first, followed by P2. Is there any way by which I can guarantee that?
I am attaching herewith a Minimal Reproducible Example in which the output gets mixed.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
int main( int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int myrank, size; // size will take care of the number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if(myrank==0)
    {
        int a[1000];
        for(int i=0;i<1000;i++)
        {
            a[i]=i+1;
        }
        for(int i=0;i<1000;i++)
        {
            printf(" %d",a[i]);
        }
    }

    if(myrank==1)
    {
        int a[1000];
        for(int i=0;i<1000;i++)
        {
            a[i]=i+1;
        }
        for(int i=0;i<1000;i++)
        {
            printf(" %d",a[i]);
        }
    }

    MPI_Finalize();
    return 0;
}
The only way I can think of to output the data sequentially is to send the data from, say, P1 to P0 and then print it all from P0. But then we would incur the extra cost of sending data from one process to another.
You have some additional options:
pass a token: processes can block waiting for a message, print whatever they have, then send to the next rank (see the sketch after this list)
let something else deal with ordering: each process prefixes its rank to the output, then you can sort the output by rank
let's say this was a file: each rank could compute where it should write, and then everybody can carry out a write to the file at the correct location in parallel (which is what MPI-IO apps will do)
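Here is a minimal sketch of the token-passing option, written against the variable names of the example above (myrank, size and the array a); note that, as discussed further below, the launcher's stdout forwarding may still interleave output from different ranks:
int token = 0;
if (myrank > 0)        // wait until the previous rank has finished printing
    MPI_Recv(&token, 1, MPI_INT, myrank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

for (int i = 0; i < 1000; i++)
    printf(" %d", a[i]);
fflush(stdout);

if (myrank < size - 1) // hand the token to the next rank
    MPI_Send(&token, 1, MPI_INT, myrank + 1, 0, MPI_COMM_WORLD);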
Now say I want to output the values of P1 first followed by P2. Is
there any way by which I can guarantee that?
This is not how MPI is meant to be used; in fact, it runs against parallelism in general, IMO. Coordinating the printing of output to the console among processes will greatly degrade the performance of the parallel version, which defeats one of the purposes of parallelism, i.e., reducing the overall execution time.
Most of the time one is better off just making one process responsible for printing the output to the console (typically the master process, i.e., the process with rank = 0).
Citing #Gilles Gouaillardet:
The only safe option is to send all the data to a given rank, and then print the data from that rank.
You could try using MPI_Barrier to coordinate the processes in a way that would print the output as you want; however (citing #Hristo Iliev):
Using barriers like that only works for local launches when (and if)
the processes share the same controlling terminal. Otherwise, it is
entirely to the discretion of the I/O redirection mechanism of the MPI
implementation.
If it is for debugging purposes, you can use a good MPI-aware debugger that allows you to look into the contents of the data of each process. Alternatively, you can limit the output to be printed by one process at a time per run, so that you can check whether all the processes have the data that they should have.
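For completeness, here is a minimal sketch of the "send all the data to a given rank and print from that rank" approach quoted above, again using the names from the example (myrank, size, and a 1000-element array per rank; malloc needs <stdlib.h>):
int a[1000];
for (int i = 0; i < 1000; i++)
    a[i] = i + 1;

int *all = NULL;
if (myrank == 0)
    all = malloc((size_t)size * 1000 * sizeof(int));

// Collect every rank's array on rank 0, ordered by rank
MPI_Gather(a, 1000, MPI_INT, all, 1000, MPI_INT, 0, MPI_COMM_WORLD);

if (myrank == 0)
{
    for (int p = 0; p < size; p++)
        for (int i = 0; i < 1000; i++)
            printf(" %d", all[p * 1000 + i]);
    free(all);
}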
I am trying to write to the same binary file using MPI. I set the offset for each process at the beginning, as per the rank. Then the following code snippet in C runs. All MPI processes execute, compute the value, and write it to the exact offset as set.
The problem I am facing is that, out of, say, 32 processes, one process finishes in 2 hours while the rest of the processes keep running for more than 24 hours. The thing is, they compute the values as expected but take so much time. It seems like a deadlock situation, with each process waiting for some resource. But I am not sharing/communicating between the processes. I am just using MPI_File_write_at to write at a specific location in the binary file. I should mention that each process computes a huge amount of data, so storing it temporarily seemed inappropriate.
for(i=1;i<=limit;i++)
{
for(j=i+1;j<=limit;j++)
{
if(my_rank == step%num_cpus)
{
Calc = Calculation();
buf[0] = (double)Calc;
MPI_File_write_at(outFile, OUT_ofst, buf, 1, MPI_DOUBLE, &status);
Calc = 0.0;
OUT_ofst += num_cpus*MPI_File_write_at(sizeof(double));
count++;
}
step++;
}
}
I am new to MPI and I guess people must have had similar issues while executing in MPI. Can anyone help me out please! I can provide more details if needed.
As I have found no answer to my question so far and am on the verge of going crazy about the problem, I'll just ask the question that is tormenting my mind ;-)
I'm working on a parallelization of a node-elimination algorithm I already programmed. Target environment is a cluster.
In my parallel program I distinguish between a master process (in my case rank 0) and its working slaves (every rank except 0).
My idea is that the master keeps track of which slaves are available and then sends them work. For that reason, and a few others, I am trying to establish a workflow based on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which each position represents a rank and holds either 0 for a working process or 1 for an available process (so if schedule[1]==1, rank 1 is available for work).
If a process is done with its work, it puts a 1 into the array on the master, signalling its availability. The code I tried for that is as follows:
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // an exclusive window is locked on process 0
printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the line myrank of schedule is put into process 0
printf("Process %d:\t put operation called\n",myrank);
MPI_Win_unlock(0,win); // the window is unlocked
It worked perfectly, especially when the master process was synchronised with a barrier at the end of the lock, because then the master's output was produced after the put operation.
As a next step I tried to let the master check on a regular basis whether there are available slaves or not. Therefore I created a while loop that repeats until every process has signalled its availability (I repeat that this is a program to teach me the principles; I know that the implementation still doesn't do what I want).
The loop, in its basic variant, just prints my array schedule and then checks in a function fnz whether there are working processes other than the master:
while(j!=1){
    printf("Process %d:\t following schedule evaluated:\n",myrank);
    for(i=0;i<size;i++) printf("%d\t",schedule[i]); // print the schedule
    printf("\n");
    j=fnz(schedule);
}
And then the concept blew up. After inverting the process, i.e. having the master get the required information from the slaves with a get instead of the slaves putting it into the master with a put, I found out that my main problem is acquiring the lock: the unlock command doesn't succeed, because in the case of the put the lock isn't granted at all, and in the case of the get the lock is only granted when the slave process is done with its work and waiting in a barrier. In my opinion there has to be a serious error in my thinking. It can't be the idea of passive RMA that the lock can only be obtained when the target process is in a barrier synchronising the whole communicator. Then I could just go along with standard Send/Recv operations. What I want to achieve is that process 0 works all the time, delegating work, and is able to identify, via RMA on the slaves, to whom it can delegate.
Can someone please help me and explain how I can get process 0 to yield so that the other processes can acquire locks?
Thank you in advance!
UPDATE:
I'm not sure if you have ever worked with a lock, and I just want to stress that I am perfectly able to get an updated copy of a remote memory window. If I get the availability from the slaves, the lock is only granted when the slaves are waiting in a barrier. What I got to work is that process 0 performs lock-get-unlock while processes 1 and 2 simulate work, such that process 2 is occupied noticeably longer than process 1. What I expect as a result is that process 0 prints the schedule (0,1,0), because process 0 isn't asked at all whether it's working, process 1 is done with its work, and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since both slaves are then ready for new work. What I actually get is that the slaves only grant the lock to process 0 when they are waiting in a barrier, so the first and only output I get at all is the last one I expect, showing me that the lock was granted to each individual process only once it was done with its work. So if someone could please tell me when a lock can be granted by the target process, instead of trying to confuse my knowledge about passive RMA, I would be very grateful.
First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation in the target process
For portability and performance reasons the memory for the window should be allocated using MPI_ALLOC_MEM:
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

for (int i = 0; i < size; i++)
{
    schedule[i] = 0;
}

MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
    MPI_COMM_WORLD, &win);

...

MPI_Win_free(&win);
MPI_Free_mem(schedule);
Step 2: Memory synchronisation at the target
The MPI standard forbids concurrent access to the same location in the window (§11.3 from the MPI-2.2 specification):
It is erroneous to have concurrent conflicting accesses to the same memory location in a
window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.
Therefore each access to schedule[] in the target has to be protected by a lock (shared since it only reads the memory location):
while (!ready)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    ready = fnz(schedule, oldschedule, size);
    MPI_Win_unlock(0, win);
}
Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using non-RDMA capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):
6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.
If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of the private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it become necessary to update the public window copy, even if the window owner does not execute any related synchronization call.
This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):
Advice to users. A user can write correct programs by following the following rules:
...
lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are
protected by shared locks, both for local accesses and for RMA accesses.
Step 3: Providing the correct parameters to MPI_PUT at the origin
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window. The correct invocation given that the window at the target was created with disp_unit == sizeof(int) is:
int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
The local value of one is thus transferred into rank * sizeof(int) bytes following the beginning of the window at the target. If disp_unit was set to 1, the correct put would be:
MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
Step 4: Dealing with implementation specifics
The above detailed program works out-of-the-box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations - rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand, the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to start the program as follows in order to specifically select the pt2pt component:
$ mpiexec --mca osc pt2pt ...
A fully working C99 sample code follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
        diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
        int res = 0;

        if (starttime < 0.0) starttime = MPI_Wtime();

        printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
        for (int i = 0; i < size; i++)
        {
            printf("\t%d", schedule[i]);
            res += schedule[i];
            oldschedule[i] = schedule[i];
        }
        printf("\n");

        return(res == size-1);
    }
    return 0;
}
int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int *oldschedule = malloc(size * sizeof(int));

        // Use MPI to allocate memory for the target window
        int *schedule;
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

        for (int i = 0; i < size; i++)
        {
            schedule[i] = 0;
            oldschedule[i] = -1;
        }

        // Create a window. Set the displacement unit to sizeof(int) to simplify
        // the addressing at the originator processes
        MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
            MPI_COMM_WORLD, &win);

        int ready = 0;
        while (!ready)
        {
            // Without the lock/unlock schedule stays forever filled with 0s
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            ready = fnz(schedule, oldschedule, size);
            MPI_Win_unlock(0, win);
        }
        printf("All workers checked in using RMA\n");

        // Release the window
        MPI_Win_free(&win);
        // Free the allocated memory
        MPI_Free_mem(schedule);
        free(oldschedule);

        printf("Master done\n");
    }
    else
    {
        int one = 1;

        // Worker processes do not expose memory in the window
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        // Simulate some work based on the rank
        sleep(2*rank);

        // Register with the master
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);

        printf("Worker %d finished RMA\n", rank);

        // Release the window
        MPI_Win_free(&win);

        printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}
Sample output with 6 processes:
$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule: 0 0 0 0 0 0
[ 1.995] Schedule: 0 1 0 0 0 0
Worker 1 finished RMA
[ 3.989] Schedule: 0 1 1 0 0 0
Worker 2 finished RMA
[ 5.988] Schedule: 0 1 1 1 0 0
Worker 3 finished RMA
[ 7.995] Schedule: 0 1 1 1 1 0
Worker 4 finished RMA
[ 9.988] Schedule: 0 1 1 1 1 1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done
The answer by Hristo Iliev works perfectly if I use newer versions of the Open MPI library.
However, on the cluster we are currently using, this is not possible, and for the older versions there was deadlock behaviour for the final unlock calls, as described by Hristo. Adding the option --mca osc pt2pt did solve the deadlock in a sense, but the MPI_Win_unlock calls still didn't seem to complete until the process owning the accessed variable did its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.
Therefore, from a pragmatic point of view, though strictly speaking leaving the topic of passive RMA synchronisation (for which I do apologise), I would like to point out a workaround which makes use of external files, for those who are stuck with old versions of the Open MPI library, so they don't have to lose as much time as I did:
You basically create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you don't even have to have a master process dedicated only to the bookkeeping of the slaves: it can also perform a job. Anyway, every process can look in this file to see which job is to be done next and possibly determine that everything is done.
The important point is that this information file must not be accessed by multiple processes at the same time, as this might cause work to be duplicated, or worse. The equivalent of the locking and unlocking of the window in MPI is most easily imitated here by using a lock file: this file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish by checking, with a slight time delay, whether the lock file still exists.
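A minimal sketch of such a lock-file scheme (my own illustration; the file name is arbitrary, and it relies on open() with O_CREAT|O_EXCL being atomic, which may not hold on every network file system):
#include <fcntl.h>
#include <unistd.h>

// Try to create the lock file exclusively; spin with a small delay until we succeed.
void acquire_lock(const char *lockname)
{
    int fd;
    while ((fd = open(lockname, O_CREAT | O_EXCL | O_WRONLY, 0600)) < 0)
        usleep(100000);   // another process holds the lock: wait 0.1 s and retry
    close(fd);
}

// Releasing the lock is simply deleting the file.
void release_lock(const char *lockname)
{
    unlink(lockname);
}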
The full information can be found here.
Given the code below, I get a segmentation fault if I run it with n>16.
I think it has something to do with the stack, but I can't figure it out. Could anyone give me a hand? The code is not mine and not really important; I would just like someone to help me understand what is happening. This SO question is very similar, but there's not enough information (the person who posts the answer briefly talks about the problem, but then goes on to talk about a different language). Besides, notice that with two gigs of RAM and no recursion, I can (if I'm doing it right) successfully create more than 16000 threads (though the OS only creates about 500 and runs about 300). Anyway, where am I getting the seg fault here and why? Thanks.
#include <pthread.h>
#include <stdio.h>
static void* fibonacci_thread( void* arg ) {
    int n = (int)arg, fib;
    pthread_t th1, th2;
    void* pvalue; /* Holds the value */

    switch (n) {
        case 0: return (void*)0;
        case 1: /* Fallthru, Fib(1)=Fib(2)=1 */
        case 2: return (void*)1;
        default: break;
    }

    pthread_create(&th1, NULL, fibonacci_thread, (void*)(n-1));
    pthread_create(&th2, NULL, fibonacci_thread, (void*)(n-2));
    pthread_join(th1, &pvalue);
    fib = (int)pvalue;
    pthread_join(th2, &pvalue);
    fib += (int)pvalue;
    return (void*)fib;
}

int main(int argc, char *argv[])
{
    int n = 15;
    printf ("%d\n", (int)fibonacci_thread((void*)n));
    return 0;
}
This is not a good way to do a Fibonacci sequence :-)
Your first thread starts two others, each of those starts two others and so forth. So when n > 16, you may end up with a very large number of threads (in the thousands) (a).
Unless your CPU has way more cores than mine, you'll be wasting your time running thousands of threads for a CPU-bound task like this. For purely CPU bound tasks, you're better off having as many threads as there are physical execution engines (cores or CPUs) available to you. Obviously that changes where you're not purely CPU bound.
If you want an efficient recursive (non-threaded) Fibonacci calculator, you should use something like (pseudo-code):
def fib(n, n ranges from 1 to infinity):
    if n is 1 or 2:
        return 1
    return fib(n-1) + fib(n-2)
Fibonacci isn't even really that good for non-threaded recursion since the problem doesn't reduce very fast. By that, I mean calculating fib(1000) will use 1000 stack frames. Compare this with a recursive binary tree search where only ten stack frames are needed. That's because the former only removes 1/1000 of the search space for each stack frame while the latter removes one half of the remaining search space.
The best way to do Fibonacci is with iteration:
def fib(n, n ranges from 1 to infinity):
    if n is 1 or 2:
        return 1
    last2 = 1, last1 = 1
    for i ranges from 3 to n:
        last0 = last2 + last1
        last2 = last1
        last1 = last0
    return last0
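In C, the same iterative scheme could look like the following sketch (note that long long overflows beyond fib(92)):
long long fib(int n)
{
    if (n <= 2)
        return n > 0;              // fib(1) = fib(2) = 1; treat n <= 0 as 0
    long long last2 = 1, last1 = 1, last0 = 0;
    for (int i = 3; i <= n; i++)
    {
        last0 = last2 + last1;     // next number is the sum of the previous two
        last2 = last1;
        last1 = last0;
    }
    return last0;
}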
Of course, if you want a blindingly fast Fibonacci generator, you write a program to generate all the ones you can store (in, for example, a long value) and write out a C structure to contain them. Then incorporate that output into your C program and your runtime "calculations" will blow any other method out of the water. This is your standard "trade off space for time" optimisation method:
long fib (size_t n) {
    static long fibs[] = {0, 1, 1, 2, 3, 5, 8, 13, ...};
    if (n >= sizeof(fibs) / sizeof(*fibs))
        return -1;
    return fibs[n];
}
These guidelines apply to most situations where the search space doesn't reduce that fast (not just Fibonacci).
(a) Originally, I thought this would be 2^16 but, as the following program shows (and thanks to Nemo for setting me straight), it's not quite that bad - I didn't take into account the reducing nature of the spawned threads as you approached fib(0):
#include <stdio.h>
static count = 0;
static void fib(int n) {
if (n <= 2) return;
count++; fib(n-1);
count++; fib(n-2);
}
int main (int argc, char *argv[]) {
fib (atoi (argv[1]));
printf ("%d\n", count);
return 0;
}
This is equivalent to the code you have, but it simply increments a counter for each spawned thread rather than actually spawning them. The number of threads for various input values are:
  N     Threads
---    --------
  1           0
  2           0
  3           2
  4           4
  5           8
  6          14
  :
 14         752
 15       1,218
 16       1,972
  :
 20      13,528
  :
 26     242,784
  :
 32   4,356,616
Now note that, while I said it wasn't as bad, I didn't say it was good :-) Even two thousand threads is a fair load on a system with each having their own kernel structures and stacks. And you can see that, while the increases start small, they quickly accelerate to the point where they're unmanageable. And it's not like the 32nd number is large - it's only a smidgeon over two million.
So the bottom line still stands: use recursion where it makes sense (where you can reduce the search space relatively quickly so as not to run out of stack space), and use threads where it makes sense (where you don't end up running so many that you overload the resources of the operating system).
Heck with it, might as well make this an answer.
First, check the return values of pthread_create and pthread_join. (Always, always, always check for errors. Just assert they are returning zero if you are feeling lazy, but never ignore them.)
Second, I could have sworn Linux glibc allocates something like 2 megabytes of stack per thread by default (configurable via pthread_attr_setstacksize). Sure, that is only virtual memory, but on a 32-bit system that still limits you to ~2000 threads total.
Finally, I believe the correct estimate for the number of threads this will spawn is basically fib(n) itself (how nicely recursive). Or roughly phi^n, where phi is (1+sqrt(5))/2. So the number of threads here is closer to 2000 than to 65000, which is consistent with my estimate for where a 32-bit system will run out of VM.
[edit]
To determine the default stack size for new threads on your system, run this program:
#include <stdio.h>
#include <pthread.h>

int main(int argc, char *argv[])
{
    size_t stacksize;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stacksize);
    pthread_attr_destroy(&attr);

    printf("Default stack size = %zd\n", stacksize);
    return 0;
}
[edit 2]
To repeat: This is nowhere near 2^16 threads.
Let f(n) be the number of threads spawned when computing fib(n).
When n=16, one thread spawns two new threads: One to compute fib(15) and another to compute fib(14). So f(16) = f(15) + f(14) + 1.
And in general f(n) = f(n-1) + f(n-2) + 1.
As it turns out, the solution to this recurrence is that f(n) is just the sum of the first n Fibonacci numbers:
1 + 1 + 2 + 3 + 5 + 8 // f(6)
+ 1 + 1 + 2 + 3 + 5 // + f(5)
+ 1 // + 1
= 1 + 1 + 2 + 3 + 5 + 8 + 13 // = f(7)
This is (very) roughly phi^(n+1), not 2^n. Total for f(16) is still measured in the low thousands, not tens of thousands.
[edit 3]
Ah, I see, the crux of your question is this (hoisted from the comments):
Thanks Nemo for a detailed answer. I did a little test and
pthread_created ~10,000 threads with just a while(1) loop inside so
they don't terminate... and it did! True that the OS was smart enough
to only create about 1000 and run an even smaller number, but it
didn't run out of stack. Why do I not get a segfault when I generate
lots more than THREAD_MAX, but I do when I do it recursively?
Here is my guess.
You only have a few cores. At any time, the kernel has to decide which threads are going to run. If you have (say) 2 cores and 500 threads, then any particular thread is only going to run 1/250 of the time. So your main loop spawning new threads is not going to run very often. I am not even sure whether the kernel's scheduler is "fair" with respect to threads within a single process, so it is at least conceivable that with 1000 threads the main thread never gets to run at all.
At the very least, each thread doing while (1); is going to run for 1/HZ on its core before giving up its time slice. This is probably 1ms, but it could be as high as 10ms depending on how your kernel was configured. So even if the scheduler is fair, your main thread will only get to run around once a second when you have thousands of threads.
Since only the main thread is creating new threads, the rate of thread creation slows to a crawl and possibly even stops.
Try this. Instead of while (1); for the child threads in your experiment, try while (1) pause();. (pause is from unistd.h.) This will keep the child threads blocked and should allow the main thread to keep grinding away creating new threads, leading to your crash.
And again, please check what pthread_create returns.
The first thing I would do is run a statement like printf("%i", PTHREAD_THREADS_MAX); and see what the value is; I don't think the OS's maximum number of threads is necessarily the same as the maximum number of pthreads, although I do see that you say you can successfully achieve 16000 threads with no recursion, so I'm just mentioning it as something I would check in general.
Should PTHREAD_THREADS_MAX be significantly greater than the number of threads you are achieving, I would start checking the return values of the pthread_create() calls to see if you are getting EAGAIN. My suspicion is that the answer to your question is that you're getting a segfault from attempting to use an uninitialised thread in a join...
Also, as paxdiablo mentioned, you're talking on the order of 2^16 threads at n=16 here (a little less, assuming some of them finish before the last are created); I would probably try to keep a log to see in what order each was created. Probably the easiest thing would be just to use the (n-1) (n-2) values as your log items; otherwise you would have to use a semaphore or similar to protect a counter...
printf might bog down; in fact, I wouldn't be surprised if that actually affected things by allowing more threads to finish before new ones were started, so I'd probably just log using file write(); it can be a simple file, and you should be able to get a feel for what's going on by looking at the patterns of numbers there. (Wait, that assumes file ops are thread safe; I think they are; it's been a while.)
Also, once checking for EAGAIN, you could try sleeping a bit and retrying; perhaps it will ramp up over time and the system is just being overwhelmed by the sheer number of thread requests and failing for some reason other than being out of resources; this would verify whether just waiting and restarting could get you where you want to be.
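As a minimal sketch (my own illustration, dropped into the fibonacci_thread function from the question; it needs <errno.h> and <unistd.h>), checking for EAGAIN and backing off could look like:
int rc;
while ((rc = pthread_create(&th1, NULL, fibonacci_thread, (void*)(n-1))) == EAGAIN)
    usleep(10000);                 /* out of thread resources: back off 10 ms and retry */
if (rc != 0)
{
    fprintf(stderr, "pthread_create failed: %d\n", rc);
    return (void*)0;               /* illustrative handling: give up in this worker */
}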
Finally, I might try to rewrite the function as a fork() (I know fork() is evil or whatever ;)) and see whether you have better luck there.
:)