I have a basic question about MPI programming in C. Essentially, what I want is a master process that spawns a specific number of child processes, collects some information from all of them (waits until all of the children finish), and calculates some metric; based on this metric, it decides if it has to spawn more threads. It keeps doing this until the metric meets some specific condition. I have searched through the literature, to no avail. How can this be done? Any pointers?
Thanks for the help.
Courtesy: An introduction to the Message Passing Interface (MPI) using C. In the "complete parallel program to sum an array", let's say, "for some lame reason", I want the master process to sum the contents of the array twice. I.e., in the first iteration, the master process starts the slave processes, which compute the sum of the arrays; once they are done and the master process has collected the values, I would like the master process to invoke another set of slave processes to do the computation again. Why would the code below not work? I added a while loop around the master-process block that spawns the slave processes.
#include <stdio.h>
#include <mpi.h>
#define max_rows 100000
#define send_data_tag 2001
#define return_data_tag 2002
int array[max_rows];
int array2[max_rows];
int main(int argc, char **argv)
{
long int sum, partial_sum, number_of_times;
number_of_times=0;
MPI_Status status;
int my_id, root_process, ierr, i, num_rows, num_procs,
an_id, num_rows_to_receive, avg_rows_per_process,
sender, num_rows_received, start_row, end_row, num_rows_to_send;
/* Now replicate this process to create parallel processes.
* From this point on, every process executes a separate copy
* of this program */
ierr = MPI_Init(&argc, &argv);
root_process = 0;
/* find out MY process ID, and how many processes were started. */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
if(my_id == root_process) {
/* I must be the root process, so I will query the user
* to determine how many numbers to sum. */
//printf("please enter the number of numbers to sum: ");
//scanf("%i", &num_rows);
num_rows=10;
while (number_of_times<2)
{
number_of_times++;
start_row=0;
end_row=0;
if(num_rows > max_rows) {
printf("Too many numbers.\n");
exit(1);
}
avg_rows_per_process = num_rows / num_procs;
/* initialize an array */
for(i = 0; i < num_rows; i++) {
array[i] = i + 1;
}
/* distribute a portion of the vector to each child process */
for(an_id = 1; an_id < num_procs; an_id++) {
start_row = an_id*avg_rows_per_process + 1;
end_row = (an_id + 1)*avg_rows_per_process;
if((num_rows - end_row) < avg_rows_per_process)
end_row = num_rows - 1;
num_rows_to_send = end_row - start_row + 1;
ierr = MPI_Send( &num_rows_to_send, 1 , MPI_INT,
an_id, send_data_tag, MPI_COMM_WORLD);
ierr = MPI_Send( &array[start_row], num_rows_to_send, MPI_INT,
an_id, send_data_tag, MPI_COMM_WORLD);
}
/* and calculate the sum of the values in the segment assigned
* to the root process */
sum = 0;
for(i = 0; i < avg_rows_per_process + 1; i++) {
sum += array[i];
}
printf("sum %i calculated by root process\n", sum);
/* and, finally, I collect the partial sums from the slave processes,
* print them, and add them to the grand sum, and print it */
for(an_id = 1; an_id < num_procs; an_id++) {
ierr = MPI_Recv( &partial_sum, 1, MPI_LONG, MPI_ANY_SOURCE,
return_data_tag, MPI_COMM_WORLD, &status);
sender = status.MPI_SOURCE;
printf("Partial sum %i returned from process %i\n", partial_sum, sender);
sum += partial_sum;
}
printf("The grand total is: %i\n", sum);
}
}
else {
/* I must be a slave process, so I must receive my array segment,
* storing it in a "local" array, array1. */
ierr = MPI_Recv( &num_rows_to_receive, 1, MPI_INT,
root_process, send_data_tag, MPI_COMM_WORLD, &status);
ierr = MPI_Recv( &array2, num_rows_to_receive, MPI_INT,
root_process, send_data_tag, MPI_COMM_WORLD, &status);
num_rows_received = num_rows_to_receive;
/* Calculate the sum of my portion of the array */
partial_sum = 0;
for(i = 0; i < num_rows_received; i++) {
partial_sum += array2[i];
}
/* and finally, send my partial sum to the root process */
ierr = MPI_Send( &partial_sum, 1, MPI_LONG, root_process,
return_data_tag, MPI_COMM_WORLD);
}
ierr = MPI_Finalize();
}
You should start by looking at MPI_Comm_spawn and collective operations. To collect information from old child processes, one would typically use MPI_Reduce.
This stackoverflow question might also be helpful.
...to spawn more threads...
I guess you meant the right thing since you used "process" instead of "thread" mostly, but just to clarify: MPI only deals with processes and not with threads.
I'm not sure how well you know MPI already - let me know if my answer was any help or if you need more hints.
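For illustration only, here is a rough sketch of how such a spawn-and-collect loop could look on the master side. The worker executable name ./worker, the stopping condition and the rule for choosing how many children to spawn next are all made-up placeholders; the workers would have to call the matching MPI_Reduce on the intercommunicator they get from MPI_Comm_get_parent().
/* Illustrative sketch only, not a drop-in solution: the master repeatedly
 * spawns a hypothetical "./worker" executable and collects one double per
 * round via a reduction over the intercommunicator returned by
 * MPI_Comm_spawn. Error handling is omitted for brevity. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double metric = 0.0;
    int n_children = 4;                /* initial number of children (placeholder) */

    MPI_Init(&argc, &argv);

    while (metric < 100.0) {           /* placeholder stopping condition */
        MPI_Comm children;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, n_children,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF,
                       &children, MPI_ERRCODES_IGNORE);

        /* Intercommunicator reduction: the single parent passes MPI_ROOT,
         * and each child calls MPI_Reduce(..., root = 0, parent_comm) on the
         * communicator it gets from MPI_Comm_get_parent(). */
        double dummy = 0.0;            /* sendbuf is ignored on the root side */
        MPI_Reduce(&dummy, &metric, 1, MPI_DOUBLE, MPI_SUM, MPI_ROOT, children);

        MPI_Comm_disconnect(&children);
        n_children *= 2;               /* placeholder rule for the next round */
    }

    MPI_Finalize();
    return 0;
}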
The MPI-2 standard includes process management functionality. It's described in detail in Chapter 5. I have not used it myself though, so perhaps someone else may weigh in with more practical hints.
I am learning MPI with C and I wrote a code based on the one presented in this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml.
In this code a vector containing 1e8 values is summed. However, I am observing that the run time gets longer when more processes are used. The code is given below:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which splits a vector and sends the pieces to other processes.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process; therefore it does not count while passing information.
Each process will calculate the partial sum of its vector values and send it back to the root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
if(my_id == root_process){
// Root process: Define vector size, how to split vector and send information to workers
vec_size = 1e8; // size of main vector
for(i = 0; i < vec_size; i++){
//vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector... Correct answer for total sum = 0.78539816339
vec1[i] = pow(i,2)+1.0; // defining main vector...
//printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if youwhish to print the main vector
}
rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.
// splitting and sending the values
for(an_id = 1; an_id < num_proc; an_id++){
if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
}
else{
num_2_send = rows_per_proc;
start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
}
ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending how much data goes to each worker.
ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
}
sum = 0;
for(an_id = 1; an_id < num_proc; an_id++){
ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
sum = sum + partial_sum;
}
printf("Total sum = %f.\n", sum);
}
else{
// Workers:define which operation will be carried out by each one
ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving how much data the worker must expect.
ierr = MPI_Recv(&vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving pieces of the main vector.
partial_sum = 0;
for(i=0; i < num_2_recv; i++){
//printf("Position %d from worker id %d: %d\n", i, my_id, vec2[i]); // uncomment if youwhish to print position, id and value of splitted vector
partial_sum = partial_sum + vec2[i];
}
printf("Partial sum of %d: %f\n",my_id, partial_sum);
ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
}
ierr = MPI_Finalize();
}
Obs.: Compile as
mpicc -o vector_sum vector_send.c -lm
and run as:
time mpirun -n x vector_sum
with x = 2 and 5. You will see that with x=5 it takes more time to run.
Did I do something wrong? I did not expect it to be slower, since the summation of each chunk is independent. Or is it a matter of how the program sends the information to each process? It seems to me that the loops that send the information to each process are responsible for this longer time.
As suggested by Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet), I modified the code to generate the vector pieces in each process instead of passing them from the root process. It worked! Now the elapsed time is smaller with more processes. I am posting the new code below:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process. Therefore it does not count while passing information.
Each process will generate and calculate the partial sum of the vector values and send it back to the root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
int vec_size, rows_per_proc, leftover, num_2_gen, start_point;
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
if(my_id == root_process){
vec_size = 1e8; // defining main vector size
rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.
// defining the number of data and position corresponding to main vector
for(an_id = 1; an_id < num_proc; an_id++){
if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
num_2_gen = rows_per_proc + leftover; // counting the amount of data to be generated.
start_point = (an_id - 1)*num_2_gen; // defining corresponding initial position in the main vector.
}
else{
num_2_gen = rows_per_proc;
start_point = (an_id - 1)*num_2_gen + leftover; // defining corresponding initial position in the main vector for other processes if there is leftover.
}
ierr = MPI_Send(&num_2_gen, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending how much data must be generated.
ierr = MPI_Send(&start_point, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the initial positions in the main vector.
}
sum = 0;
for(an_id = 1; an_id < num_proc; an_id++){
ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
sum = sum + partial_sum;
}
printf("Total sum = %f.\n", sum);
}
else{
ierr = MPI_Recv(&num_2_gen, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving how much data the worker must generate.
ierr = MPI_Recv(&start_point, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the initial positions.
// generate and sum vector pieces
partial_sum = 0;
for(i = start_point; i < start_point + num_2_gen; i++){
vec2[i] = pow(i,2)+1.0;
partial_sum = partial_sum + vec2[i];
}
printf("Partial sum of %d: %f\n",my_id, partial_sum);
ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
}
ierr = MPI_Finalize();
return 0;
}
In this new version, instead of passing the pieces of the main vector, only the information needed to generate those pieces in each process is passed.
The new code using MPI_Reduce() is faster and simpler than the previous one:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 0.
Process id 0 is the root process. However, it will also perform part of calculations.
Each process will generate and calculate the partial sum of the vector values. MPI_Reduce() will be used to calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process;
int vec_size, rows_per_proc, leftover, num_2_gen, start_point;
vec_size = 1e8; // defining the main vector size
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
rows_per_proc = vec_size/num_proc; // getting the number of elements for each process.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.
if(my_id == 0){
num_2_gen = rows_per_proc + leftover; // if there is leftover, it is calculated in process 0
start_point = my_id*num_2_gen; // the corresponding position on the main vector
}
else{
num_2_gen = rows_per_proc;
start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
}
partial_sum = 0;
for(i = start_point; i < start_point + num_2_gen; i++){
vec2[i] = pow(i,2) + 1.0; // defining vector values
partial_sum += vec2[i]; // calculating partial sum
}
printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);
MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum
if(my_id == root_process){
printf("Total sum is %f.\n", sum);
}
ierr = MPI_Finalize();
return 0;
}
If I have this code:
int main(void) {
int result=0;
int num[6] = {1, 2, 4, 3, 7, 1};
if (my_rank != 0) {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
} else {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
printf("result = %d\n", result);
}
}
the printed result is 1;
But if num[0] = 9, then the result is 9.
I read that to solve this problem I must define the variable num as an array.
I can't understand how the function MPI_Reduce works with MPI_MIN. Why, if num[0] is not equal to the smallest number, must I define the variable num as an array?
MPI_Reduce performs a reduction over the members of the communicator - not the members of the local array. sendbuf and recvbuf must both be of the same size.
I think the standard says it best:
Thus, all processes provide input buffers and output buffers of the same length, with elements of the same type. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence.
MPI does not get the minimum of all elements in the array, you have to do that manually.
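For example, here is a minimal sketch of doing it manually, assuming every rank holds the same six-element num array as in your snippet: reduce element-wise into a six-element buffer, then scan that buffer on the root for the smallest entry.
/* Sketch: element-wise MPI_MIN reduction into a buffer of the same length
 * as num, followed by a manual scan for the overall minimum on the root. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank;
    int num[6] = {1, 2, 4, 3, 7, 1};
    int result[6];                      /* recvbuf has the same length as sendbuf */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* every rank makes the same call; result is only meaningful on rank 0 */
    MPI_Reduce(num, result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        int min = result[0];
        for (int i = 1; i < 6; i++)     /* manual pass over the reduced array */
            if (result[i] < min)
                min = result[i];
        printf("overall minimum = %d\n", min);
    }

    MPI_Finalize();
    return 0;
}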
You can use MPI_MIN to obtain the min value among those passed via reduction.
Let's examine the function declaration:
int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)
Each process sends its value (or array of values) using the buffer sendbuf.
The process identified by the root id receives the buffers and combines them element-wise into the buffer recvbuf. The number of elements to receive from each of the other processes is specified in count, so recvbuf must be allocated with dimension sizeof(datatype)*count.
If each process has only one integer to send (count = 1), then recvbuf is also a single integer; if each process has two integers, then recvbuf is an array of two integers. See this nice post for further explanations and nice pictures.
Now it should be clear that your code is wrong: sendbuf and recvbuf must be of the same size, and there is no need for the condition if(myrank==0). recvbuf only has meaning for the root process, and sendbuf for the others.
In your example you can assign one or more elements of the array to each process and then compute the minimum value (if there are as many processes as values in the array) or the array of minimum values (if there are more values than processes).
Here is a working example that illustrates the usage of MPI_MIN, MPI_MAX and MPI_SUM (slightly modified from this), in the case of simple values (not arrays).
Each process does some work, depending on its rank, and sends the root process the time spent doing the work. The root process collects the times and outputs the min, max and average values of the times.
#include <stdio.h>
#include <mpi.h>
int myrank, numprocs;
/* just a function to waste some time */
float work()
{
float x, y = 0.0f; /* initialise the accumulator */
if (myrank%2) {
for (int i = 0; i < 100000000; ++i) {
x = i/0.001;
y += x;
}
} else {
for (int i = 0; i < 100000; ++i) {
x = i/0.001;
y += x;
}
}
return y;
}
int main(int argc, char **argv)
{
int node;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &node);
printf("Hello World from Node %d\n",node);
/*variables used for gathering timing statistics*/
double mytime,
maxtime,
mintime,
avgtime;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Barrier(MPI_COMM_WORLD); /*synchronize all processes*/
mytime = MPI_Wtime(); /*get time just before work section */
work();
mytime = MPI_Wtime() - mytime; /*get time just after work section*/
/*compute max, min, and average timing statistics*/
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE,MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0,MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);
/* plot the output */
if (myrank == 0) {
avgtime /= numprocs;
printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime,avgtime);
}
MPI_Finalize();
return 0;
}
If I run this on my OSX laptop, this is what I get:
urcaurca$ mpirun -n 4 ./a.out
Hello World from Node 3
Hello World from Node 0
Hello World from Node 2
Hello World from Node 1
Min: 0.000974 Max: 0.985291 Avg: 0.493081
I am new to MPI and I am trying to write a program that uses MPI_Scatter. I have 4 nodes (0, 1, 2, 3). Node 0 is the master, the others are slaves. The master asks the user for the number of elements of the array to send to the slaves. Then it creates an array of size number of elements * 4. Then every node prints its results.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define MASTER 0
int main(int argc, char **argv) {
int id, nproc, len, numberE, i, sizeArray;
int *arrayN=NULL;
int arrayNlocal[sizeArray];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (id == MASTER){
printf("Enter number of elements: ");
scanf("%d", &numberE);
sizeArray = numberE * 4;
arrayN = malloc(numberE * sizeof(int));
for (i = 0; i < sizeArray; i++){
arrayN[i] = i + 1;
}
}
MPI_Scatter(arrayN, numberE, MPI_INT, &arrayNlocal, numberE,MPI_INT, MPI_COMM_WORLD);
printf("Node %d has: ", id);
for (i = 0; i < numberE; i++){
printf("%d ",arrayNlocal[i]);
}
MPI_Finalize();
return 0;
}
And as error i get:
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 9278 RUNNING AT 192.168.100.100
EXIT CODE: 139
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
In arrayNlocal[sizeArray];, sizeArray is not initialized. The best way to go is to broadcast numberE to every process and allocate memory for arrayNlocal. Something like:
MPI_Bcast(&numberE, 1, MPI_INT, 0, MPI_COMM_WORLD);
arrayN is an array of size sizeArray = numberE * 4, so:
arrayN = malloc(sizeArray * sizeof(int));
MPI_Scatter() needs a pointer to the data to be sent on the root node, and a pointer to the receive buffer on each process of the communicator. Since arrayNlocal is an array:
MPI_Scatter(arrayN, numberE, MPI_INT, arrayNlocal, numberE,MPI_INT,MASTER, MPI_COMM_WORLD);
or alternatively:
MPI_Scatter(arrayN, numberE, MPI_INT, &arrayNlocal[0], numberE,MPI_INT,MASTER, MPI_COMM_WORLD);
id is not initialized in id == MASTER: it must be rank==MASTER.
As is, the prints at the end might occur in a mixed way between processes.
Try to compile your code using mpicc main.c -o main -Wall to enable all warnings: it can save you a few hours in the near future!
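For illustration, here is a hedged sketch of what the corrected program could look like once those pieces are put together. I am assuming the root allocates numberE * nproc elements rather than the hard-coded numberE * 4, and error checking is left out.
/* Sketch of the corrected program: rank MASTER reads numberE, broadcasts it,
 * every rank allocates its own buffers, and MPI_Scatter distributes the data. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MASTER 0

int main(int argc, char **argv)
{
    int rank, nproc, numberE, i;
    int *arrayN = NULL, *arrayNlocal = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == MASTER) {
        printf("Enter number of elements per node: ");
        scanf("%d", &numberE);
        arrayN = malloc(numberE * nproc * sizeof(int));   /* full array on root */
        for (i = 0; i < numberE * nproc; i++)
            arrayN[i] = i + 1;
    }

    /* everyone needs numberE before allocating the local buffer */
    MPI_Bcast(&numberE, 1, MPI_INT, MASTER, MPI_COMM_WORLD);
    arrayNlocal = malloc(numberE * sizeof(int));

    MPI_Scatter(arrayN, numberE, MPI_INT,
                arrayNlocal, numberE, MPI_INT, MASTER, MPI_COMM_WORLD);

    printf("Node %d has: ", rank);
    for (i = 0; i < numberE; i++)
        printf("%d ", arrayNlocal[i]);
    printf("\n");

    free(arrayNlocal);
    if (rank == MASTER)
        free(arrayN);
    MPI_Finalize();
    return 0;
}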
I have a 2D double precision array that is being manipulated in parallel by several processes. Each process manipulates a part of the array, and at the end of every iteration, I need to ensure that all the processes have the SAME copy of the 2D array.
Assuming an array of size 10*10 and 2 processes (or processors). Process 1 (P1) manipulates the first 5 rows of the 2D array (5*10=50 elements in total) and P2 manipulates the last 5 rows (50 elements total). At the end of each iteration, I need P1 to have (ITS OWN first 5 rows + P2's last 5 rows). P2 should have (P1's first 5 rows + its OWN last 5 rows). I hope the scenario is clear.
I am trying to broadcast using the code given below. But my program keeps exiting with this error: "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)".
I am already using a contiguous 2D memory allocator as pointed out here: MPI_Bcast a dynamic 2d array by Jonathan. But I am still getting the same error.
Can someone help me out?
My code:
double **grid, **oldgrid;
int gridsize; // size of grid
int rank, size; // rank of current process and no. of processes
int rowsforeachprocess, offset; // to keep track of rows that need to be handled by each process
/* allocation, MPI_Init, and lots of other stuff */
rowsforeachprocess = ceil((float)gridsize/size);
offset = rank*rowsforeachprocess;
/* Each process is handling "rowsforeachprocess" #rows.
* Lots of work done here
* Now I need to broadcast these rows to all other processes.
*/
for(i=0; i<gridsize; i++){
MPI_Bcast(&(oldgrid[i]), gridsize-2, MPI_DOUBLE, (i/rowsforeachprocess), MPI_COMM_WORLD);
}
Part 2: The code above is part of a parallel solver for the laplace equation using 1D decomposition and I did not want to use a Master-worker model. Will my code be easier if I use a Master-worker model?
The crash-causing problem here is a 2d-array pointer issue -- &(oldgrid[i]) is a pointer-to-a-pointer to doubles, not a pointer to doubles, and it points to the pointer to row i of your array, not to row i of your array. You want MPI_Bcast(&(oldgrid[i][0]),.. or MPI_Bcast(oldgrid[i],....
There's another way to do this, too, which only uses one expensive collective communication instead of one per row; if you need everyone to have a copy of the whole array, you can use MPI_Allgather to gather the data together and distribute it to everyone; or, in the general case where the processes don't have the same number of rows, MPI_Allgatherv. Instead of the loop over broadcasts, this would look a little like:
{
int *counts = malloc(size*sizeof(int));
int *displs = malloc(size*sizeof(int));
for (int i=0; i<size; i++) {
counts[i] = rowsforeachprocess*gridsize;
displs[i] = i*rowsforeachprocess*gridsize;
}
counts[size-1] = (gridsize-(size-1)*rowsforeachprocess)*gridsize;
MPI_Allgatherv(oldgrid[offset], mynumrows*gridsize, MPI_DOUBLE,
oldgrid[0], counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
free(counts);
free(displs);
}
where counts are the number of items sent by each task, and displs are the displacements.
But finally, are you sure that every process has to have a copy of the entire array? If you're just computing a laplacian, you probably just need neighboring rows, not the whole array.
This would look like:
int main(int argc, char**argv) {
double **oldgrid;
const int gridsize=10; // size of grid
int rank, size; // rank of current process and no. of processes
int rowsforeachprocess; // to keep track of rows that need to be handled by each process
int offset, mynumrows;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
rowsforeachprocess = (int)ceil((float)gridsize/size);
offset = rank*rowsforeachprocess;
mynumrows = rowsforeachprocess;
if (rank == size-1)
mynumrows = gridsize-offset;
malloc2ddouble(&oldgrid, mynumrows+2, gridsize);
for (int i=0; i<mynumrows+2; i++)
for (int j=0; j<gridsize; j++)
oldgrid[i][j] = rank;
/* exchange row data with neighbours */
int highneigh = rank+1;
if (rank == size-1) highneigh = 0;
int lowneigh = rank-1;
if (rank == 0) lowneigh = size-1;
/* send data to high neighbour and receive from low */
MPI_Sendrecv(oldgrid[mynumrows], gridsize, MPI_DOUBLE, highneigh, 1,
oldgrid[0], gridsize, MPI_DOUBLE, lowneigh, 1,
MPI_COMM_WORLD, &status);
/* send data to low neighbour and receive from high */
MPI_Sendrecv(oldgrid[1], gridsize, MPI_DOUBLE, lowneigh, 1,
oldgrid[mynumrows+1], gridsize, MPI_DOUBLE, highneigh, 1,
MPI_COMM_WORLD, &status);
for (int proc=0; proc<size; proc++) {
if (rank == proc) {
printf("Rank %d:\n", proc);
for (int i=0; i<mynumrows+2; i++) {
for (int j=0; j<gridsize; j++) {
printf("%f ", oldgrid[i][j]);
}
printf("\n");
}
printf("\n");
}
MPI_Barrier(MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
I just wonder how to convert the following OpenMP program to an MPI program
#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
int main (int argc, char *argv[])
{
int i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk) nowait
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
} /* end of parallel section */
return 0;
}
I have a similar program that I would like to run on a cluster and the program is using OpenMP.
Thanks!
UPDATE:
In the following toy code, I want to limit the parallel part within function f():
#include "mpi.h"
#include <stdio.h>
#include <string.h>
void f();
int main(int argc, char **argv)
{
printf("%s\n", "Start running!");
f();
printf("%s\n", "End running!");
return 0;
}
void f()
{
char idstr[32]; char buff[128];
int numprocs; int myid; int i;
MPI_Status stat;
printf("Entering function f().\n");
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
if(myid == 0)
{
printf("WE have %d processors\n", numprocs);
for(i=1;i<numprocs;i++)
{
sprintf(buff, "Hello %d", i);
MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD); }
for(i=1;i<numprocs;i++)
{
MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
printf("%s\n", buff);
}
}
else
{
MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
sprintf(idstr, " Processor %d ", myid);
strcat(buff, idstr);
strcat(buff, "reporting for duty\n");
MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
printf("Leaving function f().\n");
}
However, the output is not what I expected. The printf parts before and after the parallel part are executed by every process, not just the main process:
$ mpirun -np 3 ex2
Start running!
Entering function f().
Start running!
Entering function f().
Start running!
Entering function f().
WE have 3 processors
Hello 1 Processor 1 reporting for duty
Hello 2 Processor 2 reporting for duty
Leaving function f().
End running!
Leaving function f().
End running!
Leaving function f().
End running!
So it seems to me that the parallel part is not limited to the region between MPI_Init() and MPI_Finalize().
To answer your update:
When using MPI, the same program is run by each processor. In order to restrict a part of the code to a single processor, you will need to use a statement like:
if (rank == 0) { ...serial work... }
This will ensure that only one processor does the work inside this block.
You can see how this works in the example program you posted: inside f() there is the if(myid == 0) statement. That block of statements will only be executed by process 0; all other processes go straight to the else and receive their messages before sending them back.
With regard to MPI_Init and MPI_Finalize -- MPI_Init initialises the MPI environment. Once you have called this method you can use the other MPI methods like Send and Recv. Once you have finished using MPI methods, MPI_Finalize will free up the resources etc, but the program will keep running. For example, you could call MPI_Finalize before performing some I/O that was going to take a long time. These methods do not delimit the parallel portion of the code, merely where you can use other MPI calls.
Hope this helps.
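If it helps, here is a tiny sketch (not taken from your program, just the pattern): every rank executes all of main(), and only the rank-0 branch runs the "serial" part.
/* Sketch: every process runs all of main(); the if (rank == 0) test is what
 * restricts the "serial" work to a single process. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("Serial-looking work: only rank 0 of %d prints this.\n", size);

    printf("Parallel work: rank %d prints this.\n", rank);   /* every rank */

    MPI_Finalize();
    return 0;
}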
You just need to assign a portion of the arrays (a, b, c) to each process. Something like this:
#include <mpi.h>
#define N 1000
int main(int argc, char *argv[])
{
int i, myrank, myfirstindex, mylastindex, procnum;
float a[N], b[N], c[N];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &procnum);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
/* Dynamic assignment of chunks,
* depending on number of processes
*/
if (myrank == 0)
myfirstindex = 0;
else if (myrank < N % procnum)
myfirstindex = myrank * (N / procnum + 1);
else
myfirstindex = N % procnum + myrank * (N / procnum);
if (myrank == procnum - 1)
mylastindex = N; // exclusive upper bound, so the last element is included in the loops below
else if (myrank < N % procnum)
mylastindex = myfirstindex + N / procnum + 1;
else
mylastindex = myfirstindex + N / procnum;
// Initializations
for(i = myfirstindex; i < mylastindex; i++)
a[i] = b[i] = i * 1.0;
// Computations
for(i = myfirstindex; i < mylastindex; i++)
c[i] = a[i] + b[i];
MPI_Finalize();
}
You can try to use the proprietary Intel Cluster OpenMP. It will run OpenMP programs on a cluster.
Yes, it simulates a shared-memory computer on distributed-memory clusters using "software distributed shared memory": http://en.wikipedia.org/wiki/Distributed_shared_memory
It is easy to use and is included in the Intel C++ Compiler (9.1+), but it works only on 64-bit processors.