How to handle MPI sendcount of zero

How to handle MPI sendcount of zero - c

What is the correct way to handle a sendcount = 0 when using MPI_Gatherv (or any other function that requires a sendcount) when setting up the displs argument?
I have data that needs to be received by all processors, but all processors might not have any data to send themselves. As an MWE, I tried (on just two processors):
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
int main(void)
{
int ntasks;
int thistask;
int n = 0;
int i;
int totcounts = 0;
int *data;
int *rbuf;
int *rcnts;
int *displs;
int *master_data;
int *master_displs;
// Set up mpi
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
MPI_Comm_rank(MPI_COMM_WORLD, &thistask);
// Allocate memory for arrays needed by allgatherv
rbuf = calloc(ntasks, sizeof(int));
rcnts = calloc(ntasks, sizeof(int));
displs = calloc(ntasks, sizeof(int));
master_displs = calloc(ntasks, sizeof(int));
// Initialize the counts and displacement arrays
for(i = 0; i < ntasks; i++)
{
rcnts[i] = 1;
displs[i] = i;
}
// Allocate data on just one task, but not others
if(thistask == 1)
{
n = 3;
data = calloc(n, sizeof(int));
for(i = 0; i < n; i++)
{
data[i] = i;
}
}
// Get n so each other processor knows about what others are sending
MPI_Allgatherv(&n, 1, MPI_INT, rbuf, rcnts, displs, MPI_INT, MPI_COMM_WORLD);
// Now that we know how much data each processor is sending, we allocate the array
// to hold it all
for(i = 0; i < ntasks; i++)
{
totcounts += rbuf[i];
}
master_data = calloc(totcounts, sizeof(int));
// Get displs for master data
master_displs[0] = 0;
for(i = 1; i < ntasks; i++)
{
master_displs[i] = master_displs[i - 1] + rbuf[i - 1];
}
// Send each processor's data to all others
MPI_Allgatherv(&data, n, MPI_INT, master_data, rbuf, master_displs, MPI_INT, MPI_COMM_WORLD);
// Print it out to see if it worked
if(thistask == 0)
{
for(i = 0; i < totcounts; i++)
{
printf("master_data[%d] = %d\n", i, master_data[i]);
}
}
// Free
if(thistask == 1)
{
free(data);
}
free(rbuf);
free(rcnts);
free(displs);
free(master_displs);
free(master_data);
MPI_Finalize();
return 0;
}
The way that I've set up master_displs works when every processor has a non-zero n (that is, they have data to send). In this case, both entries will be zero. However, the results of this program are garbage. How would I set up the master_displs array to ensure that master_data holds the correct information (in this case, just master_data[i] = i, as received from task 1)?

Related

MPI Subarray Sending Error

I firstly initialize a 4x4 matrix and then try to send the first 2x2 block to the slave process by using MPI in C. However the slave process only receives the first row of the block, the second row is filled with random numbers from computer ram. I couldn't find what is missing. The code of the program is below :
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define SIZE 4
int main(int argc, char** argv)
{
int rank, nproc;
const int root = 0;
const int tag = 3;
int** table;
int* datas;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
datas = malloc(SIZE * SIZE * sizeof(int));
table = malloc(SIZE * sizeof(int*));
for (int i = 0; i < SIZE; i++)
table[i] = &(datas[i * SIZE]);
for (int i = 0; i < SIZE; i++)
for (int k = 0; k < SIZE; k++)
table[i][k] = 0;
table[0][1] = 1;
table[0][2] = 2;
table[1][0] = 3;
table[2][3] = 2;
table[3][1] = 3;
table[3][2] = 4;
if (rank == root){
MPI_Datatype newtype;
int sizes[2] = { 4, 4 }; // size of table
int subsizes[2] = { 2, 2 }; // size of sub-region
int starts[2] = { 0, 0 };
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &newtype);
MPI_Type_commit(&newtype);
MPI_Send(&(table[0][0]), 1, newtype, 1, tag, MPI_COMM_WORLD);
}
else{
int* local_datas = malloc(SIZE * SIZE * sizeof(int));
int** local = malloc(SIZE * sizeof(int*));
for (int i = 0; i < SIZE; i++)
local[i] = &(local_datas[i * SIZE]);
MPI_Recv(&(local[0][0]), 4, MPI_INT, root, tag, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
for (int i = 0; i < 2; i++){
for (int k = 0; k < 2; k++)
printf("%3d ", local[i][k]);
printf("\n");
}
}
MPI_Finalize();
return 0;
}

You have instructed the receive operation to put four integer values consecutively in memory and therefore the 2x2 block is converted to a 1x4 row upon receive (since local is 4x4). The second row of local contains random values since the memory is never initialised.
You should either make use of MPI_Type_create_subarray in both the sender and the receiver in order to place the received data in a 2x2 block or redefine local to be a 2x2 matrix instead of 4x4.

MPI dynamically allocation arrays

I have a problem with a dynamically allocation of arrays.
This code, if I use a static allocation, runs without problem...
int main (int argc, char *argv[]){
int size, rank;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int lowerBound = 0, upperBound = 0, dimArrayTemp, x, z;
int dimBulk = size - 1, nPart, cnt;
FILE *pf;
pf = fopen("in.txt","r");
int array_value = fscanf(pf,"%d",&array_value);
float ins_array_value;
float *arrayTemp, *bulkSum,*s,*a;
arrayTemp =(float*)malloc(array_value*sizeof(float));
bulkSum = (float*)malloc(array_value*sizeof(float));
s =(float*) malloc(array_value*sizeof(float));
a =(float*) malloc(array_value*sizeof(float));
int j=0;
while(!feof(pf)){
fscanf(pf,"%f",&ins_array_value);
a[j] = ins_array_value;
j++;
}
fclose(pf);
float presum, valFinal;
if(size <= array_value){
if (rank == MASTER){
nPart = array_value/size;
int countPair;
if((array_value % size) != 0){
countPair = 0;
}
for (int i = 0; i < size; i++){
if(i == 0){
lowerBound = upperBound;
upperBound += nPart - 1;
}
else{
lowerBound += nPart;
upperBound += nPart;
if(countPair == 0 && i == size - 1)
upperBound = array_value - 1;
}
dimArrayTemp = upperBound - lowerBound;
//float arrayTemp[dimArrayTemp];
for( x = lowerBound, z = 0; x <= upperBound; x++, z++){
arrayTemp[z] = a[x];
}
if (i > 0){
//send array size
MPI_Send(&z,1,MPI_INT,i,0,MPI_COMM_WORLD);
//send value array
MPI_Send(arrayTemp,z,MPI_INT,i,1,MPI_COMM_WORLD);
}
else{
for (int h = 1;h <= dimArrayTemp; h++)
arrayTemp[h] = arrayTemp[h-1] + arrayTemp[h];
bulkSum[0] = arrayTemp[dimArrayTemp];
for (int h = 0; h <= dimArrayTemp; h++)
s[h] = arrayTemp[h];
}
}
}
else{
//recieve array size
MPI_Recv(&z,1,MPI_INT,0,0,MPI_COMM_WORLD, &status);
MPI_Recv(arrayTemp,z,MPI_INT,0,1,MPI_COMM_WORLD,&status);
for(int h = 1; h < z; h++){
arrayTemp[h] = arrayTemp[h-1] + arrayTemp[h];
presum = arrayTemp[h];
}
MPI_Send(&presum,1,MPI_INT,0,1,MPI_COMM_WORLD);
}
//MPI_Barrier(MPI_COMM_WORLD);
if (rank == MASTER){
for (int i = 1; i<size;i++){
MPI_Recv(&presum,1,MPI_INT,i,1,MPI_COMM_WORLD,&status);
bulkSum[i] = presum;
}
for (int i = 0; i<=dimBulk; i++){
bulkSum[i] = bulkSum[i-1] +bulkSum[i];
}
for(int i = 0; i<dimBulk;i++){
valFinal = bulkSum[i];
cnt = i+1;
MPI_Send(&valFinal,1,MPI_INT,cnt,1,MPI_COMM_WORLD);
}
}
else{
MPI_Recv(&valFinal,1,MPI_INT,0,1,MPI_COMM_WORLD,&status);
for(int i = 0; i<z;i++){
arrayTemp[i] = arrayTemp[i] + valFinal;
}
MPI_Send(arrayTemp,z,MPI_INT,0,1,MPI_COMM_WORLD);
}
if(rank == MASTER){
for(int i =1;i<size;i++){
MPI_Recv(arrayTemp,z,MPI_INT,i,1,MPI_COMM_WORLD,&status);
for(int v=0, w =dimArrayTemp+1 ;v<z;v++, w++){
s[w] = arrayTemp[v];
}
dimArrayTemp += z;
}
int count = 0;
for(int c = 0;c<array_value;c++){
printf("s[%d] = %f \n",count++,s[c]);
}
}
}
else{
printf("ERROR!!!\t number of procs (%d) is higher than array size(%d)!\n", size, array_value);
//fflush(stdout);
MPI_Finalize();
}
free(arrayTemp);
free(s);
free(a);
free(bulkSum);
MPI_Finalize();
return 0;
}
This is a specific declaration of arrays:
float *arrayTemp, *bulkSum,*s,*a;
arrayTemp =(float*)malloc(array_value*sizeof(float));
bulkSum = (float*)malloc(array_value*sizeof(float));
s =(float*) malloc(array_value*sizeof(float));
a =(float*) malloc(array_value*sizeof(float));
Any ideas?
EDIT:
I'm deleted reference for arrays in MPI_Send(); and MPI_Recv(); and condition master, the same error is occurred: a process exited on signal 6 (Aborted).

This is a very common rookie mistake. One often sees MPI tutorials where variables are passed by address to MPI calls, e.g. MPI_Send(&a, ...);. The address-of operator (&) is used to get the address of the variable and that address is passed to MPI as a buffer area for the operation. While & returns the address of the actual data storage for scalar variables and arrays, when applied to pointers it returns the address where the address pointed to is stored.
The simplest solution is to stick to the following rule: never use & with arrays or dynamically allocated memory, e.g.:
int a;
MPI_Send(&a, ...);
but
int a[10];
MPI_Send(a, ...);
and
int *a = malloc(10 * sizeof(int));
MPI_Send(a, ...);
Also, as noted by #talonmies, you are only allocating the arrays in the master process. You should remove the conditional surrounding the allocation calls.

making arrays static prevents the arrays to be modified in other functions thus preventing errors. If making arrays static causes no errors then let it be static and use them by call by reference or it might be a good idea to make the arrays global as a try.

mpi c mpirun noticed that process rank 0 exited on signal 11

I have an MPI program that is solving the "metric traveling salesman problem".
When I run it on windows, it works as expected, and prints the shortest possible path.
when i run it on linux, i get a message saying that mpirun noticed that process rank 0 exited on signal 11.
When I searched this problem on StackOverflow, I saw that it often occurs when sending wrong arguments to MPI's send/receive functions, but I went over my code, and the arguments seems fine.
How else can I check my error?
If it helps, here's the two code files:
main.c :
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// forward declaration of tsp_main
int tsp_main(int citiesNum, int xCoord[], int yCoord[], int shortestPath[]);
int main(int argc, char** argv)
{
//int citiesNum = 18; //set a lower number for testing
int citiesNum = 10;
int xCoord[] = {1, 12, 13, 5, 5, 10, 5, 6, 7, 8, 9, 4, 11, 14, 4,8,4,6};
int yCoord[] = {7, 2, 3, 3, 5, 6, 7, 8, 9, 4, 11, 12, 13, 14, 5,1,7,33};
int* shortestPath = (int*)malloc(citiesNum * sizeof(int));
int i, myRank, minPathLen;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
clock_t begin = clock();
minPathLen = tsp_main(citiesNum, xCoord, yCoord, shortestPath);
clock_t end = clock();
if (myRank == 0)
{
printf("Execution time: %g seconds\n", (double)(end - begin) / CLOCKS_PER_SEC);
printf("The shortest path, %d long, is:\n", minPathLen);
for (i = 0; i < citiesNum; i++)
{
// print the city (and its distance from the next city in the path)
printf("%d (%d) ", shortestPath[i],
abs(xCoord[shortestPath[i]] - xCoord[shortestPath[(i + 1) % citiesNum]]) +
abs(yCoord[shortestPath[i]] - yCoord[shortestPath[(i + 1) % citiesNum]]) );
}
printf("%d\n", shortestPath[0]);
}
MPI_Finalize ();
return 0;
}
and tsp.c :
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <assert.h>
#include <string.h>
#include <time.h>
//play with this value to change the workload.
#define WORK_LOAD (10)
#define getDistance(a,b) (DistanceArray[a][b])
enum defines {
MASTER_ID = 0, // master ID. set to 0 (better safe than sorry)
DISTRIBUTE_NEW_TASK, // when the master send a new task to a worker
ASK_FOR_TASK, // when a worker asks from the master a new task
KILL, // when the master notifies the worker to die
SEND_MINIMUM, // when a process sends its current minimum
SEND_PATH, // when a worker updates the master of his best path
};
// initializes the factorials array. the i cell will contain i!. for example, factorials[4] will contain 24.
void initializeFactorials(long long int* factorials, int citiesNum) {
int i;
factorials[0] = 1;
for (i = 1; i < citiesNum; ++i) {
factorials[i] = i * factorials[i-1];
}
}
// initializes the two dimensional distance array. Element k,l will contain the distance between city k and city l
void initializeDistanceArray(int** DistanceArray, int citiesNum, int xCoord[], int yCoord[]) {
int k,l;
for (k=0; k < citiesNum; ++k) {
DistanceArray[k] = (int*)malloc(sizeof(int)*citiesNum);
}
for (k=0; k < citiesNum; ++k) {
for (l=0; l < citiesNum; ++l) {
DistanceArray[k][l] = abs(xCoord[k] - xCoord[l]) + abs(yCoord[k] - yCoord[l]);
}
}
}
/* initializes the edge minimum array. Element i contains the minimum weight of i+1 edges between different cities.
For example, element 0 contains the minimal edge. Element 5 contains the total weight of 6 edges going out from different cities*/
void initializeEdgeMinimum(int* edgeMinimum, int** DistanceArray, int citiesNum) {
int k, l, sum=0;
for (k=0; k < citiesNum; ++k) {
edgeMinimum[k] = INT_MAX;
for (l=0; l < citiesNum; ++l) {
if (l == k) continue;
if (getDistance(k,l) < edgeMinimum[k]) edgeMinimum[k] = getDistance(k,l);
}
}
for (k=0; k < citiesNum-1; ++k) {
for (l=k+1; l < citiesNum; ++l) {
if (edgeMinimum[l]>edgeMinimum[k]){
int temp = edgeMinimum[k];
edgeMinimum[k] = edgeMinimum[l];
edgeMinimum[l] = temp;
}
}
}
for (k=citiesNum-1; k >= 0; --k) {
sum += edgeMinimum[k];
edgeMinimum[k] = sum;
}
}
/* takes an index of a path as an argument, and converts it according to the decision tree (as explained in the external documentation)
to a path (circle) between cities*/
void convertIndexToPath(long long int index, int citiesNum, int* decisionTree, int* options, int* path, long long int* factorials) {
int i, j, decision;
long long int fact;
for(i = 0; i < citiesNum; ++i) {
fact = factorials[citiesNum-(i+1)];
decision = (int)(index/fact);
decisionTree[i] = decision;
index -= fact*((long long int)decision);
}
for(i = 0; i < citiesNum; ++i) {
options[i] = i+1;
}
path[0] = 0;
for(i = 1; i < citiesNum; ++i) {
path[i] = options[decisionTree[i]];
for(j = decisionTree[i]; j < citiesNum-i-1; ++j) {
options[j] = options[j+1];
}
}
}
/* takes the decision tree of the last path (as explained in the external documentation) and converts it to a path
ASSUMPTION: can be used ONLY if the last path was NOT pruned */
void convertdDecisionToPath(long long int index, int citiesNum, int* decisionTree, int* options, int* path, long long int* factorials) {
int i, j;
for(i = citiesNum-2; i > 0; --i) {
decisionTree[i] = (decisionTree[i] + 1) % (citiesNum - i);
if (decisionTree[i] != 0) break;
}
for(i = 0; i < citiesNum; ++i) {
options[i] = i+1;
}
path[0] = 0;
for(i = 1; i < citiesNum; ++i) {
path[i] = options[decisionTree[i]];
for(j = decisionTree[i]; j < citiesNum-i-1; ++j) {
options[j] = options[j+1];
}
}
}
// returns one index before the next iteration of the index of the path that we need to explore right after pruning.
long long int getIndexAfterPrune(long long int index, int citiesVisitedInPath, int citiesNum, int* decisionTree, long long int* factorials) {
int decision, i;
long long int fact, nextIndex = 0, indexBackup = index;
for(i = 0; i < citiesNum; ++i) {
fact = factorials[citiesNum-(i+1)];
decision = (int)(index/fact);
decisionTree[i] = decision;
index -= fact*((long long int)decision);
}
for(i = citiesVisitedInPath + 1; i < citiesNum; ++i) {
decisionTree[i] = 0;
}
for(i = 0; i < citiesNum; ++i) {
nextIndex += (decisionTree[i] * factorials[citiesNum-1-i]);
}
nextIndex += factorials[citiesNum - citiesVisitedInPath];
return nextIndex-1;
}
// returns how many possibilities (paths) there are in a single chunk
long long int getChunkSize(int citiesNum, int workersNum, long long int* factorials) {
--citiesNum;
long long int allPossibilities = factorials[citiesNum];
// empirically setting the chunk size
long long int chunkSize = WORK_LOAD*(allPossibilities/factorials[citiesNum/2]) / (workersNum);
if (citiesNum <= 3 || chunkSize == 0) { //the job is small, and one worker can handle it
return allPossibilities;
}
return chunkSize;
}
// returns the number of chunks that we need to handle
long long int getNumberOfChunks(int citiesNum, long long int chunkSize, long long int* factorials) {
--citiesNum;
long long int allPossibilities = factorials[citiesNum];
int lastChunk = 0;
if (allPossibilities % chunkSize != 0) {
lastChunk = 1;
}
return (allPossibilities/chunkSize) + lastChunk;
}
// returns how many workers should work on the task.
int getNeededWorkers(long long int numberOfChunks, int workersNum) {
if (workersNum >= numberOfChunks) {
return (int)numberOfChunks;
}
return workersNum;
}
/*
Splits the problem into many sub problems and sends tasks to the workers.
each task contains the start index (the stop index is simply calculated from the chunk size) and the optimal price known so far.
The master also listens for updates of the optimal price.
when all the workers finish, the masters send them a request to update him with their optimal solution, and then decides what's the global optimum.
conventions: variables are in camelCase, consts are in ALL_CAPS, and two dimensional arrays are in PascalCasing
*/
int runMaster(int citiesNum, int xCoord[], int yCoord[], int shortestPath[], int processesNum) {
// Variables
int doneWorkers = 0, neededWorkers, gotAnswer, junk, currentMinimum = INT_MAX;
long long int chunkSize, indexToCheck = 0, bestPathIndex, numberOfChunks;
MPI_Status status1, status2, status3;
MPI_Request request1 = MPI_REQUEST_NULL, junkRequest = MPI_REQUEST_NULL;
// Arrays
int *decisionTree, *options;
long long int *factorials, *recieveBuffer, *sendBuffer;
// Dynamic Allocations
decisionTree = (int*)malloc(citiesNum * sizeof(int));
options = (int*)malloc(citiesNum * sizeof(int));
factorials = (long long int*)malloc(citiesNum * sizeof(long long int));
recieveBuffer = (long long int*)malloc(2 * sizeof(long long int));
sendBuffer = (long long int*)malloc(2 * sizeof(long long int));
initializeFactorials(factorials, citiesNum);
long long int lastIndex = factorials[citiesNum-1]-1;
chunkSize = getChunkSize(citiesNum, processesNum-1, factorials);
numberOfChunks = getNumberOfChunks(citiesNum, chunkSize, factorials);
neededWorkers = getNeededWorkers(numberOfChunks, processesNum-1);
while (doneWorkers < neededWorkers) {
//check if a worker wants a new task
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status1);
gotAnswer = 1;
MPI_Iprobe(MPI_ANY_SOURCE, ASK_FOR_TASK, MPI_COMM_WORLD, &gotAnswer, &status1);
if (gotAnswer) {
MPI_Recv(&junk, 0, MPI_INT, MPI_ANY_SOURCE, ASK_FOR_TASK, MPI_COMM_WORLD, &status1); //blocking recieve since we need the request to complete so we'd know who should get the task
if (indexToCheck <= lastIndex) {
// the master sends the current minimum, and a new job to the worker
sendBuffer[0] = currentMinimum;
sendBuffer[1] = indexToCheck;
indexToCheck += chunkSize;
MPI_Irsend(sendBuffer, 2, MPI_LONG_LONG_INT, status1.MPI_SOURCE, DISTRIBUTE_NEW_TASK, MPI_COMM_WORLD, &request1); // we're guaranteed that the worker called IRecv and ready to get a task. no need to block since we can continue doing are own calculations.
MPI_Request_free(&request1);
} else { // the master kills the worker
MPI_Irsend(&junk, 0, MPI_INT, status1.MPI_SOURCE, KILL, MPI_COMM_WORLD, &junkRequest); // we're guaranteed that the worker called IRecv and ready to get a task. no need to block since we can continue doing are own calculations.
MPI_Request_free(&junkRequest);
}
continue;
}
gotAnswer = 1;
MPI_Iprobe(MPI_ANY_SOURCE, SEND_MINIMUM, MPI_COMM_WORLD, &gotAnswer, &status2);
if(gotAnswer) { // the master recieves a miminal price from one of the workers and decides if it's the global minimum
MPI_Recv(recieveBuffer, 1, MPI_LONG_LONG_INT, MPI_ANY_SOURCE, SEND_MINIMUM, MPI_COMM_WORLD, &status2); // blocking, since we're going to use the recieve buffer
currentMinimum = (currentMinimum <= recieveBuffer[0]) ? currentMinimum : (int)recieveBuffer[0];
continue;
}
gotAnswer = 1;
MPI_Iprobe(MPI_ANY_SOURCE, SEND_PATH, MPI_COMM_WORLD, &gotAnswer, &status3);
if(gotAnswer) {
// the master recieves a miminal path and price from one of the workers and decides if it's the global minimum
MPI_Recv(recieveBuffer, 2, MPI_LONG_LONG_INT, MPI_ANY_SOURCE, SEND_PATH, MPI_COMM_WORLD, &status3); // blocking, since we're going to use the recieve buffer
++doneWorkers;
if(recieveBuffer[0] <= currentMinimum){
currentMinimum = (int)recieveBuffer[0];
bestPathIndex = recieveBuffer[1];
}
}
} //while
free(factorials); free(decisionTree); free(options); free(recieveBuffer); free(sendBuffer);
convertIndexToPath(bestPathIndex, citiesNum, decisionTree, options, shortestPath, factorials);
return currentMinimum;
}
/*
gets tasks from the master and process them until there are no more tasks to handle.
in each task, we go through all the possibilities in the current chunk, but skip paths that are heavier from the current known optimal path.
when we discover a new optimal path, we update the rest of the threads if necesssary (first, we check if we got a new optimal weight from them).
conventions: variables are in camelCase, consts are in ALL_CAPS, and two dimensional arrays are in PascalCasing
*/
void runWorker(int citiesNum, int xCoord[], int yCoord[], int shortestPath[], int processesNum) {
//Variables
int sum = 0, gotAnswer = 1, PRUNE_FACTOR = citiesNum - 3, LAST_CITY = citiesNum-1, indexReachedInPath, pruned, sumUntilPruned = 0, IndexUntilPruned = 0, doneWorkers = 0, neededWorkers, junk, myCurrentMinimum = INT_MAX, othersCurrentMinimum = INT_MAX, pid, k;
long long int numberOfChunks, chunkSize, indexToCheck = 0, startIndex, stopIndex, i, bestPathIndex = -1, lastIndex;
MPI_Status status1, status2;
MPI_Request request1 = MPI_REQUEST_NULL, junkRequest = MPI_REQUEST_NULL;
//Arrays
int *decisionTree, *edgeMinimum, *myCurrentPath, *options, **DistanceArray;
long long int *factorials, *recieveBuffer, *sendBuffer;
// Dynamic Allocations
decisionTree = (int*)malloc(citiesNum * sizeof(int));
edgeMinimum = (int*)malloc(sizeof(int)*citiesNum);
myCurrentPath = (int*)malloc(citiesNum * sizeof(int));
options = (int*)malloc(citiesNum * sizeof(int));
factorials = (long long int*)malloc(citiesNum * sizeof(long long int));
recieveBuffer = (long long int*)malloc(2 * sizeof(long long int));
sendBuffer = (long long int*)malloc(2 * sizeof(long long int));
DistanceArray = (int**)malloc(sizeof(int*)*citiesNum);
initializeFactorials(factorials, citiesNum);
initializeDistanceArray(DistanceArray, citiesNum, xCoord, yCoord);
initializeEdgeMinimum(edgeMinimum, DistanceArray, citiesNum);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
lastIndex = factorials[citiesNum-1]-1;
chunkSize = getChunkSize(citiesNum, processesNum-1, factorials);
numberOfChunks = getNumberOfChunks(citiesNum, chunkSize, factorials);
neededWorkers = getNeededWorkers(numberOfChunks, processesNum-1);
if (pid > neededWorkers){ //free memory and exit
for (k=0; k < citiesNum; ++k) {
free(DistanceArray[k]);
}
free(factorials); free(decisionTree); free(options); free(recieveBuffer); free(sendBuffer); free(edgeMinimum); free(DistanceArray); free(myCurrentPath);
return;
}
MPI_Irecv(recieveBuffer, 2, MPI_LONG_LONG_INT, MASTER_ID, MPI_ANY_TAG, MPI_COMM_WORLD, &request1); // getting ready to recieve a new task from the master. no need to block
MPI_Ssend(&junk, 0, MPI_INT, MASTER_ID, ASK_FOR_TASK, MPI_COMM_WORLD); // asking the master for a new task. synced & blocking, since we don't have anything else to do until we get a new task
while(1) {
MPI_Wait(&request1, &status1); //avoid busy-wait
if (status1.MPI_TAG == DISTRIBUTE_NEW_TASK) {
// the worker got a new job from the master. recieveBuffer[0] contains the server's currentMinimum. recieveBuffer[1] contains indexToCheck
othersCurrentMinimum = (recieveBuffer[0] < othersCurrentMinimum) ? (int)recieveBuffer[0] : othersCurrentMinimum;
startIndex = recieveBuffer[1];
stopIndex = (startIndex + chunkSize >= lastIndex) ? lastIndex + 1 : startIndex + chunkSize;
pruned = 1;
indexReachedInPath = 0;
sum = 0;
for(i = startIndex; i < stopIndex; ++i) {
if (pruned) { // calculate the current path from the index
convertIndexToPath(i, citiesNum, decisionTree, options, myCurrentPath, factorials);
} else { // calculate the current path from the last path (decision tree)
convertdDecisionToPath(i, citiesNum, decisionTree, options, myCurrentPath, factorials);
}
sum = 0;
indexReachedInPath = 0;
pruned = 0;
for(; indexReachedInPath < LAST_CITY; ++indexReachedInPath) {
sum += getDistance(myCurrentPath[indexReachedInPath], myCurrentPath[indexReachedInPath+1]);
if (indexReachedInPath < PRUNE_FACTOR && sum + edgeMinimum[indexReachedInPath] >= othersCurrentMinimum) {
//prune
pruned = 1;
sum -= getDistance(myCurrentPath[indexReachedInPath], myCurrentPath[indexReachedInPath+1]);
if (indexReachedInPath == 0) {
i = getIndexAfterPrune(i,1,citiesNum, decisionTree, factorials);
} else {
i = getIndexAfterPrune(i,indexReachedInPath,citiesNum, decisionTree, factorials);
}
break;
}
}
if(pruned) continue;
sum += getDistance(myCurrentPath[LAST_CITY], myCurrentPath[0]); //return from the last city to the first
if(sum < othersCurrentMinimum) {
myCurrentMinimum = sum;
bestPathIndex = i;
//check for a new global minimum
gotAnswer = 1;
MPI_Iprobe(MPI_ANY_SOURCE, SEND_MINIMUM, MPI_COMM_WORLD, &gotAnswer, &status2);
if(gotAnswer) {
MPI_Recv(recieveBuffer, 1, MPI_INT, MPI_ANY_TAG, SEND_MINIMUM, MPI_COMM_WORLD, &status2); // blocking, since we're going to use the recieve buffer
othersCurrentMinimum = (recieveBuffer[0] < othersCurrentMinimum) ? (int)recieveBuffer[0] : othersCurrentMinimum;
}
if (myCurrentMinimum < othersCurrentMinimum) {
othersCurrentMinimum = sum;
for (k = 0; k < processesNum; ++k) {
if (junk == pid) continue;
MPI_Issend(&myCurrentMinimum, 1, MPI_INT, k, SEND_MINIMUM, MPI_COMM_WORLD, &junkRequest); // sending everyone our minimum, copying it to their memory. obviously, no need to block.
}
}
}
}
//send my minimum to the master if it's the global minimum
//if (myCurrentMinimum <= othersCurrentMinimum) {
// MPI_Issend(&myCurrentMinimum, 1, MPI_INT, MASTER_ID, SEND_MINIMUM, MPI_COMM_WORLD, &junkRequest); // sending the master our minimum.
//}
// get a new task from the master
MPI_Irecv(recieveBuffer, 2, MPI_LONG_LONG_INT, MASTER_ID, MPI_ANY_TAG, MPI_COMM_WORLD, &request1); // getting ready to recieve a new task from the master. no need to block
MPI_Ssend(&junk, 0, MPI_INT, MASTER_ID, ASK_FOR_TASK, MPI_COMM_WORLD); // asking the master for a new task. blocking, since we don't have anything else to do until we get a new task
continue;
}
if(status1.MPI_TAG == KILL) { // free resources, send the master the optimal path and price, and die.
for (k=0; k < citiesNum; ++k) {
free(DistanceArray[k]);
}
free(factorials); free(decisionTree); free(options); free(recieveBuffer); free(edgeMinimum); free(DistanceArray); free(myCurrentPath);
sendBuffer[0] = myCurrentMinimum;
sendBuffer[1] = bestPathIndex;
MPI_Ssend(sendBuffer, 2, MPI_LONG_LONG_INT, MASTER_ID, SEND_PATH, MPI_COMM_WORLD); // synced & blocking, since we don't have anything else to do until the master gets the information
free(sendBuffer);
return;
}
} // while
}
// The static parellel algorithm main function. runs the master and the workers.
int tsp_main(int citiesNum, int xCoord[], int yCoord[], int shortestPath[])
{
int rank, processesNum, result = 0;
MPI_Comm_size(MPI_COMM_WORLD, &processesNum);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
result = runMaster(citiesNum, xCoord, yCoord, shortestPath, processesNum);
} else {
runWorker(citiesNum, xCoord, yCoord, shortestPath, processesNum);
}
MPI_Barrier(MPI_COMM_WORLD);
return result;
}

I tried your code and i managed to get rid of the error.
I received a signal : "floating point exception : integer divide by zero". I searched where the exception occured and found that it came from the first /fact. It was thrown from proc 0, so i went to runMaster(). There is a line after the free(fact). I permuted these lines and the error disappeared.
This way may be the right one :
convertIndexToPath(bestPathIndex, citiesNum, decisionTree, options, shortestPath, factorials);
free(factorials); free(decisionTree); free(options); free(recieveBuffer); free(sendBuffer);
However, i tried the program using 2 or 3 processus and the outputs were different...I am surprised that it worked before !
Bye, Francis

MPI_Scatterv segfault

I'm just starting out with MPI programming and decided to make a simple distributed qsort using OpenMPI. To distribute parts of the array I want to sort I'm trying to use MPI_Scatterv, however the following code segfaults on me:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#define ARRAY_SIZE 26
#define BUFFER_SIZE 2048
int main(int argc, char** argv) {
int my_rank, nr_procs;
int* data_in, *data_out;
int* sizes;
int* offsets;
srand(time(0));
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nr_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// everybody generates the control tables
int nr_workers = nr_procs-1;
sizes = malloc(sizeof(int)*nr_workers);
offsets = malloc(sizeof(int)*nr_workers);
int nr_elems = ARRAY_SIZE/nr_workers;
// basic distribution
for (int i = 0; i < nr_workers; ++i) {
sizes[i] = nr_elems;
}
// distribute the remainder
int left = ARRAY_SIZE%nr_workers;
int curr_worker = 0;
while (left) {
++sizes[curr_worker];
curr_worker = (++curr_worker)%nr_workers;
--left;
}
// offsets
int curr_offset = 0;
for (int i = 0; i < nr_workers; ++i) {
offsets[i] = curr_offset;
curr_offset += sizes[i];
}
if (my_rank == 0) {
// root
data_in = malloc(sizeof(int)*ARRAY_SIZE);
data_out = malloc(sizeof(int)*ARRAY_SIZE);
for (int i = 0; i < ARRAY_SIZE; ++i) {
data_in[i] = rand();
}
for (int i = 0; i < nr_workers; ++i) {
printf("%d at %d\n", sizes[i], offsets[i]);
}
MPI_Scatterv (data_in, sizes, offsets, MPI_INT, data_out, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
} else {
// worker
printf("%d has %d elements!\n",my_rank, sizes[my_rank-1]);
// alloc the input buffer
data_in = malloc(sizeof(int)*sizes[my_rank-1]);
MPI_Scatterv(NULL, NULL, NULL, MPI_INT, data_in, sizes[my_rank-1], MPI_INT, 0, MPI_COMM_WORLD);
printf("%d got:\n", my_rank);
for (int i = 0; i < sizes[my_rank-1]; ++i) {
printf("%d ", data_in[i]);
}
printf("\n");
}
MPI_Finalize();
return 0;
}
How would I go about using Scatterv? Am I doing something wrong with allocating my input buffer from inside the worker code?

I changed some part in your code to get something working.
MPI_Scatter() will send data to every processors, including himself. According to your program, processor 0 expects ARRAY_SIZE integers, but sizes[0] is much smaller.
There are other problems on other processus : MPI_Scatter will send sizes[my_rank] integers, but sizes[my_rank-1] will be expected...
Here is a code that scatters data_in from 0 to all processors, including 0. Therefore i added 1 to nr_workers :
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#define ARRAY_SIZE 26
#define BUFFER_SIZE 2048
int main(int argc, char** argv) {
int my_rank, nr_procs;
int* data_in, *data_out;
int* sizes;
int* offsets;
srand(time(0));
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nr_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// everybody generates the control tables
int nr_workers = nr_procs;
sizes = malloc(sizeof(int)*nr_workers);
offsets = malloc(sizeof(int)*nr_workers);
int nr_elems = ARRAY_SIZE/nr_workers;
// basic distribution
for (int i = 0; i < nr_workers; ++i) {
sizes[i] = nr_elems;
}
// distribute the remainder
int left = ARRAY_SIZE%nr_workers;
int curr_worker = 0;
while (left) {
++sizes[curr_worker];
curr_worker = (++curr_worker)%nr_workers;
--left;
}
// offsets
int curr_offset = 0;
for (int i = 0; i < nr_workers; ++i) {
offsets[i] = curr_offset;
curr_offset += sizes[i];
}
if (my_rank == 0) {
// root
data_in = malloc(sizeof(int)*ARRAY_SIZE);
for (int i = 0; i < ARRAY_SIZE; ++i) {
data_in[i] = rand();
printf("%d %d \n",i,data_in[i]);
}
for (int i = 0; i < nr_workers; ++i) {
printf("%d at %d\n", sizes[i], offsets[i]);
}
} else {
printf("%d has %d elements!\n",my_rank, sizes[my_rank]);
}
data_out = malloc(sizeof(int)*sizes[my_rank]);
MPI_Scatterv (data_in, sizes, offsets, MPI_INT, data_out, sizes[my_rank], MPI_INT, 0, MPI_COMM_WORLD);
printf("%d got:\n", my_rank);
for (int i = 0; i < sizes[my_rank]; ++i) {
printf("%d ", data_out[i]);
}
printf("\n");
free(data_out);
if(my_rank==0){
free(data_in);
}
MPI_Finalize();
return 0;
}
Regarding memory managment, data_in and data_out should be freed at the end of the code.
Is it what you wanted to do ? Good luck with qsort ! I think you are not the first one to sort integers using MPI. See parallel sort using mpi. Your way to generate random numbers on the 0 processus and then scatter them is the right way to go. I think you will be interrested by his TD_Trier() function for communication. Even if you change tri_fusion(T, 0, size - 1); for qsort(...)...
Bye,
Francis

Problem with MPI matrix-matrix multiply: Cluster slower than single computer

I code a small program using MPI to parallelize matrix-matrix multiplication. The problem is: When running the program on my computer, it takes about 10 seconds to complete, but about 75 seconds on a cluster. I think I have some synchronization problem, but I cannot figure it out (yet).
Here's my source code:
/*matrix.c
mpicc -o out matrix.c
mpirun -np 11 out
*/
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define N 1000
#define DATA_TAG 10
#define B_SENT_TAG 20
#define FINISH_TAG 30
int master(int);
int worker(int, int);
int main(int argc, char **argv) {
int myrank, p;
double s_time, f_time;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
if (myrank == 0) {
s_time = MPI_Wtime();
master(p);
f_time = MPI_Wtime();
printf("Complete in %1.2f seconds\n", f_time - s_time);
fflush(stdout);
}
else {
worker(myrank, p);
}
MPI_Finalize();
return 0;
}
int *read_matrix_row();
int *read_matrix_col();
int send_row(int *, int);
int recv_row(int *, int, MPI_Status *);
int send_tag(int, int);
int write_matrix(int *);
int master(int p) {
MPI_Status status;
int *a; int *b;
int *c = (int *)malloc(N * sizeof(int));
int i, j; int num_of_finish_row = 0;
while (1) {
for (i = 1; i < p; i++) {
a = read_matrix_row();
b = read_matrix_col();
send_row(a, i);
send_row(b, i);
//printf("Master - Send data to worker %d\n", i);fflush(stdout);
}
wait();
for (i = 1; i < N / (p - 1); i++) {
for (j = 1; j < p; j++) {
//printf("Master - Send next row to worker[%d]\n", j);fflush(stdout);
b = read_matrix_col();
send_row(b, j);
}
}
for (i = 1; i < p; i++) {
//printf("Master - Announce all row of B sent to worker[%d]\n", i);fflush(stdout);
send_tag(i, B_SENT_TAG);
}
//MPI_Barrier(MPI_COMM_WORLD);
for (i = 1; i < p; i++) {
recv_row(c, MPI_ANY_SOURCE, &status);
//printf("Master - Receive result\n");fflush(stdout);
num_of_finish_row++;
}
//printf("Master - Finish %d rows\n", num_of_finish_row);fflush(stdout);
if (num_of_finish_row >= N)
break;
}
//printf("Master - Finish multiply two matrix\n");fflush(stdout);
for (i = 1; i < p; i++) {
send_tag(i, FINISH_TAG);
}
//write_matrix(c);
return 0;
}
int worker(int myrank, int p) {
int *a = (int *)malloc(N * sizeof(int));
int *b = (int *)malloc(N * sizeof(int));
int *c = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
c[i] = 0;
}
MPI_Status status;
int next = (myrank == (p - 1)) ? 1 : myrank + 1;
int prev = (myrank == 1) ? p - 1 : myrank - 1;
while (1) {
recv_row(a, 0, &status);
if (status.MPI_TAG == FINISH_TAG)
break;
recv_row(b, 0, &status);
wait();
//printf("Worker[%d] - Receive data from master\n", myrank);fflush(stdout);
while (1) {
for (i = 1; i < p; i++) {
//printf("Worker[%d] - Start calculation\n", myrank);fflush(stdout);
calc(c, a, b);
//printf("Worker[%d] - Exchange data with %d, %d\n", myrank, next, prev);fflush(stdout);
exchange(b, next, prev);
}
//printf("Worker %d- Request for more B's row\n", myrank);fflush(stdout);
recv_row(b, 0, &status);
//printf("Worker %d - Receive tag %d\n", myrank, status.MPI_TAG);fflush(stdout);
if (status.MPI_TAG == B_SENT_TAG) {
break;
//printf("Worker[%d] - Finish calc one row\n", myrank);fflush(stdout);
}
}
//wait();
//printf("Worker %d - Send result\n", myrank);fflush(stdout);
send_row(c, 0);
for (i = 0; i < N; i++) {
c[i] = 0;
}
}
return 0;
}
int *read_matrix_row() {
int *row = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
row[i] = 1;
}
return row;
}
int *read_matrix_col() {
int *col = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
col[i] = 1;
}
return col;
}
int send_row(int *row, int dest) {
MPI_Send(row, N, MPI_INT, dest, DATA_TAG, MPI_COMM_WORLD);
return 0;
}
int recv_row(int *row, int src, MPI_Status *status) {
MPI_Recv(row, N, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, status);
return 0;
}
int wait() {
MPI_Barrier(MPI_COMM_WORLD);
return 0;
}
int calc(int *c_row, int *a_row, int *b_row) {
int i;
for (i = 0; i < N; i++) {
c_row[i] = c_row[i] + a_row[i] * b_row[i];
//printf("%d ", c_row[i]);
}
//printf("\n");fflush(stdout);
return 0;
}
int exchange(int *row, int next, int prev) {
MPI_Request request; MPI_Status status;
MPI_Isend(row, N, MPI_INT, next, DATA_TAG, MPI_COMM_WORLD, &request);
MPI_Irecv(row, N, MPI_INT, prev, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
return 0;
}
int send_tag(int worker, int tag) {
MPI_Send(0, 0, MPI_INT, worker, tag, MPI_COMM_WORLD);
return 0;
}
int write_matrix(int *matrix) {
int i;
for (i = 0; i < N; i++) {
printf("%d ", matrix[i]);
}
printf("\n");
fflush(stdout);
return 0;
}

Well, you have a fairly small matrix (N=1000), and secondly you distribute your algorithm on a row/column basis rather than blocked.
For a more realistic version using better algorithms, you might want to acquire an optimized BLAS library (e.g. GOTO is free), test single-thread performance with that one, then get PBLAS and link it against your optimized BLAS, and compare MPI parallel performance using the PBLAS version.

I see some errors in your program:
First, why are you calling the wait function since its implementation is simply calling MPI_Barrier. MPI_Barrier is a primitive synchronization that blocks all threads until they reach the "barrier" by calling MPI_Barrier. My question is: do you want the master to be synchronized with the workers? In this context, that would not be optimal because a worker doesn't need to wait for the master to begin its calculation.
Second, there are 2 unnecessary for loops.
for (i = 1; i < N / (p - 1); i++) {
for (j = 1; j < p; j++) {
b = read_matrix_col();
send_row(b, j);
}
}
for (i = 1; i < p; i++) {
send_tag(i, B_SENT_TAG);
}
In the first i-loop, you don't use the variable in your statement. Since the j-loop and the second i-loop are the same, you could do:
for (i = 0; i < p; i++) {
b = read_matrix_col();
send_row(b, j);
send_tag(i, B_SENT_TAG);
}
In terms of data transfer, your program is not optimized because you are sending an array of 1000 integers of data for each data transfer. There should be a better way to optimise the data transfer, but I will let you look at it. So make the corrections I told you and tell us what is your new performance.
And as #janneb said, you can use the BLAS library for better performance for matrix multiplication. Good luck!

I did not look over your code, but I can provide some hints about why your result may not unexpected:
As already mentioned, N=1000 may be too small. You should make more tests to see the scalability of your program (try setting N=100, 500, 1000, 5000, 10000, etc.) and compare results on both your system and the cluster.
Compare results between your system (one processor I presume) and a single processor on the cluster. Usually in production environments like servers or clusters a single processor is less powerful than the best processors designed for desktop use, but they provide stability, reliability and other features useful for environments which run 24h/day at full capacity.
If your processor has multiple cores, more than one MPI processes may run at the same time and synchronization between them is negligible compared to the synchronization between nodes in a cluster.
Are the nodes from the cluster statically assigned to you? Maybe other users' programs can be scheduled on the nodes you are running at the same time as you.
Read documentation about the cluster's architecture. Some architectures may be more suitable for particular classes of problems.
Assess latency of the network of the cluster. Ping-ing from each node to another many times and computing the mean value may give a rough estimate.
Last but perhaps the most important, your algorithm may not be optimal. Read a/some books on matrix multiplication (I can recommend "Matrix Computations", Golub and Van Loan).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

How to handle MPI sendcount of zero - c

Related

MPI Subarray Sending Error

MPI dynamically allocation arrays

mpi c mpirun noticed that process rank 0 exited on signal 11

MPI_Scatterv segfault

Problem with MPI matrix-matrix multiply: Cluster slower than single computer

Categories

Resources