MPI master process convergence loop in C

I am trying to write an MPI program that simulates temperature flow through a grid until it reaches equilibrium. I have already written a serial version as well as parallel versions using OpenMP, Pthreads, and CUDA.
My goal is to parallelize a for loop that calculates updated temperature values for a one-dimensional array. The code for the parallel part is below (all other variables are initialized above):
int nproc, rank,chunksize,leftover,offset,source, tag1=3,tag2=2,tag3=1;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
chunksize = (boxes / (nproc-1));
leftover = (boxes % (nproc-1));
if(rank == 0){
//init dsv
for(int idx = 0; idx < boxes; idx++){
temps[idx] = newtemps[idx];
}
int stop = 0;
int iter = 0;
float max_tmp;
float min_tmp;
while(stop != 1){
offset = 0;
for (int dest=1; dest<nproc; dest++) {
int chunk = (dest <= leftover ? chunksize + 1 : chunksize);
MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
MPI_Send(&temps[offset], chunk, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
MPI_Send(&newtemps[offset], chunk, MPI_FLOAT, dest, tag3, MPI_COMM_WORLD);
printf("sent %d temps to process: %d\n",chunk, dest);
offset = offset + chunk;
}
for (int dest=1; dest<nproc; dest++) {
int chunk = (dest <= leftover ? chunksize + 1 : chunksize);
MPI_Recv(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&temps[offset], chunk, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD,&status);
MPI_Recv(&newtemps[offset], chunk, MPI_FLOAT, dest, tag3, MPI_COMM_WORLD,&status);
printf("received %d temps from process: %d\n",chunk, dest);
printf("status: %d\n",status.MPI_TAG);
}
max_tmp = -10000;
min_tmp = 10000;
for(int idx = 0; idx < boxes; idx++){
temps[idx] = newtemps[idx];
if(newtemps[idx] > max_tmp){
max_tmp = newtemps[idx];
}
if(newtemps[idx] < min_tmp){
min_tmp = newtemps[idx];
}
}
stop = (max_tmp - min_tmp) <= (max_tmp * epsilon);
iter += 1;
}
}
if (rank > 0){
int chunk = (rank <= leftover ? chunksize + 1 : chunksize);
MPI_Recv(&offset, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&temps[offset], chunk, MPI_FLOAT, 0, tag2, MPI_COMM_WORLD,&status);
MPI_Recv(&newtemps[offset], chunk, MPI_FLOAT, 0, tag3, MPI_COMM_WORLD,&status);
printf("received %d temps from process: 0\n",chunk);
printf("status: %d\n",status.MPI_TAG);
for(int j = offset; j < offset+chunk; j++){
float weightedtmp = 0;
int perimeter = 0;
int num_iters = neighbors[j][0];
for(int i = 1; i <= num_iters; i++){
weightedtmp += temps[neighbors[j][i]] * mults[j][i];
perimeter += mults[j][i];
}
weightedtmp /= perimeter;
newtemps[j] = temps[j] + (weightedtmp - temps[j] ) * affect_rate;
}
printf("sent %d temps to process: 0\n",chunk);
MPI_Send(&offset, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD);
MPI_Send(&temps[offset], chunk, MPI_FLOAT, 0, tag2, MPI_COMM_WORLD);
MPI_Send(&newtemps[offset], chunk, MPI_FLOAT, 0, tag3, MPI_COMM_WORLD);
}
MPI_Finalize();
My program, however, successfully goes through the first iteration of the while loop, finds the max value (matching my serial version), and then sends the temps, newtemps, and offset variables to each process. At that point, though, it stalls and the worker processes never print that they received the message. The console output looks like this:
[radeymichael#owens-login04 ~]$ mpicc -o ci changeInput.c
[radeymichael#owens-login04 ~]$ mpirun -np 3 ./ci .1 .1
sent 101 temps to process: 1
sent 100 temps to process: 2
received 101 temps from process: 1
status: 1
received 101 temps from process: 0
status: 1
sent 101 temps to process: 0
received 100 temps from process: 0
status: 1
sent 100 temps to process: 0
received 100 temps from process: 2
status: 1
max: 900.000000
sent 101 temps to process: 1
sent 100 temps to process: 2
I have spent a lot of time trying to find the mistake, but I think I am lacking some fundamental MPI knowledge. If someone can help me find where my misunderstanding is, I would greatly appreciate it.

The problem is that rank 0 is inside a while loop and keeps sending data until stop == 1, while all the other processes reach MPI_Finalize after their last MPI_Send in the else part. One solution (as suggested in the comment by @Gilles) is to wrap the other ranks in a while loop based on stop as well, and have the root broadcast stop to all processes:
MPI_Bcast(&stop,1, MPI_INT, 0, MPI_COMM_WORLD);
See the full code below.
int nproc, rank,chunksize,leftover,offset,source, tag1=3,tag2=2,tag3=1;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
chunksize = (boxes / (nproc-1));
leftover = (boxes % (nproc-1));
int stop = 0;
if(rank == 0){
//init dsv
for(int idx = 0; idx < boxes; idx++){
temps[idx] = newtemps[idx];
}
int iter = 0;
float max_tmp;
float min_tmp;
while(stop != 1){
offset = 0;
for (int dest=1; dest<nproc; dest++) {
int chunk = (dest <= leftover ? chunksize + 1 : chunksize);
MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
MPI_Send(&temps[offset], chunk, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
MPI_Send(&newtemps[offset], chunk, MPI_FLOAT, dest, tag3, MPI_COMM_WORLD);
printf("sent %d temps to process: %d\n",chunk, dest);
offset = offset + chunk;
}
for (int dest=1; dest<nproc; dest++) {
int chunk = (dest <= leftover ? chunksize + 1 : chunksize);
MPI_Recv(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&temps[offset], chunk, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD,&status);
MPI_Recv(&newtemps[offset], chunk, MPI_FLOAT, dest, tag3, MPI_COMM_WORLD,&status);
printf("received %d temps from process: %d\n",chunk, dest);
printf("status: %d\n",status.MPI_TAG);
}
max_tmp = -10000;
min_tmp = 10000;
for(int idx = 0; idx < boxes; idx++){
temps[idx] = newtemps[idx];
if(newtemps[idx] > max_tmp){
max_tmp = newtemps[idx];
}
if(newtemps[idx] < min_tmp){
min_tmp = newtemps[idx];
}
}
stop = (max_tmp - min_tmp) <= (max_tmp * epsilon);
iter += 1;
MPI_Bcast(&stop,1, MPI_INT, 0, MPI_COMM_WORLD);
}
}
if (rank > 0){
while(stop != 1){
int chunk = (rank <= leftover ? chunksize + 1 : chunksize);
MPI_Recv(&offset, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&temps[offset], chunk, MPI_FLOAT, 0, tag2, MPI_COMM_WORLD,&status);
MPI_Recv(&newtemps[offset], chunk, MPI_FLOAT, 0, tag3, MPI_COMM_WORLD,&status);
printf("received %d temps from process: 0\n",chunk);
printf("status: %d\n",status.MPI_TAG);
for(int j = offset; j < offset+chunk; j++){
float weightedtmp = 0;
int perimeter = 0;
int num_iters = neighbors[j][0];
for(int i = 1; i <= num_iters; i++){
weightedtmp += temps[neighbors[j][i]] * mults[j][i];
perimeter += mults[j][i];
}
weightedtmp /= perimeter;
newtemps[j] = temps[j] + (weightedtmp - temps[j] ) * affect_rate;
}
printf("sent %d temps to process: 0\n",chunk);
MPI_Send(&offset, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD);
MPI_Send(&temps[offset], chunk, MPI_FLOAT, 0, tag2, MPI_COMM_WORLD);
MPI_Send(&newtemps[offset], chunk, MPI_FLOAT, 0, tag3, MPI_COMM_WORLD);
MPI_Bcast(&stop,1, MPI_INT, 0, MPI_COMM_WORLD);
}
}
MPI_Finalize();
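For reference, here is a self-contained sketch of the same pattern written with collectives (dummy data, a hypothetical BOXES size, and a stand-in update rule, so this is not the poster's actual program): every rank stays inside the loop, the global extremes are combined with MPI_Allreduce, and rank 0's stop decision is shared with MPI_Bcast, so no rank can run ahead to MPI_Finalize.
#include <mpi.h>
#include <stdio.h>

#define BOXES 200

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    float epsilon = 0.1f;                       /* dummy convergence threshold */
    float temps[BOXES];
    for (int i = 0; i < BOXES; i++)
        temps[i] = (float)(i % 10);             /* dummy initial temperatures */

    int stop = 0, iter = 0;
    while (!stop) {
        float local_max = -1e30f, local_min = 1e30f, max_tmp, min_tmp;

        /* each rank relaxes and inspects only its own strided share of the boxes */
        for (int i = rank; i < BOXES; i += nproc) {
            temps[i] = 0.9f * temps[i] + 1.0f;  /* stand-in for the real neighbour update */
            if (temps[i] > local_max) local_max = temps[i];
            if (temps[i] < local_min) local_min = temps[i];
        }

        /* every rank obtains the global extremes ... */
        MPI_Allreduce(&local_max, &max_tmp, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);
        MPI_Allreduce(&local_min, &min_tmp, 1, MPI_FLOAT, MPI_MIN, MPI_COMM_WORLD);

        /* ... and, mirroring the answer above, rank 0 decides and broadcasts the flag */
        if (rank == 0)
            stop = (max_tmp - min_tmp) <= (max_tmp * epsilon);
        MPI_Bcast(&stop, 1, MPI_INT, 0, MPI_COMM_WORLD);
        iter++;
    }
    if (rank == 0)
        printf("converged after %d iterations\n", iter);
    MPI_Finalize();
    return 0;
}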

Related

MPI : Sending and receiving dynamically allocated sub-matrix

I have a problem with sending a dynamically allocated sub-matrix to the workers. I can't understand how to do that correctly (and what exactly I should send).
Here is the sending part:
MPI_Send(&(a[offset][0]), rows * NCA, MPI_DOUBLE, dest, FROM_MASTER + 2, MPI_COMM_WORLD);
MPI_Send(&b, NCA * NCB, MPI_DOUBLE, dest, FROM_MASTER + 3, MPI_COMM_WORLD);
Here is the receiving part:
MPI_Recv(&(a[0][0]), rows * NCA, MPI_DOUBLE, MASTER, FROM_MASTER + 2, MPI_COMM_WORLD, &status);
MPI_Recv(&(b[0][0]), NCA * NCB, MPI_DOUBLE, MASTER, FROM_MASTER + 3, MPI_COMM_WORLD, &status);
Here is the error message:
[pop-os:29368] Read -1, expected 80000, errno = 14
[pop-os:29367] *** Process received signal ***
[pop-os:29367] Signal: Segmentation fault (11)
[pop-os:29367] Signal code: Address not mapped (1)
[pop-os:29367] Failing at address: 0x7fffc2ae8000
Here is the full code:
#include <cstdio>
#include <cstdlib>
#include "mpi.h"
#define NRA 100
/* number of rows in matrix A */
#define NCA 100
/* number of columns in matrix A */
#define NCB 100
/* number of columns in matrix B */
#define MASTER 0
/* taskid of first task */
#define FROM_MASTER 1 /* setting a message type */
#define FROM_WORKER 10 /* setting a message type */
double **alloc_2d_int(int rows, int cols) {
double **array= (double **)malloc(rows*sizeof(double*));
for (int i=0; i<rows; i++)
array[i] = (double *)malloc(rows*cols*sizeof(double));
return array;
}
int main(int argc, char *argv[]) {
int numtasks, taskid, numworkers, source, dest, rows,
/* rows of matrix A sent to each worker */
averow, extra, offset, i, j, k;
double **a = alloc_2d_int(NRA, NCA);
double **b = alloc_2d_int(NCA, NCB);
double **c = alloc_2d_int(NRA, NCB);
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
if (numtasks < 2) {
printf("Need at least two MPI tasks. Quitting...\n");
MPI_Abort(MPI_COMM_WORLD, -1);
exit(1);
}
numworkers = numtasks - 1;
if (taskid == MASTER) {
printf("mpi_mm has started with %d tasks (task1).\n", numtasks);
for (i = 0; i < NRA; i++)
for (j = 0; j < NCA; j++) a[i][j] = 10;
for (i = 0; i < NCA; i++)
for (j = 0; j < NCB; j++) b[i][j] = 10;
double t1 = MPI_Wtime();
averow = NRA / numworkers;
extra = NRA % numworkers;
offset = 0;
for (dest = 1; dest <= numworkers; dest++) {
rows = (dest <= extra) ? averow + 1 : averow;
printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
MPI_Send(&offset, 1, MPI_INT, dest, FROM_MASTER, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, FROM_MASTER + 1, MPI_COMM_WORLD);
MPI_Send(&(a[offset][0]), rows * NCA, MPI_DOUBLE, dest, FROM_MASTER + 2,
MPI_COMM_WORLD);
MPI_Send(&b, NCA * NCB, MPI_DOUBLE, dest, FROM_MASTER + 3,
MPI_COMM_WORLD);
offset = offset + rows;
}
/* Receive results from worker tasks */
for (source = 1; source <= numworkers; source++) {
MPI_Recv(&offset, 1, MPI_INT, source, FROM_WORKER, MPI_COMM_WORLD,
&status);
MPI_Recv(&rows, 1, MPI_INT, source, FROM_WORKER + 1, MPI_COMM_WORLD,
&status);
MPI_Recv(&(c[offset][0]), rows * NCB, MPI_DOUBLE, source, FROM_WORKER + 2,
MPI_COMM_WORLD, &status);
printf("Received results from task %d\n", source);
}
/* Print results */
/*
printf("****\n");
printf("Result Matrix:\n");
for (i = 0; i < NRA; i++)
{
printf("\n");
for (j = 0; j < NCB; j++) printf("%6.2f ", c[i][j]);
}*/
printf("\n********\n");
printf("Done.\n");
t1 = MPI_Wtime() - t1;
printf("\nExecution time: %.2f\n", t1);
}
/******** worker task *****************/
else { /* if (taskid > MASTER) */
MPI_Recv(&offset, 1, MPI_INT, MASTER, FROM_MASTER, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, FROM_MASTER + 1, MPI_COMM_WORLD,
&status);
MPI_Recv(&(a[0][0]), rows * NCA, MPI_DOUBLE, MASTER, FROM_MASTER + 2,
MPI_COMM_WORLD, &status);
MPI_Recv(&(b[0][0]), NCA * NCB, MPI_DOUBLE, MASTER, FROM_MASTER + 3, MPI_COMM_WORLD,
&status);
for (k = 0; k < NCB; k++)
for (i = 0; i < rows; i++) {
c[i][k] = 0.0;
for (j = 0; j < NCA; j++) c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
MPI_Send(&offset, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, FROM_WORKER + 1, MPI_COMM_WORLD);
MPI_Send(&c, rows * NCB, MPI_DOUBLE, MASTER, FROM_WORKER + 2,
MPI_COMM_WORLD);
}
for (i=0; i<NRA; i++)
free(a[i]);
free(a);
for (i=0; i<NCA; i++)
free(b[i]);
free(b);
for (i=0; i<NRA; i++)
free(c[i]);
free(c);
MPI_Finalize();
}
Solution: link to GitHub with correct code
The first parameter of MPI_Send and MPI_Recv is a pointer to the data buffer (const void * and void *, respectively). Since b and c are already double ** pointers, &b and &c pass the address of the pointer variable rather than the matrix data, so you need to change:
MPI_Send(&b, NCA * NCB, MPI_DOUBLE, dest, FROM_MASTER + 3, MPI_COMM_WORLD);
to
MPI_Send(b, NCA * NCB, MPI_DOUBLE, dest, FROM_MASTER + 3, MPI_COMM_WORLD);
and
MPI_Send(&c, rows * NCB, MPI_DOUBLE, MASTER, FROM_WORKER + 2, MPI_COMM_WORLD);
to
MPI_Send(c, rows * NCB, MPI_DOUBLE, MASTER, FROM_WORKER + 2, MPI_COMM_WORLD);
Another issue is that you are allocating an array of pointers:
double **alloc_2d_int(int rows, int cols) {
double **array= (double **)malloc(rows*sizeof(double*));
for (int i=0; i<rows; i++)
array[i] = (double *)malloc(rows*cols*sizeof(double));
return array;
}
But the data passed to MPI_Send and MPI_Recv is assumed to be contiguous in memory. To solve this, you can allocate a contiguous 2D array, represent the matrix as a flat 1D array, or create a custom MPI datatype, among other options.
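For example, one possible contiguous allocator looks like this (a sketch with hypothetical helper names; note, incidentally, that the original alloc_2d_int also over-allocates each row with rows*cols elements instead of cols):
#include <stdlib.h>

/* One single block holds all the values; the row pointers just index into it,
 * so &(a[offset][0]) really is a run of rows*NCA consecutive doubles. */
double **alloc_2d_double(int rows, int cols) {
    double *data = malloc((size_t)rows * cols * sizeof *data);
    double **array = malloc(rows * sizeof *array);
    for (int i = 0; i < rows; i++)
        array[i] = &data[(size_t)i * cols];      /* row i starts here */
    return array;
}

/* With this layout the per-row free loops must go: free the block once,
 * then the pointer array. */
void free_2d_double(double **array) {
    free(array[0]);
    free(array);
}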

MPI Vector multiplication

#include<stdio.h>
#include<mpi.h>
int main()
{
int a_r = 0, a_c = 0, v_s = 0, i = 0, rank = 0, size = 0;
int local_row = 0, partial_sum = 0, sum = 0, j = 0;
int my_first_ele = 0, my_last_ele = 0;
int a[10][10], v[10], partial_mul[10] = {0}, mul[10] = {0};
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if(rank == 0)
{
printf("Enter the row of array A: ");
scanf("%d", &a_r);
printf("Enter the column of array A: ");
scanf("%d", &a_c);
printf("Enter the array A: ");
for(i = 0; i < a_r; i++)
{
for(j = 0; j < a_c; j++)
scanf("%d", &a[i][j]);
}
printf("Enter the size of vector array: ");
scanf("%d", &v_s);
printf("Enter the vector array: ");
for(i = 0; i < v_s; i++)
{
scanf("%d", &v[i]);
}
MPI_Bcast(&a_r, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&a_c, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&v_s, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(a, a_r*a_c, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(v, v_s, MPI_INT, 0, MPI_COMM_WORLD);
local_row = a_r / size;
my_first_ele = rank * local_row;
my_last_ele = (rank+1) * local_row;
if(a_c == v_s)
{
for(i = my_first_ele; i < my_last_ele; i++)
{
for(j = 0; j < a_c; j++)
{
partial_mul[i] = partial_mul[i] + (a[i][j]*v[j]);
}
}
printf("\nPartial multiplication in Rank 0: \n");
for(i = my_first_ele; i < my_last_ele; i++)
printf("%d \n", partial_mul[i]);
MPI_Gather(partial_mul, local_row, MPI_INT, mul, local_row, MPI_INT, 0, MPI_COMM_WORLD);
printf("\n \nGlobal Multiplication: \n");
for(i = 0; i < a_r; i++)
{
printf("%d \n", mul[i]);
}
}
else
printf("\nCan't multiply. \n");
}
else
{
MPI_Bcast(&a_r, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&a_c, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&v_s, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(a, a_r*a_c, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(v, v_s, MPI_INT, 0, MPI_COMM_WORLD);
local_row = a_r / size;
my_first_ele = rank * local_row;
my_last_ele = (rank+1) * local_row;
if(a_c == v_s)
{
for(i = my_first_ele; i < my_last_ele; i++)
{
for(j = 0; j < a_c; j++)
{
partial_mul[i] = partial_mul[i] + (a[i][j]*v[j]);
}
}
printf("\nPartial multiplication in Rank %d: \n", rank);
for(i = my_first_ele; i < my_last_ele; i++)
printf("%d \n", partial_mul[i]);
MPI_Gather(partial_mul, local_row, MPI_INT, mul, local_row, MPI_INT, 0, MPI_COMM_WORLD);
}
else
printf("\nCan't multiply. \n");
}
MPI_Finalize();
}
I have a problem with the above code. My partial multiplication values are correct, but in the overall multiplication I can only gather rank 0's elements; the rest of the values are printed as 0. What is the problem? Can anyone explain?
Looking at your data layout, I think you misunderstand data structures in MPI: all data is kept separate in each rank; there is no implicit sharing or distribution. Your vector partial_mul is separate on each rank, each with the full 10 elements. So assuming size=2, a_r=10 and zero initialization, after the computation the contents will look like this:
rank 0: {x0,x1,x2,x3,x4,0,0,0,0,0}
rank 1: {0,0,0,0,0,x5,x6,x7,x8,x9}
Where x is the correct computed value. Gather will then collect the first local_row=5 elements from each rank, resulting in {x0,x1,x2,x3,x4,0,0,0,0,0}.
You could just fix this by adding the correct offset:
MPI_Gather(&partial_mul[my_first_ele], local_row, MPI_INT, mul, local_row, MPI_INT, 0, MPI_COMM_WORLD);
But please don't do that. Instead, reconsider your data structures so that the data is really distributed, reserving only the correct local size for each part of the vector/array on every rank. To send parts of the data to each rank, use MPI_Scatter (the opposite of MPI_Gather). The most difficult part is getting the matrix distribution right; this is explained in detail by this excellent answer.
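To illustrate the suggestion, here is a minimal self-contained sketch where each rank stores only its own local_row rows of the matrix and local_row partial results (hypothetical sizes A_R and A_C, and it assumes A_R is divisible by the number of ranks):
#include <mpi.h>
#include <stdio.h>

#define A_R 8
#define A_C 4

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int a[A_R][A_C], v[A_C], mul[A_R];
    if (rank == 0) {                       /* only the root fills the full data */
        for (int i = 0; i < A_R; i++)
            for (int j = 0; j < A_C; j++)
                a[i][j] = i + j;
        for (int j = 0; j < A_C; j++)
            v[j] = 1;
    }

    int local_row = A_R / size;            /* assumes A_R % size == 0 */
    int local_a[local_row][A_C];           /* only this rank's rows */
    int local_mul[local_row];

    MPI_Bcast(v, A_C, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(a, local_row * A_C, MPI_INT,
                local_a, local_row * A_C, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < local_row; i++) {  /* local matrix-vector product */
        local_mul[i] = 0;
        for (int j = 0; j < A_C; j++)
            local_mul[i] += local_a[i][j] * v[j];
    }

    MPI_Gather(local_mul, local_row, MPI_INT,
               mul, local_row, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < A_R; i++)
            printf("mul[%d] = %d\n", i, mul[i]);

    MPI_Finalize();
    return 0;
}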

Program stops at MPI_Send

The program stops working when I execute it with more than one process.
It stops at the first MPI_Send.
What am I doing wrong?
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SIZE 200000
#define SIZE2 256
#define VYVOD 1
int main(int argc, char *argv[])
{
int NX, NT;
double TK, UM, DX, DY, DT;
double starttime, endtime;
int numnode, rank, delta=0, ierr, NXnode;
double **U;
double **U1;
double *sosed1;
double *sosed2;
int i, j, k;
MPI_Status stats;
NX = 1*(SIZE2+1);
TK = 20.00;
UM = 10.0;
DX = 0.1;
DY = DX;
DT = 0.1;
NT = (TK/DT);
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numnode);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
if(rank == 0)
printf("\nTotal nodes: %d\n", numnode);
NX = NX - 2;
NXnode = (NX-(NX%numnode))/numnode;
if (rank < (NX%numnode))
{
delta = rank * NXnode + rank + 1;
NXnode++;
}
else
{
delta = rank * NXnode + (NX%numnode) + 1;
}
if(rank == 0){
printf("Order counting complete, NXnode = %d\n", NXnode);
}
U = (double**)malloc(NXnode*sizeof(double*));
U1 = (double**)malloc(NXnode*sizeof(double*));
sosed1 = (double*)malloc(SIZE*sizeof(double));
sosed2 = (double*)malloc(SIZE*sizeof(double));
for (i=0; i < NXnode; i++)
{
U[i] = (double*)malloc(SIZE*sizeof(double));
U[i][0]=0;
U[i][SIZE-1]=0;
U1[i] = (double*)malloc(SIZE*sizeof(double));
U1[i][0]=0;
U1[i][SIZE-1]=0;
if (U[i]==NULL || U1[i]==NULL)
{
printf("Error at memory allocation!");
return 1;
}
}
MPI_Barrier(MPI_COMM_WORLD);
if(rank == 0){
starttime = MPI_Wtime();
printf("Array allocation complete\n");
}
for (i = 0; i < NXnode; i++)
{
for (j = 1; j < SIZE-1; j++)
{
if ((delta)<=(NXnode/2))
{
U1[i][j]=2*(UM/NXnode)*(delta+i);
}
else
{
U1[i][j]=-2*(UM/NXnode) + 2*UM;
}
}
}
printf("Array init 1 complete, rank %d\n", rank);
MPI_Barrier(MPI_COMM_WORLD);
if (rank > 0)
{
MPI_Send(&(U1[0][0]), SIZE, MPI_DOUBLE , rank-1, 0, MPI_COMM_WORLD);
MPI_Recv(&(sosed1[0]), SIZE, MPI_DOUBLE , rank-1, 1, MPI_COMM_WORLD, &stats);
}
else
{
int initInd = 0;
for (initInd = 0; initInd < SIZE; initInd++)
{
sosed1[initInd]=0;
}
}
if (rank < (numnode-1))
{
MPI_Send(&(U1[NXnode-1][0]), SIZE, MPI_DOUBLE , rank+1, 1, MPI_COMM_WORLD);
MPI_Recv(&(sosed2[0]), SIZE, MPI_DOUBLE , rank+1, 0, MPI_COMM_WORLD, &stats);
}
else
{
int initInd = 0;
for (initInd = 0; initInd < SIZE; initInd++)
{
sosed2[initInd]=0;
}
}
printf("Send complete, rank %d\n", rank);
MPI_Barrier(MPI_COMM_WORLD);
printf("Array init complete, rank %d\n", rank);
for (k = 1; k <= NT; k++)
{
int cycle = 0;
for (cycle=1; cycle < SIZE-1; cycle++)
{
U[0][cycle] = U1[0][cycle] + DT/(DX*DX)*(U1[1][cycle]-2*U1[0][cycle]+sosed1[cycle])+DT/(DY*DY)*(U1[0][cycle+1]+U1[0][cycle-1]-(U1[0][cycle]*2));
}
for (i=1; i<NXnode-1; i++)
{
for(j=1; j<SIZE-1; j++)
{
U[i][j] = U1[i][j] + DT/(DX*DX)*(U1[i+1][j]-2*U1[i][j]+U[i-1][j])+DT/(DY*DY)*(U1[i][j+1]+U1[i][j-1]-(U1[i][j]*2));
}
}
for (cycle=1; cycle < SIZE-1; cycle++)
{
U[NXnode-1][cycle]=U1[NXnode-1][cycle]+DT/(DX*DX)*(sosed2[cycle]-2*U1[NXnode-1][cycle]+U1[NXnode-2][cycle])+DT/(DY*DY)*(U1[NXnode-1][cycle+1]+U1[NXnode-1][cycle-1]-(U1[NXnode-1][cycle]*2));
}
/*U[0] = U1[0]+DT/(DX*DX)*(U1[0+1]-2*U1[0]+sosed1);
for (j = 0; j<NXnode; j++)
{
U[j]=U1[j]+DT/(DX*DX)*(U1[j+1]-2*U1[j]+U1[j-1]);
}
U[NXnode-1]=U1[NXnode-1]+DT/(DX*DX)*(sosed2-2*U1[NXnode-1]+U1[(NXnode-1)-1]);*/
if (rank > 0)
{
MPI_Send(&(U[0][0]), SIZE, MPI_DOUBLE , rank-1, 0, MPI_COMM_WORLD);
}
if (rank < (numnode-1))
{
MPI_Send(&(U[NXnode-1][0]), SIZE, MPI_DOUBLE , rank+1, 0, MPI_COMM_WORLD);
}
if (rank > 0)
{
MPI_Recv(&(sosed1[0]), SIZE, MPI_DOUBLE , rank-1, 0, MPI_COMM_WORLD, &stats);
}
if (rank < (numnode-1))
{
MPI_Recv(&(sosed2[0]), SIZE, MPI_DOUBLE , rank+1, 0, MPI_COMM_WORLD, &stats);
}
for (i = 0; i<NXnode; i++)
{
for (j=0; j<SIZE; j++)
{
U1[i][j]=U[i][j];
}
}
}
MPI_Barrier(MPI_COMM_WORLD);
printf("Array count complete, rank %d\n", rank);
if (rank == 0)
{
endtime=MPI_Wtime();
printf("\n## TIME: %f\n", endtime-starttime);
}
MPI_Finalize();
}
UPDATE #1
I tried it like this, so that rank 0 would communicate first, but it still doesn't work:
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0 && numnode > 1)
{
MPI_Recv(&(sosed2[0]), SIZE, MPI_DOUBLE , rank+1, 0, MPI_COMM_WORLD, &stats);
MPI_Send(&(U1[NXnode-1][0]), SIZE, MPI_DOUBLE , rank+1, 1, MPI_COMM_WORLD);
int initInd = 0;
for (initInd = 0; initInd < SIZE; initInd++)
{
sosed1[initInd]=0;
}
}
else if (rank == 0)
{
int initInd = 0;
for (initInd = 0; initInd < SIZE; initInd++)
{
sosed2[initInd]=0;
sosed1[initInd]=0;
}
}
else if (rank < (numnode-1))
{
MPI_Send(&(U1[0][0]), SIZE, MPI_DOUBLE , rank-1, 1, MPI_COMM_WORLD);
MPI_Recv(&(sosed1[0]), SIZE, MPI_DOUBLE , rank-1, 0, MPI_COMM_WORLD, &stats);
MPI_Recv(&(sosed2[0]), SIZE, MPI_DOUBLE , rank+1, 0, MPI_COMM_WORLD, &stats);
MPI_Send(&(U1[NXnode-1][0]), SIZE, MPI_DOUBLE , rank+1, 1, MPI_COMM_WORLD);
}
else if (rank == (numnode - 1))
{
MPI_Send(&(U1[0][0]), SIZE, MPI_DOUBLE , rank-1, 1, MPI_COMM_WORLD);
MPI_Recv(&(sosed1[0]), SIZE, MPI_DOUBLE , rank-1, 0, MPI_COMM_WORLD, &stats);
int initInd = 0;
for (initInd = 0; initInd < SIZE; initInd++)
{
sosed2[initInd]=0;
}
}
UPDATE #2
Solved: I used the same tag for all Send/Recv calls.
MPI_Send blocks execution until the message can be delivered, which for messages of this size in practice means until the corresponding MPI_Recv has been posted in the destination process.
In your program, all processes except rank 0 call MPI_Send immediately after the first barrier, and no one is ready to receive the message, so MPI_Send blocks indefinitely. Essentially, every process is waiting for its message to be accepted by the process with the lower rank (rank 2 is waiting for rank 1, rank 1 is waiting for rank 0), and rank 0 is not accepting any messages at all (it goes on to the next block of code and in turn calls MPI_Send too), so everything just hangs.
It looks like you are missing the communication part for rank 0 (it should do something like MPI_Recv(from rank 1); ...; MPI_Send(to rank 1);).
Another thing is that you use MPI_Send with tag 1 but call MPI_Recv with tag 0. These will never match. You need to use the same tag, or specify MPI_ANY_TAG in the receive operation.
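When every rank needs to exchange data with its neighbours, a common way to avoid this kind of ordering deadlock altogether is MPI_Sendrecv, which posts the send and the matching receive together. Below is a minimal, self-contained sketch of such an exchange (a generic example, not the code from the question); MPI_PROC_NULL turns the boundary sends/receives into no-ops:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mine = (double)rank, from_left = -1.0, from_right = -1.0;
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* no neighbour -> no-op */
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send to the right, receive from the left, with one matching tag ... */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... and send to the left, receive from the right */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, left, 1,
                 &from_right, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left=%g right=%g\n", rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}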

MPI error: expected expression before ‘,’ token

I have a strange error when using MPI_Send: I get it when trying to send a portion of a two-dimensional array (matrix): "MPI_matrixMultiplication.c:68:99: error: expected expression before ‘,’ token".
The specific line is the one where I try to send a portion of the matrix: MPI_Send(&a[beginPosition][0], ...);
(As you can see, I have commented out the other send and receive calls related to the matrix.)
/////////////////////////////////////////////////////////
// multiplication of 2 matrices, parallelized using MPI //
/////////////////////////////////////////////////////////
#include <stdio.h>
#include <mpi.h>
// must use #define here, and not simply int blahblahblah, because "c" doesnt like ints for array dimension :(
#define matrixARowSize 3 // size of the row for matrix A
#define matrixAColumnSize 3 // size of the column for matrix A
#define matrixBRowSize 3 // size of the row for matrix B
#define matrixBColumnSize 3 // size of the column for matrix B
// tags used for sending/receiving data:
#define LOWER_BOUND 1 // first line to be processed
#define UPPER_BOUND 2 // last line to be processed
#define DATA // data to be processed
int a[matrixARowSize][matrixAColumnSize]; // matrix a
int b[matrixBRowSize][matrixBColumnSize]; // matrix b
int c[matrixARowSize][matrixBColumnSize]; // matrix c
int main()
{
int currentProcess; // current process
int worldSize; // world size
int i, j, k; // iterators
int rowsComputedPerProcess; // how many rows of the first matrix should be computed in each process
int numberOfSlaveProcesses; // the number of slave processes
int processesUsed; //how many processes of the available ones are actually used
MPI_Init(NULL, NULL); // MPI_Init()
MPI_Comm_size(MPI_COMM_WORLD, &worldSize); // get the world size
MPI_Comm_rank(MPI_COMM_WORLD, &currentProcess); // get current process
numberOfSlaveProcesses = worldSize - 1; // 0 is the master, rest are slaves
rowsComputedPerProcess = worldSize > matrixARowSize ? 1 : (matrixARowSize/numberOfSlaveProcesses);
processesUsed = worldSize > matrixARowSize ? matrixARowSize : numberOfSlaveProcesses;
/*
* in the first process (the father);
* initialize the 2 matrices, then start splitting the data to the slave processes
*/
if (!currentProcess) // in father process
{
printf("rows per process: %d\n", rowsComputedPerProcess);
printf("nr of processes used: %d\n", processesUsed);
// init matrix A
for(i = 0; i < matrixARowSize; ++i)
for(j = 0; j < matrixAColumnSize; ++j){
a[i][j] = i + j + 1;
// printf("%d\n", a[i][j]);
// printf("%d\n", *(a[i] + j));
}
// init matrix B
for(i = 0; i < matrixBRowSize; ++i)
for(j = 0; j < matrixBColumnSize; ++j)
b[i][j] = i + j + 1;
// start sending data to the slaves for them to work >:)
int beginPosition; // auxiliary values used for sending the offsets to slaves
int endPosition;
for(i = 1; i < processesUsed; ++i) // the last process is dealt with separately
{
beginPosition = (i - 1)*rowsComputedPerProcess;
endPosition = i*rowsComputedPerProcess;
MPI_Send(&beginPosition, 1, MPI_INT, i, LOWER_BOUND, MPI_COMM_WORLD);
MPI_Send(&endPosition, 1, MPI_INT, i, UPPER_BOUND, MPI_COMM_WORLD);
MPI_Send(&a[beginPosition][0], ((endPosition - beginPosition)*matrixARowSize), MPI_INT, i, DATA, MPI_COMM_WORLD);
// MPI_Send(a[beginPosition], (endPosition - beginPosition)*matrixARowSize, MPI_INT, i, DATA, MPI_COMM_WORLD);
// for(j = beginPosition; j < endPosition; ++j)
// for (k = 0; k < matrixAColumnSize; ++k)
// {
// printf("%d ", *(a[j] + k));
// }
// printf("\n");
// printf("beg: %d, end: %d\n", beginPosition, endPosition);
// printf(" data #%d\n", (endPosition - beginPosition)*matrixARowSize);
}
// deal with last process
beginPosition = (i - 1)*rowsComputedPerProcess;
endPosition = matrixARowSize;
MPI_Send(&beginPosition, 1, MPI_INT, i, LOWER_BOUND, MPI_COMM_WORLD);
MPI_Send(&endPosition, 1, MPI_INT, i, UPPER_BOUND, MPI_COMM_WORLD);
// MPI_Send(a[beginPosition], (endPosition - beginPosition)*matrixARowSize, MPI_INT, i, DATA, MPI_COMM_WORLD);
// printf("beg: %d, end: %d\n", beginPosition, endPosition);
// printf(" data #%d\n", (endPosition - beginPosition)*matrixARowSize);
}
else { // if this is a slave (rank > 0)
int beginPosition; // auxiliary values used for sending the offsets to slaves
int endPosition;
MPI_Recv(&beginPosition, 1, MPI_INT, 0, LOWER_BOUND, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&endPosition, 1, MPI_INT, 0, UPPER_BOUND, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// MPI_Recv(a[beginPosition], (endPosition - beginPosition)*matrixARowSize, 0, DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(i = beginPosition; i < endPosition; ++i) {
for (j = 0; j < matrixAColumnSize; ++j)
printf("(# %d, i=%d, j=%d: %d ", currentProcess, i, j, a[i][j]);
// printf("\n");
}
}
MPI_Finalize();
return 0; // bye-bye
}
Your DATA constant is empty.
#define DATA // data to be processed
So you're trying to do:
MPI_Send(&a[beginPosition][0], ((endPosition - beginPosition)*matrixARowSize), MPI_INT, i, , MPI_COMM_WORLD);
which naturally generates an expected expression before ‘,’ token error.
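A minimal fix is to give the macro an actual tag value (any integer not already used for the other tags), for example:
#define DATA 3 // data to be processed
As an aside, the count in that send arguably ought to be (endPosition - beginPosition)*matrixAColumnSize rather than *matrixARowSize; the two only coincide here because the matrix is square.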

Problem with multiplying matrices in parallel

I am having trouble using MPI to multiply matrices.
My program reads two n x n matrices from two files and is supposed to do the multiplication with MPI, but I am getting a segmentation fault in one of the processes. This is the output I get when I run my code:
read matrix A from matrixA
read matrix B from matrixB
mpirun noticed that process rank 1 with PID 15599 on node VirtualBox exited on signal 11 (Segmentation fault).
Here is my code:
int main (int argc, char * argv[])
{
/* Check the number of arguments */
int n; /* Dimension of the matrix */
float *sa, *sb, *sc; /* Storage for matrix A, B, and C */
float **a, **b, **c; /* 2D array to access matrix A, B, and C */
int i, j, k;
MPI_Init(&argc, &argv); //Initialize MPI operations
MPI_Comm_rank(MPI_COMM_WORLD, &rank); //Get the rank
MPI_Comm_size(MPI_COMM_WORLD, &size); //Get number of processes
if(argc != 4) {
printf("Usage: %s fileA fileB fileC\n", argv[0]);
return 1;
}
if(rank == 0)
{
/* Read matrix A */
printf("read matrix A from %s\n", argv[1]);
read_matrix(argv[1], &a, &sa, &i, &j);
if(i != j) {
printf("ERROR: matrix A not square\n"); return 2;
}
n = i;
//printf("%d", n);
/* Read matrix B */
printf("Read matrix B from %s\n", argv[2]);
read_matrix(argv[2], &b, &sb, &i, &j);
if(i != j) {
printf("ERROR: matrix B not square\n");
return 2;
}
if(n != i) {
printf("ERROR: matrix A and B incompatible\n");
return 2;
}
}
printf("test");
if(rank == 0)
{
/* Initialize matrix C */
sc = (float*)malloc(n*n*sizeof(float));
memset(sc, 0, n*n*sizeof(float));
c = (float**)malloc(n*sizeof(float*));
for(i=0; i<n; i++) c[i] = &sc[i*n];
}
////////////////////////////////////////////////////////////////////////////////////////////
float matrA[n][n];
float matrB[n][n];
float matrC[n][n];
for(i = 0; i < n; i++)
{
for(j = 0; j < n; j++)
{
matrA[i][j] = sa[(i*n) + j];
matrB[i][j] = sb[(i*n) + j];
}
}
/* Master initializes work*/
if (rank == 0)
{
start_time = MPI_Wtime();
for (i = 1; i < size; i++)
{
//For each slave other than the master
portion = (n / (size - 1)); // Calculate portion without master
low_bound = (i - 1) * portion;
if (((i + 1) == size) && ((n % (size - 1)) != 0))
{
//If rows of [A] cannot be equally divided among slaves,
upper_bound = n; //the last slave gets all the remaining rows.
}
else
{
upper_bound = low_bound + portion; //Rows of [A] are equally divisable among slaves
}
//Send the low bound first without blocking, to the intended slave.
MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &request);
//Next send the upper bound without blocking, to the intended slave
MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &request);
//Finally send the allocated row portion of [A] without blocking, to the intended slave
MPI_Isend(&matrA[low_bound][0], (upper_bound - low_bound) * n, MPI_FLOAT, i, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &request);
}
}
//broadcast [B] to all the slaves
MPI_Bcast(&matrB, n*n, MPI_FLOAT, 0, MPI_COMM_WORLD);
/* work done by slaves*/
if (rank > 0)
{
//receive low bound from the master
MPI_Recv(&low_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &status);
//next receive upper bound from the master
MPI_Recv(&upper_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &status);
//finally receive row portion of [A] to be processed from the master
MPI_Recv(&matrA[low_bound][0], (upper_bound - low_bound) * n, MPI_FLOAT, 0, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &status);
for (i = low_bound; i < upper_bound; i++)
{
//iterate through a given set of rows of [A]
for (j = 0; j < n; j++)
{
//iterate through columns of [B]
for (k = 0; k < n; k++)
{
//iterate through rows of [B]
matrC[i][j] += (matrA[i][k] * matrB[k][j]);
}
}
}
//send back the low bound first without blocking, to the master
MPI_Isend(&low_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &request);
//send the upper bound next without blocking, to the master
MPI_Isend(&upper_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &request);
//finally send the processed portion of data without blocking, to the master
MPI_Isend(&matrC[low_bound][0],
(upper_bound - low_bound) * n,
MPI_FLOAT,
0,
SLAVE_TO_MASTER_TAG + 2,
MPI_COMM_WORLD,
&request);
}
/* Master gathers processed work*/
if (rank == 0)
{
for (i = 1; i < size; i++)
{
// Until all slaves have handed back the processed data,
// receive low bound from a slave.
MPI_Recv(&low_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status);
//Receive upper bound from a slave
MPI_Recv(&upper_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &status);
//Receive processed data from a slave
MPI_Recv(&matrC[low_bound][0],
(upper_bound - low_bound) * n,
MPI_FLOAT,
i,
SLAVE_TO_MASTER_TAG + 2,
MPI_COMM_WORLD,
&status);
}
end_time = MPI_Wtime();
printf("\nRunning Time = %f\n\n", end_time - start_time);
}
MPI_Finalize(); //Finalize MPI operations
/* Do the multiplication */
//////////////////////////////////////////////////// matmul(a, b, c, n);
for(i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
sc[(i*n) + j] = matrC[i][j];
}
}
}
Every process declares the pointers to the matrices, namely:
float *sa, *sb, *sc; /* storage for matrix A, B, and C */
but only process 0 allocates and fills the arrays sa and sb:
if(rank == 0)
{
...
read_matrix(argv[1], &a, &sa, &i, &j);
...
read_matrix(argv[2], &b, &sb, &i, &j);
...
}
However, afterwards every process tries to access the elements of the sa and sb arrays:
for(i = 0; i < n; i++)
{
for(j = 0; j < n; j++)
{
matrA[i][j] = sa[(i*n) + j];
matrB[i][j] = sb[(i*n) + j];
}
}
Since only process 0 has allocated and filled the arrays sa and sb, the remaining processes are trying to access memory (sa[(i*n) + j] and sb[(i*n) + j]) that they never allocated, hence the segmentation fault. (The same applies to n, which is assigned only on rank 0 but is used by every rank to size the local arrays.)
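One possible fix, keeping the question's structure (a sketch, untested against the rest of the program): broadcast n first so every rank sizes its VLAs consistently, and unpack sa and sb only on rank 0, since the workers receive their rows of matrA via MPI_Recv and matrB via MPI_Bcast anyway.
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* n is read from the files only on rank 0 */

float matrA[n][n]; /* the VLAs are now sized consistently on every rank */
float matrB[n][n];
float matrC[n][n];

if (rank == 0) { /* sa and sb exist only on rank 0 */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            matrA[i][j] = sa[(i*n) + j];
            matrB[i][j] = sb[(i*n) + j];
        }
    }
}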
On a side note, there is another problem in your program: you initiate non-blocking sends with MPI_Isend but never wait on the completion of the returned request handles. MPI implementations are not even required to start a non-blocking send until it is progressed towards completion by a call to one of the wait or test operations (MPI_Wait, MPI_Waitsome, MPI_Waitall, and so on). Even worse, you reuse the same handle variable request, effectively losing the handles to all previously initiated requests, which makes them unwaitable/untestable. Use an array of requests instead and wait for all of them to finish with MPI_Waitall after the send loop.
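As a sketch of what that could look like for the master's send loop (reusing the question's variables; lows and ups are hypothetical per-worker arrays, needed so the integer send buffers stay valid until the wait completes):
MPI_Request reqs[3 * (size - 1)];
int lows[size], ups[size]; /* one bound pair per worker, kept alive until MPI_Waitall */
int r = 0;
for (i = 1; i < size; i++) {
    portion = n / (size - 1);
    lows[i] = (i - 1) * portion;
    ups[i]  = ((i + 1) == size) ? n : lows[i] + portion;
    MPI_Isend(&lows[i], 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &reqs[r++]);
    MPI_Isend(&ups[i], 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &reqs[r++]);
    MPI_Isend(&matrA[lows[i]][0], (ups[i] - lows[i]) * n, MPI_FLOAT, i,
              MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &reqs[r++]);
}
MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE); /* complete all sends before reusing the buffers */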
Also think about this: do you really need non-blocking operations to send data back from the workers?
