Distributed algorithm in C

I am a beginner in C and I have to create a distributed architecture with the MPI library. Here is the code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
int main(int argc, char **argv)
{
int N, w = 1, L = 2, M = 50; // with N number of threads
int T= 2;
int myid;
int buff;
float mit[N][T]; // I initialize a 2d array
for(int i = 0; i < N; ++i){
mit[i][0]= M / (float) N;
for (int j = 1; j < T; ++j){
mit[i][j] = 0;
}
}
float tab[T]; // 1d array
MPI_Status stat;
/*********************************************
start
*********************************************/
MPI_Init(&argc,&argv); // Initialisation
MPI_Comm_size(MPI_COMM_WORLD, &N);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for(int j = 0; j < T; j++) {
for(int i = 0; i < N; i++) { // I iterate for each slave
if (myid !=0) {
float y = ((float) rand()) / (float) RAND_MAX;
mit[i][j + 1] = mit[i][j]*(1 + w * L * y);
buff=mit[i][j+1];
MPI_Send(&buff, 128, MPI_INT, 0, 0, MPI_COMM_WORLD); // I send the variable buff to the master
buff=0;
}
if( myid == 0 ) { // Master
for(int i = 1; i < N; i++){
MPI_Recv(&buff, 128, MPI_INT, i, 0, MPI_COMM_WORLD, &stat);
tab[j] += buff; // I need to receive all the buff variables sent by the slaves, sum them, and store the result in tab at index j
}
printf("\n%.20f\n",tab[j]); // I print the result of the sum at index j
}
}
}
MPI_Finalize();
return 0;
}
To compile the program I use the command in the terminal: mpicc .c -o my_file
Then mpirun -np 101 my_file_c to start the program with 101 processes.
But the problem is that I get the following error in the terminal:
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
It seems that I have a problem with the master, but I don't know why.
Any ideas?
Thank you :)

This behavior is very likely the result of memory corruption.
You cannot do
int buff=mit[i][j+1];
MPI_Send(&buff, 128, MPI_INT, ...);
buff is a single int, yet MPI_Send() is told to read 128 ints starting at its address, so it reads far past the end of the variable. Depending on what you want to achieve, you can instead try
int buff=mit[i][j+1];
MPI_Send(&buff, 1, MPI_INT, ...);
// ...
MPI_Recv(&buff, 1, MPI_INT, ...);
or
int *buff=&mit[i][j+1];
MPI_Send(buff, 128, MPI_INT, ...);
// fix MPI_Recv()
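For reference, here is a minimal, self-contained sketch of the count-1 pattern the first option uses: every worker sends exactly one value per step and the master receives and sums them. The variable names (value, sum) are illustrative and not taken from the original program.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float value = (float) rank;              /* one value per worker */
    if (rank != 0) {
        /* count is 1: the buffer really holds exactly one element */
        MPI_Send(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    } else {
        float sum = 0.0f, recv;
        for (int src = 1; src < nprocs; ++src) {
            MPI_Recv(&recv, 1, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;
        }
        printf("sum = %f\n", sum);
    }

    MPI_Finalize();
    return 0;
}
Since the data in the question is float, MPI_FLOAT is used here; with MPI_INT the send and receive types simply have to match the element type of the buffer in the same way.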

Related

MPI_Get doesn't send the correct elements between the buffers of two processes

I am trying to create a program that will ultimately transpose a matrix in MPI so that it can be used in further computations. But right now I am trying to do a simple thing: the root process has a 4x4 matrix "A" which contains elements 0..15 in row-major order. This data is scattered to 2 processes so that each receives one half of the matrix. Process 0 has a 2x4 sub_matrix "a" and receives elements 0..7, and process 1 gets elements 8..15 in its sub_matrix "a".
My goal is for these processes to swap their "a" matrices with each other using MPI_Get. Since I was encountering problems, I decided to test a simpler version and simply make process 0 get process 1's "a" matrix; that way, both processes should have the same elements in their respective sub_matrices when I print after the MPI_Get call and the MPI_Win_fence.
Yet the output is erratic. I have tried to troubleshoot for several hours but haven't been able to crack it. I would appreciate your help with this.
The code is below. Run command: mpirun -n 2 ./get
Compile: mpicc -std=c99 -g -O3 -o get get.c -lm
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define NROWS 4
#define NCOLS 4
int allocate_matrix(int ***M, int ROWS, int COLS) {
int *p;
if (NULL == (p = malloc(ROWS * COLS * sizeof(int)))) {
perror("Couldn't allocate memory for input (p in allocate_matrix)");
return -1;
}
if (NULL == (*M = malloc(ROWS * sizeof(int*)))) {
perror("Couldn't allocate memory for input (M in allocate_matrix)");
return -1;
}
for(int i = 0; i < ROWS; i++) {
(*M)[i] = &(p[i * COLS]);
}
return 0;
}
int main(int argc, char *argv[])
{
int rank, nprocs, **A, **a, n_cols, n_rows, block_len;
MPI_Win win;
int errs = 0;
if(rank==0)
{
allocate_matrix(&A, NROWS, NCOLS);
for (int i=0; i<NROWS; i++)
for (int j=0; j<NCOLS; j++)
A[i][j] = i*NCOLS + j;
}
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
n_cols=NCOLS; //cols in a sub_matrix
n_rows=NROWS/nprocs; //rows in a sub_matrix
block_len = n_cols*n_rows;
allocate_matrix(&a, n_rows, n_cols);
for (int i = 0; i <n_rows; i++)
for (int j = 0; j < n_cols; j++)
a[i][j] = 0;
MPI_Datatype block_type;
MPI_Type_vector(n_rows, n_cols, n_cols, MPI_INTEGER, &block_type);
MPI_Type_commit(&block_type);
MPI_Scatter(*A, 1, block_type, &(a[0][0]), block_len, MPI_INTEGER, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
printf("process %d: \n", rank);
for (int j=0; j<n_rows; j++){
for (int i=0; i<n_cols; i++){
printf("%d ",a[j][i]);
}
printf("\n");
}
if (rank == 0)
{
printf("TESTING, before Get a[0][0] %d\n", a[0][0]);
MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
MPI_Get(*a, 8, MPI_INTEGER, 1, 0, 8, MPI_INTEGER, win);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
printf("TESTING, after Get a[0][0] %d\n", a[0][0]);
printf("process %d:\n", rank);
for (int j=0; j<n_rows; j++){
for (int i=0; i<n_cols; i++){
printf("%d ", a[j][i]);
}
printf("\n");
}
}
else
{ /* rank = 1 */
MPI_Win_create(a, n_rows*n_cols*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
}
MPI_Type_free(&block_type);
MPI_Win_free(&win);
MPI_Finalize();
return errs;
}
This is the output that I get:
process 0:
0 1 2 3
4 5 6 7
process 1:
8 9 10 11
12 13 14 15
process 0:
1552976336 22007 1552976352 22007
1552800144 22007 117 0
But what I want is that the second time I print the matrix from process 0, it has the same elements as process 1's matrix.
First, I doubt this is really the code you are testing. You are freeing some MPI type variables that are not defined and also rank is uninitialised in
if(rank==0)
{
allocate_matrix(&A, NROWS, NCOLS);
for (int i=0; i<NROWS; i++)
for (int j=0; j<NCOLS; j++)
A[i][j] = i*NCOLS + j;
}
and the code segfaults because A won't get allocated in the root.
Moving this block to after MPI_Comm_rank(), freeing the correct MPI type variable, and fixing the call to MPI_Win_create on rank 1:
MPI_Win_create(&a[0][0], n_rows*n_cols*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
// This -------^^^^^^^^
produces the result you are seeking.
I'd recommend sticking to a single notation for the beginning of the array, such as &a[0][0], instead of a mixture of *a and &a[0][0]. This will prevent (or at least reduce the occurrence of) similar errors in the future.
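To illustrate the whole one-sided pattern, here is a minimal, self-contained sketch (not the fixed original program): each rank exposes its own contiguous buffer in a window, and rank 0 reads rank 1's buffer between two fences. Buffer names and sizes are illustrative; run it with at least two ranks.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 8;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }   /* needs at least two ranks */

    int *buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++)
        buf[i] = rank * count + i;

    /* every rank exposes the start of its contiguous storage,
       i.e. the equivalent of &a[0][0], never the array of row pointers */
    MPI_Win_create(buf, count * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Get(buf, count, MPI_INT, 1, 0, count, MPI_INT, win);  /* read rank 1's buffer */
    MPI_Win_fence(0, win);

    if (rank == 0) {
        for (int i = 0; i < count; i++)
            printf("%d ", buf[i]);       /* prints 8 9 10 ... 15 */
        printf("\n");
    }

    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
The important detail is the same as above: the window base and the MPI_Get origin are always the start of the contiguous data, never the row-pointer array.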

MPI matrix multiplication

I'm trying to make an MPI matrix multiplication program, but the scatter function doesn't seem to be working for me. Only one row is getting scattered and the rest of the cores receive garbage values.
Also, calling the display_matrix() function before MPI_Init() seems to run 4 processes instead of 1 (I have a quad-core CPU). Why does this happen even before initialisation?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include<mpi.h>
int **matrix_generator(int row,int col);
int **multiply_matrices(int **matrix_A,int **matrix_B,int rowsA, int colsA,int rowsB,int colsB);
void display_matrix(int **matrixA,int rows,int cols);
void main(int argc,char *argv[])
{
srand(time(0));
int **matrix_A,**matrix_B,**matrix_result,*scattered_matrix,*gathered_matrix, rowsA,colsA,rowsB,colsB,world_rank,world_size,i,j;
rowsA = atoi(argv[1]);
colsA = atoi(argv[2]);
rowsB = atoi(argv[3]);
colsB = atoi(argv[4]);
scattered_matrix = (int *)malloc(sizeof(int) * rowsA*colsA/4);
if (argc != 5)
{
fprintf(stderr,"Usage: mpirun -np <No. of processors> ./a.out <Rows A> <Columns A> <Rows B> <Columns B>\n");
exit(-1);
}
else if(colsA != rowsB)
{
printf("Check the dimensions of the matrices!\n\n");
}
matrix_A = matrix_generator(rowsA,colsA);
matrix_B = matrix_generator(rowsB,colsB);
display_matrix(matrix_A,rowsA,colsA);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Scatter(matrix_A, rowsA*colsA/4, MPI_INT, scattered_matrix, rowsA*colsA/4, MPI_INT, 0, MPI_COMM_WORLD);
for(i=0;i<world_size;i++)
{
printf("Scattering data %d from root to: %d \n",scattered_matrix[i],world_rank);
}
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
}
int **matrix_generator(int row, int col)
{
int i, j, **intMatrix;
intMatrix = (int **)malloc(sizeof(int *) * row);
for (i = 0; i < row; i++)
{
intMatrix[i] = (int *)malloc(sizeof(int *) * col);
for (j = 0;j<col;j++)
{
intMatrix[i][j]=rand()%10;
}
}
return intMatrix;
}
void display_matrix(int **matrix, int rows,int cols)
{
int i,j;
for (i = 0; i < rows; i = i + 1)
{
for (j = 0; j < cols; j = j + 1)
printf("%d ",matrix[i][j]);
printf("\n");
}
}
The main issue is that your matrices are not allocated in contiguous memory (see the comment section for a link).
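As a minimal sketch of one common way to fix that (the helper names alloc_contiguous and free_contiguous are illustrative, not a drop-in patch for the program above): allocate a single block for the data plus an array of row pointers into it, so that &M[0][0] points at rows*cols contiguous ints and can be handed directly to MPI_Scatter.
#include <stdlib.h>

int **alloc_contiguous(int rows, int cols)
{
    int *data = malloc((size_t)rows * cols * sizeof(int));  /* one contiguous block */
    int **M   = malloc((size_t)rows * sizeof(int *));       /* row pointers into it */
    if (data == NULL || M == NULL) {
        free(data);
        free(M);
        return NULL;
    }
    for (int i = 0; i < rows; i++)
        M[i] = data + i * cols;          /* row i starts at offset i*cols */
    return M;
}

void free_contiguous(int **M)
{
    if (M != NULL) {
        free(M[0]);                      /* the single data block  */
        free(M);                         /* the row-pointer array  */
    }
}
With this layout, M still supports M[i][j] indexing, while MPI_Scatter(&M[0][0], ...) sees one flat buffer.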
The MPI standard does not specify what happens before an app invokes MPI_Init().
The two main MPI implementations choose to spawn all the tasks when mpirun is invoked (that means there are 4 independent processes first, and they "join" into a single MPI job when they all call MPI_Init()).
That being said, once upon a time a vendor chose to have mpirun start a single MPI task and used its own remote fork when MPI_Init() was called.
Bottom line: if you want to write portable code, do as little as possible (and never print anything) before MPI_Init() is called.

MPI_Scatterv segfault

I'm just starting out with MPI programming and decided to make a simple distributed qsort using OpenMPI. To distribute parts of the array I want to sort, I'm trying to use MPI_Scatterv; however, the following code segfaults on me:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#define ARRAY_SIZE 26
#define BUFFER_SIZE 2048
int main(int argc, char** argv) {
int my_rank, nr_procs;
int* data_in, *data_out;
int* sizes;
int* offsets;
srand(time(0));
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nr_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// everybody generates the control tables
int nr_workers = nr_procs-1;
sizes = malloc(sizeof(int)*nr_workers);
offsets = malloc(sizeof(int)*nr_workers);
int nr_elems = ARRAY_SIZE/nr_workers;
// basic distribution
for (int i = 0; i < nr_workers; ++i) {
sizes[i] = nr_elems;
}
// distribute the remainder
int left = ARRAY_SIZE%nr_workers;
int curr_worker = 0;
while (left) {
++sizes[curr_worker];
curr_worker = (++curr_worker)%nr_workers;
--left;
}
// offsets
int curr_offset = 0;
for (int i = 0; i < nr_workers; ++i) {
offsets[i] = curr_offset;
curr_offset += sizes[i];
}
if (my_rank == 0) {
// root
data_in = malloc(sizeof(int)*ARRAY_SIZE);
data_out = malloc(sizeof(int)*ARRAY_SIZE);
for (int i = 0; i < ARRAY_SIZE; ++i) {
data_in[i] = rand();
}
for (int i = 0; i < nr_workers; ++i) {
printf("%d at %d\n", sizes[i], offsets[i]);
}
MPI_Scatterv (data_in, sizes, offsets, MPI_INT, data_out, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
} else {
// worker
printf("%d has %d elements!\n",my_rank, sizes[my_rank-1]);
// alloc the input buffer
data_in = malloc(sizeof(int)*sizes[my_rank-1]);
MPI_Scatterv(NULL, NULL, NULL, MPI_INT, data_in, sizes[my_rank-1], MPI_INT, 0, MPI_COMM_WORLD);
printf("%d got:\n", my_rank);
for (int i = 0; i < sizes[my_rank-1]; ++i) {
printf("%d ", data_in[i]);
}
printf("\n");
}
MPI_Finalize();
return 0;
}
How would I go about using Scatterv? Am I doing something wrong with allocating my input buffer from inside the worker code?
I changed some parts of your code to get something working.
MPI_Scatterv() sends data to every process, including the root itself. According to your program, process 0 expects ARRAY_SIZE integers, but sizes[0] is much smaller.
There are other problems on the other processes: MPI_Scatterv will send sizes[my_rank] integers, but sizes[my_rank-1] are expected...
Here is code that scatters data_in from rank 0 to all processes, including rank 0 itself. Therefore I added 1 to nr_workers:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#define ARRAY_SIZE 26
#define BUFFER_SIZE 2048
int main(int argc, char** argv) {
int my_rank, nr_procs;
int* data_in, *data_out;
int* sizes;
int* offsets;
srand(time(0));
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nr_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// everybody generates the control tables
int nr_workers = nr_procs;
sizes = malloc(sizeof(int)*nr_workers);
offsets = malloc(sizeof(int)*nr_workers);
int nr_elems = ARRAY_SIZE/nr_workers;
// basic distribution
for (int i = 0; i < nr_workers; ++i) {
sizes[i] = nr_elems;
}
// distribute the remainder
int left = ARRAY_SIZE%nr_workers;
int curr_worker = 0;
while (left) {
++sizes[curr_worker];
curr_worker = (++curr_worker)%nr_workers;
--left;
}
// offsets
int curr_offset = 0;
for (int i = 0; i < nr_workers; ++i) {
offsets[i] = curr_offset;
curr_offset += sizes[i];
}
if (my_rank == 0) {
// root
data_in = malloc(sizeof(int)*ARRAY_SIZE);
for (int i = 0; i < ARRAY_SIZE; ++i) {
data_in[i] = rand();
printf("%d %d \n",i,data_in[i]);
}
for (int i = 0; i < nr_workers; ++i) {
printf("%d at %d\n", sizes[i], offsets[i]);
}
} else {
printf("%d has %d elements!\n",my_rank, sizes[my_rank]);
}
data_out = malloc(sizeof(int)*sizes[my_rank]);
MPI_Scatterv (data_in, sizes, offsets, MPI_INT, data_out, sizes[my_rank], MPI_INT, 0, MPI_COMM_WORLD);
printf("%d got:\n", my_rank);
for (int i = 0; i < sizes[my_rank]; ++i) {
printf("%d ", data_out[i]);
}
printf("\n");
free(data_out);
if(my_rank==0){
free(data_in);
}
MPI_Finalize();
return 0;
}
Regarding memory management, data_in and data_out should be freed at the end of the code.
Is this what you wanted to do? Good luck with qsort! I think you are not the first one to sort integers using MPI; see parallel sort using mpi. Your way of generating the random numbers on process 0 and then scattering them is the right way to go. I think you will be interested in his TD_Trier() function for the communication, even if you swap tri_fusion(T, 0, size - 1); for qsort(...).
Bye,
Francis

Troubles in MPI -> Failing at address: (nil)

I'm a beginner in C and MPI, and I'm trying to write a program that multiplies 2 matrices with MPI, but I don't know what is wrong in my code.
I try to 'slice' the matrix M1 into n rows and send them to the other processes to multiply, and I broadcast the matrix M2. Afterwards I do a Gather to build the final matrix M3.
I run it with:
mpirun -n 2 matrix
But I get an error in the terminal:
[adiel-VirtualBox:07921] *** Process received signal ***
[adiel-VirtualBox:07921] Signal: Segmentation fault (11)
[adiel-VirtualBox:07921] Signal code: (128)
[adiel-VirtualBox:07921] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 7921 on node adiel-VirtualBox exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished
Can anyone help me?
Here's my code:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
//#include "mpe.h"
#include <math.h>
void printMatrix(double *M, int m, int n) {
int lin, col;
for (lin=0; lin<m; lin++) {
for (col=0; col<n; col++)
printf("%.2f \t", M[(lin*n+col)]);
printf("\n");
}
}
double* allocateMatrix(int m, int n){
double* M;
M = (double *)malloc(m*n*sizeof(double));
return M;
}
int main( int argc, char *argv[] )
{
int rank, size;
int m1,n1,m2,n2;
int row, col,ctrl,i,k,lines,proc;
double *M1, *M2, *M3, **vp, *v;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
m1 = m2 = n1 = n2 = 3;
lines = (int)ceil(n1/size);
v = (double *)malloc(lines*n1*sizeof(double));
M2 = allocateMatrix(m2,n2);
M3 = allocateMatrix(m1,n2);
if(rank==0)
M1 = allocateMatrix(m1,n1);
//startin matrix
for (col = 0; col < n1; col++){
for (row = 0; row < m1; row++) {
if(rank==0)
M1[(row*m1+col)] = 0;
M2[(row*m2+col)] = 0;
M3[(row*m1+col)] = 0;
}
}
//startin pointers with 0
for(i=0;i<lines*n1;i++)
v[i] = 0;
//populate
if(rank == 0){
for (col = 0; col < n1; col++){
for (row = 0; row < m1; row++) {
M1[row*m1+col] = row*3+(col+1);
M2[(row*m2+col)] = 1;
}
}
}
//---------------------sharing and multiply---------------//
//slicing M1 and sending to other process
if(rank == 0){
proc = size-1;
//for each line
for(row = 0;row<m1;row++){
ctrl = floor(row/lines);
//on each column
for(col=0;col<n1;col++){
v[(ctrl*n1)+col] = M1[(row*n1)+col];
}
if(row%lines == (lines - 1)){
if(proc!=0){
MPI_Send(v,lines*n1,MPI_DOUBLE,proc,1, MPI_COMM_WORLD);
proc--;
//clearing pointers
for(i=0;i<lines*n1;i++)
v[i] = 0;
}
}
}
}
//MPI_Bcast(m1, m*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(M2, m2*n2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//receiving process
if(rank!=0)
MPI_Recv(v,lines*n1,MPI_DOUBLE,0,1,MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(row=0;row<lines;row++){
if(v[row*n1]!=0){
for (col = 0; col < n1; col++){
double val = 0.0;
for(k=0;k<m1;k++){
val += v[(row*n1)+k] * M2[(k*n1)+col];
}
M3[((size-1-rank)*size*n1)+(row*n1)+col] = val;
}
}
}
if(rank!=0){
for(row = 0; row < lines; row++){
MPI_Gather(&M3[((size-1-rank)*size*n1)+(row*n1)], n1, MPI_DOUBLE, &M3[((size-1-rank)*size*n1)+(row*n1)], n1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
}
if(rank == 0){
printf("matrix 1------------------------\n");
printMatrix(M1,m1,n1);
printf("matrix 2------------------------\n");
printMatrix(M2,m2,n2);
printf("matrix 3------------------------\n");
printMatrix(M3,m1,n2);
}
MPI_Finalize();
return 0;
}
For one thing, doing all of your sends before the broadcast and all of the receives after it is asking for trouble. I can easily see that leading to MPI resource exhaustion or deadlock failures. With such a small input that shouldn't arise, but you should fix it regardless. I'll take another look after that.
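As a minimal, self-contained sketch of the safer ordering (sizes and names here are illustrative, not the original variables): reach the collective at the same point on every rank first, then pair each blocking send with a receive that is posted without a collective standing in between.
#include <stdio.h>
#include <mpi.h>

#define COLS 3

int main(int argc, char **argv)
{
    int rank, size;
    double shared[COLS] = {1.0, 1.0, 1.0};   /* stands in for M2          */
    double slice[COLS];                      /* stands in for a row of M1 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1. the collective comes first, on every rank */
    MPI_Bcast(shared, COLS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* 2. then the point-to-point distribution, each send paired with a receive */
    if (rank == 0) {
        for (int proc = 1; proc < size; proc++) {
            double row[COLS] = {proc, proc, proc};
            MPI_Send(row, COLS, MPI_DOUBLE, proc, 1, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(slice, COLS, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d got a row starting with %.1f\n", rank, slice[0]);
    }

    MPI_Finalize();
    return 0;
}
This way no blocking MPI_Send can end up waiting on a rank that is itself stuck inside a collective the root has not reached yet.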

Problem with MPI matrix-matrix multiply: Cluster slower than single computer

I wrote a small program using MPI to parallelize matrix-matrix multiplication. The problem is that when I run the program on my computer it takes about 10 seconds to complete, but about 75 seconds on a cluster. I think I have some synchronization problem, but I cannot figure it out (yet).
Here's my source code:
/*matrix.c
mpicc -o out matrix.c
mpirun -np 11 out
*/
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define N 1000
#define DATA_TAG 10
#define B_SENT_TAG 20
#define FINISH_TAG 30
int master(int);
int worker(int, int);
int main(int argc, char **argv) {
int myrank, p;
double s_time, f_time;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
if (myrank == 0) {
s_time = MPI_Wtime();
master(p);
f_time = MPI_Wtime();
printf("Complete in %1.2f seconds\n", f_time - s_time);
fflush(stdout);
}
else {
worker(myrank, p);
}
MPI_Finalize();
return 0;
}
int *read_matrix_row();
int *read_matrix_col();
int send_row(int *, int);
int recv_row(int *, int, MPI_Status *);
int send_tag(int, int);
int write_matrix(int *);
int master(int p) {
MPI_Status status;
int *a; int *b;
int *c = (int *)malloc(N * sizeof(int));
int i, j; int num_of_finish_row = 0;
while (1) {
for (i = 1; i < p; i++) {
a = read_matrix_row();
b = read_matrix_col();
send_row(a, i);
send_row(b, i);
//printf("Master - Send data to worker %d\n", i);fflush(stdout);
}
wait();
for (i = 1; i < N / (p - 1); i++) {
for (j = 1; j < p; j++) {
//printf("Master - Send next row to worker[%d]\n", j);fflush(stdout);
b = read_matrix_col();
send_row(b, j);
}
}
for (i = 1; i < p; i++) {
//printf("Master - Announce all row of B sent to worker[%d]\n", i);fflush(stdout);
send_tag(i, B_SENT_TAG);
}
//MPI_Barrier(MPI_COMM_WORLD);
for (i = 1; i < p; i++) {
recv_row(c, MPI_ANY_SOURCE, &status);
//printf("Master - Receive result\n");fflush(stdout);
num_of_finish_row++;
}
//printf("Master - Finish %d rows\n", num_of_finish_row);fflush(stdout);
if (num_of_finish_row >= N)
break;
}
//printf("Master - Finish multiply two matrix\n");fflush(stdout);
for (i = 1; i < p; i++) {
send_tag(i, FINISH_TAG);
}
//write_matrix(c);
return 0;
}
int worker(int myrank, int p) {
int *a = (int *)malloc(N * sizeof(int));
int *b = (int *)malloc(N * sizeof(int));
int *c = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
c[i] = 0;
}
MPI_Status status;
int next = (myrank == (p - 1)) ? 1 : myrank + 1;
int prev = (myrank == 1) ? p - 1 : myrank - 1;
while (1) {
recv_row(a, 0, &status);
if (status.MPI_TAG == FINISH_TAG)
break;
recv_row(b, 0, &status);
wait();
//printf("Worker[%d] - Receive data from master\n", myrank);fflush(stdout);
while (1) {
for (i = 1; i < p; i++) {
//printf("Worker[%d] - Start calculation\n", myrank);fflush(stdout);
calc(c, a, b);
//printf("Worker[%d] - Exchange data with %d, %d\n", myrank, next, prev);fflush(stdout);
exchange(b, next, prev);
}
//printf("Worker %d- Request for more B's row\n", myrank);fflush(stdout);
recv_row(b, 0, &status);
//printf("Worker %d - Receive tag %d\n", myrank, status.MPI_TAG);fflush(stdout);
if (status.MPI_TAG == B_SENT_TAG) {
break;
//printf("Worker[%d] - Finish calc one row\n", myrank);fflush(stdout);
}
}
//wait();
//printf("Worker %d - Send result\n", myrank);fflush(stdout);
send_row(c, 0);
for (i = 0; i < N; i++) {
c[i] = 0;
}
}
return 0;
}
int *read_matrix_row() {
int *row = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
row[i] = 1;
}
return row;
}
int *read_matrix_col() {
int *col = (int *)malloc(N * sizeof(int));
int i;
for (i = 0; i < N; i++) {
col[i] = 1;
}
return col;
}
int send_row(int *row, int dest) {
MPI_Send(row, N, MPI_INT, dest, DATA_TAG, MPI_COMM_WORLD);
return 0;
}
int recv_row(int *row, int src, MPI_Status *status) {
MPI_Recv(row, N, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, status);
return 0;
}
int wait() {
MPI_Barrier(MPI_COMM_WORLD);
return 0;
}
int calc(int *c_row, int *a_row, int *b_row) {
int i;
for (i = 0; i < N; i++) {
c_row[i] = c_row[i] + a_row[i] * b_row[i];
//printf("%d ", c_row[i]);
}
//printf("\n");fflush(stdout);
return 0;
}
int exchange(int *row, int next, int prev) {
MPI_Request request; MPI_Status status;
MPI_Isend(row, N, MPI_INT, next, DATA_TAG, MPI_COMM_WORLD, &request);
MPI_Irecv(row, N, MPI_INT, prev, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
return 0;
}
int send_tag(int worker, int tag) {
MPI_Send(0, 0, MPI_INT, worker, tag, MPI_COMM_WORLD);
return 0;
}
int write_matrix(int *matrix) {
int i;
for (i = 0; i < N; i++) {
printf("%d ", matrix[i]);
}
printf("\n");
fflush(stdout);
return 0;
}
Well, first, you have a fairly small matrix (N=1000), and secondly you distribute your algorithm on a row/column basis rather than in blocks.
For a more realistic version using better algorithms, you might want to acquire an optimized BLAS library (e.g. GOTO is free), test single-thread performance with that one, then get PBLAS and link it against your optimized BLAS, and compare MPI parallel performance using the PBLAS version.
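As a rough single-node baseline, here is a minimal sketch of such a BLAS call, assuming a CBLAS interface is available (for example from OpenBLAS; the exact link flag, e.g. -lopenblas or -lcblas, depends on the installation). The size and the all-ones matrices mirror the question; nothing here is from the original program.
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define N 1000

int main(void)
{
    double *A = malloc((size_t)N * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));
    double *C = malloc((size_t)N * N * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) {
        A[i] = 1.0;
        B[i] = 1.0;
        C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    printf("C[0][0] = %f\n", C[0]);   /* expect 1000.0 */
    free(A); free(B); free(C);
    return 0;
}
Comparing this single-threaded time with the per-node time of the MPI version gives a realistic idea of how much performance the distributed algorithm is leaving on the table.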
I see some errors in your program:
First, why are you calling the wait() function, since its implementation is simply a call to MPI_Barrier? MPI_Barrier is a synchronization primitive that blocks all processes until they all reach the "barrier" by calling MPI_Barrier. My question is: do you want the master to be synchronized with the workers? In this context that would not be optimal, because a worker doesn't need to wait for the master to begin its calculation.
Second, there are 2 unnecessary for loops.
for (i = 1; i < N / (p - 1); i++) {
for (j = 1; j < p; j++) {
b = read_matrix_col();
send_row(b, j);
}
}
for (i = 1; i < p; i++) {
send_tag(i, B_SENT_TAG);
}
In the first i-loop, you don't use the loop variable in the body. Since the j-loop and the second i-loop run over the same range, you could do:
for (i = 1; i < p; i++) {
b = read_matrix_col();
send_row(b, i); // send the column to worker i ...
send_tag(i, B_SENT_TAG); // ... then tell it no more columns follow
}
In terms of data transfer, your program is not optimized because you are sending an array of 1000 integers for each transfer. There should be a better way to optimise the data transfer, but I will let you look into it. So make the corrections described above and tell us what your new performance is.
And as #janneb said, you can use the BLAS library for better performance for matrix multiplication. Good luck!
I did not look over your code, but I can provide some hints about why your result may not be unexpected:
As already mentioned, N=1000 may be too small. You should make more tests to see the scalability of your program (try setting N=100, 500, 1000, 5000, 10000, etc.) and compare results on both your system and the cluster.
Compare results between your system (one processor I presume) and a single processor on the cluster. Usually in production environments like servers or clusters a single processor is less powerful than the best processors designed for desktop use, but they provide stability, reliability and other features useful for environments which run 24h/day at full capacity.
If your processor has multiple cores, more than one MPI process may run at the same time, and synchronization between them is negligible compared to the synchronization between nodes in a cluster.
Are the nodes of the cluster statically assigned to you? Maybe other users' programs are scheduled on your nodes at the same time as yours.
Read documentation about the cluster's architecture. Some architectures may be more suitable for particular classes of problems.
Assess latency of the network of the cluster. Ping-ing from each node to another many times and computing the mean value may give a rough estimate.
Last but perhaps the most important, your algorithm may not be optimal. Read a/some books on matrix multiplication (I can recommend "Matrix Computations", Golub and Van Loan).
