Writing distributed arrays using MPI-IO and Cartesian topology - c

I have an MPI code that implements 2D domain decomposition to compute numerical solutions to a PDE. Currently I write certain 2D distributed arrays out for each process (e.g. array_x--> proc000x.bin). I want to reduce that to a single binary file.
array_0, array_1,
array_2, array_3,
Suppose the above illustrates a Cartesian topology with 4 processes (2x2). Each 2D array has dimension (nx + 2, ny + 2); the +2 signifies "ghost" layers added to all sides for communication purposes.
I would like to extract the main arrays (omit the ghost layers) and write them to a single binary file with an order something like,
array_0, array_1, array_2, array_3 --> output.bin
If possible it would be preferable to write it as though I had access to the global grid and was writing row-by-row i.e.,
row 0 of array_0, row 0 of array_1, row 1 of array_0, row 1 of array_1, ...
The code below, in file array_test.c, attempts the former of the two output formats.
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
/* 2D array allocation */
float **alloc2D(int rows, int cols);
float **alloc2D(int rows, int cols) {
int i, j;
float *data = malloc(rows * cols * sizeof(float));
float **arr2D = malloc(rows * sizeof(float *));
for (i = 0; i < rows; i++) {
arr2D[i] = &(data[i * cols]);
}
/* Initialize to zero */
for (i= 0; i < rows; i++) {
for (j=0; j < cols; j++) {
arr2D[i][j] = 0.0;
}
}
return arr2D;
}
int main(void) {
/* Creates 5x5 array of floats with padding layers and
* attempts to write distributed arrays */
/* Run toy example with 4 processes */
int i, j, row, col;
int nx = 5, ny = 5, npad = 1;
int my_rank, nproc=4;
int dim[2] = {2, 2}; /* 2x2 cartesian grid */
int period[2] = {0, 0};
int coord[2];
int reorder = 1;
float **A = NULL;
MPI_Comm grid_Comm;
/* Initialize MPI */
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Establish cartesian topology */
MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &grid_Comm);
/* Get cartesian grid indices of processes */
MPI_Cart_coords(grid_Comm, my_rank, 2, coord);
row = coord[1];
col = coord[0];
/* Add ghost layers */
nx += 2 * npad;
ny += 2 * npad;
A = alloc2D(nx, ny);
/* Create derived datatype for interior grid (output grid) */
MPI_Datatype grid;
int start[2] = {npad, npad};
int arrsize[2] = {nx, ny};
int gridsize[2] = {nx - 2 * npad, ny - 2 * npad};
MPI_Type_create_subarray(2, arrsize, gridsize,
start, MPI_ORDER_C, MPI_FLOAT, &grid);
MPI_Type_commit(&grid);
/* Fill interior grid */
for (i = npad; i < nx-npad; i++) {
for (j = npad; j < ny-npad; j++) {
A[i][j] = my_rank + i;
}
}
/* MPI IO */
MPI_File fh;
MPI_Status status;
char file_name[100];
int N, offset;
sprintf(file_name, "output.bin");
MPI_File_open(grid_Comm, file_name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
N = (nx - 2 * npad) * (ny - 2 *npad);
offset = (row * 2 + col) * N * sizeof(float);
MPI_File_set_view(fh, offset, MPI_FLOAT, grid, "native",
MPI_INFO_NULL);
MPI_File_write_all(fh, &A[0][0], N, MPI_FLOAT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
/* Cleanup */
free(A[0]);
free(A);
MPI_Type_free(&grid);
MPI_Finalize();
return 0;
}
Compiles with
mpicc -o array_test array_test.c
Runs with
mpiexec -n 4 ./array_test
While the code compiles and runs, the output is incorrect. I'm assuming that I have misinterpreted the use of the derived datatype and file writing in this case. I'd appreciate some help figuring out my mistakes.

The error you make here is that you have the wrong file view. Instead of creating a type representing the share of the file the current processor is responsible for, you use the mask corresponding to the local data you want to write.
You have actually two very distinct masks to consider:
The mask for the local data, excluding the halo layers; and
The mask for the global data, as it should be once collated into the file.
The former corresponds to the layout of the local array: the data that you want to output to the file for a given process (the interior), surrounded by the halo layer that should not be written to the file.
The latter corresponds to the layout of the file: each process's local data placed at its position in the 2D Cartesian grid.
To understand what you need to create to reach this final result, you have to think backwards:
Your final call to the IO routine should be MPI_File_write_all(fh, &A[0][0], 1, interior, MPI_STATUS_IGNORE);, so you have to have your interior type defined so as to exclude the halo boundary. Fortunately, the type grid you created already does exactly that, so we will use it.
But now you have to set the view on the file to allow for this MPI_File_write_all() call. The view must describe the second layout above, so we will create a new MPI type representing it. For that, MPI_Type_create_subarray() is what we need.
Here is the synopsis of this function:
int MPI_Type_create_subarray(int ndims,
const int array_of_sizes[],
const int array_of_subsizes[],
const int array_of_starts[],
int order,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
Create a datatype for a subarray of a regular, multidimensional array
INPUT PARAMETERS
ndims - number of array dimensions (positive integer)
array_of_sizes
- number of elements of type oldtype in each
dimension of the full array (array of positive integers)
array_of_subsizes
- number of elements of type oldtype in each dimension of
the subarray (array of positive integers)
array_of_starts
- starting coordinates of the subarray in each dimension
(array of nonnegative integers)
order - array storage order flag (state)
oldtype - array element datatype (handle)
OUTPUT PARAMETERS
newtype - new datatype (handle)
For our 2D Cartesian file view, here are what we need for these input parameters:
ndims: 2 as the grid is 2D
array_of_sizes: these are the dimensions of the global array to output, namely { nnx*dim[0], nny*dim[1] }
array_of_subsizes: these are the dimensions of the local share of the data to output, namely { nnx, nny }
array_of_starts: these are the x,y start coordinates of the local share within the global grid, namely { nnx*coord[0], nny*coord[1] }
order: the ordering is C so this must be MPI_ORDER_C
oldtype: data are floats so this must be MPI_FLOAT
Now that we have our type for the file view, we simply apply it with MPI_File_set_view(fh, 0, MPI_FLOAT, view, "native", MPI_INFO_NULL); and the magic is done.
Your full code becomes:
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
/* 2D array allocation */
float **alloc2D(int rows, int cols);
float **alloc2D(int rows, int cols) {
int i, j;
float *data = malloc(rows * cols * sizeof(float));
float **arr2D = malloc(rows * sizeof(float *));
for (i = 0; i < rows; i++) {
arr2D[i] = &(data[i * cols]);
}
/* Initialize to zero */
for (i= 0; i < rows; i++) {
for (j=0; j < cols; j++) {
arr2D[i][j] = 0.0;
}
}
return arr2D;
}
int main(void) {
/* Creates 5x5 array of floats with padding layers and
* attempts to write distributed arrays */
/* Run toy example with 4 processes */
int i, j, row, col;
int nx = 5, ny = 5, npad = 1;
int my_rank, nproc=4;
int dim[2] = {2, 2}; /* 2x2 cartesian grid */
int period[2] = {0, 0};
int coord[2];
int reorder = 1;
float **A = NULL;
MPI_Comm grid_Comm;
/* Initialize MPI */
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* Establish cartesian topology */
MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &grid_Comm);
/* Get cartesian grid indices of processes */
MPI_Cart_coords(grid_Comm, my_rank, 2, coord);
row = coord[1];
col = coord[0];
/* Add ghost layers */
nx += 2 * npad;
ny += 2 * npad;
A = alloc2D(nx, ny);
/* Create derived datatype for interior grid (output grid) */
MPI_Datatype grid;
int start[2] = {npad, npad};
int arrsize[2] = {nx, ny};
int gridsize[2] = {nx - 2 * npad, ny - 2 * npad};
MPI_Type_create_subarray(2, arrsize, gridsize,
start, MPI_ORDER_C, MPI_FLOAT, &grid);
MPI_Type_commit(&grid);
/* Fill interior grid */
for (i = npad; i < nx-npad; i++) {
for (j = npad; j < ny-npad; j++) {
A[i][j] = my_rank + i;
}
}
/* Create derived type for file view */
MPI_Datatype view;
int nnx = nx-2*npad, nny = ny-2*npad;
int startV[2] = { coord[0]*nnx, coord[1]*nny };
int arrsizeV[2] = { dim[0]*nnx, dim[1]*nny };
int gridsizeV[2] = { nnx, nny };
MPI_Type_create_subarray(2, arrsizeV, gridsizeV,
startV, MPI_ORDER_C, MPI_FLOAT, &view);
MPI_Type_commit(&view);
/* MPI IO */
MPI_File fh;
MPI_File_open(grid_Comm, "output.bin", MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, view, "native", MPI_INFO_NULL);
MPI_File_write_all(fh, &A[0][0], 1, grid, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
/* Cleanup */
free(A[0]);
free(A);
MPI_Type_free(&view);
MPI_Type_free(&grid);
MPI_Finalize();
return 0;
}
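To check the result, a minimal serial reader can dump output.bin as the global grid (a sketch assuming the same 2x2 process grid and 5x5 interiors as above, i.e. a 10x10 global array of floats written in "native" representation on the same machine):
#include <stdio.h>
int main(void) {
    enum { ROWS = 10, COLS = 10 }; /* dim[0]*nnx rows, dim[1]*nny cols */
    float buf[ROWS * COLS];
    FILE *fp = fopen("output.bin", "rb");
    if (fp == NULL || fread(buf, sizeof(float), ROWS * COLS, fp) != ROWS * COLS) {
        fprintf(stderr, "could not read output.bin\n");
        return 1;
    }
    fclose(fp);
    /* Print the global grid row by row */
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++)
            printf("%5.1f ", buf[i * COLS + j]);
        printf("\n");
    }
    return 0;
}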

Related

MPI user defined type for parts of three different matrices

I am doing a particle simulation, and need to send some part of three different arrays to other processes. How to use MPI user defined types to do this?
For example, suppose I have three matrices of type double, A, B, and C, on Process 1. Now I want to send the first two rows of A, B, and C to Process 2. So how do I use an MPI user defined type to do this, assuming C-style storage for these matrices? Thank you.
Currently, I am copying the first two rows of these matrices to a single buffer and then performing MPI Send. This involves basically the following steps:
Copy the first two rows of A, B, and C to a send_buffer on Process 1.
Send the send_buffer from Process 1 to Process 2.
On Process 2, use recv_buffer to receive data from Process 1.
On Process 2, copy data from recv_buffer to A, B, C on Process 2.
I hope there is a better way to do this. Thanks.
In the code below, an MPI data type is defined to communicate a range of rows of a matrix. If there are three matrices, then there would be three sends/receives. You can compare the following code with your own code to see which one is better.
If you think transferring the matrices one by one is not efficient, then you might put all matrices inside a struct and make an MPI data type for it, or consider using MPI_PACK (a sketch of that follows the code below).
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
void make_layout(int row_begin, int row_end, int ncol, MPI_Datatype* mpi_dtype)
{
int nblock = 1;
int block_count = (row_end - row_begin + 1) * ncol;
MPI_Aint lb, extent;
MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);
MPI_Aint offset = row_begin * ncol * extent;
MPI_Datatype block_type = MPI_DOUBLE;
MPI_Type_create_struct(nblock, &block_count, &offset, &block_type, mpi_dtype);
MPI_Type_commit(mpi_dtype);
}
double** allocate(int nrow, int ncol)
{
double *data = (double *)malloc(nrow*ncol*sizeof(double));
double **array= (double **)malloc(nrow*sizeof(double*));
for (int i=0; i<nrow; i++)
array[i] = &(data[ncol*i]);
return array;
}
int main()
{
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// make 3x3 matrix.
int nrow = 3;
int ncol = 3;
double** A = allocate(nrow, ncol);
// make mpi datatype to send rows [0, 1] excluding row 2.
// you can send any range of rows i.e. rows [row_begin, row_end].
int row_begin = 0;
int row_end = 1; // inclusive.
MPI_Datatype mpi_dtype;
make_layout(row_begin, row_end, ncol, &mpi_dtype);
if (rank == 0)
MPI_Send(&(A[0][0]), 1, mpi_dtype, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(&(A[0][0]), 1, mpi_dtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Type_free(&mpi_dtype);
MPI_Finalize();
return 0;
}
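As mentioned above, MPI_Pack can bundle the rows of all three matrices into a single message. Here is a minimal sketch under the same assumptions as the code above (matrices allocated contiguously with allocate(), same includes; the helper name send_rows_packed is mine); the receiver mirrors it with an MPI_Recv on MPI_PACKED followed by three MPI_Unpack calls:
/* Sketch: pack rows [row_begin, row_end] of A, B and C into one buffer
 * and send it as a single message. Because allocate() uses one
 * contiguous block, M[row_begin] points at `count` consecutive doubles. */
void send_rows_packed(double **A, double **B, double **C,
                      int row_begin, int row_end, int ncol,
                      int dest, MPI_Comm comm)
{
    int count = (row_end - row_begin + 1) * ncol; /* doubles per matrix */
    int bufsize, pos = 0;
    MPI_Pack_size(3 * count, MPI_DOUBLE, comm, &bufsize);
    char *buf = malloc(bufsize);
    MPI_Pack(&A[row_begin][0], count, MPI_DOUBLE, buf, bufsize, &pos, comm);
    MPI_Pack(&B[row_begin][0], count, MPI_DOUBLE, buf, bufsize, &pos, comm);
    MPI_Pack(&C[row_begin][0], count, MPI_DOUBLE, buf, bufsize, &pos, comm);
    MPI_Send(buf, pos, MPI_PACKED, dest, 0, comm);
    free(buf);
}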

C: (Working But Not As Intended) My determinant method is editing the provided matrix (as a parameter) but I just want it to provide the determinant

Here is my code for the determinant function:
double getDeterminate(Matrix m) {
//We don't want to change the given matrix, just
//find its determinant, so use a temporary matrix
//instead for the calculations
double **temp = m.matrix;
int rows = m.rows, cols = m.cols;
//If the matrix is not square then the determinant does not exist
if(rows == cols) {
//If the matrix is a 2x2 matrix then the
//determinant is ([0,0]x[1,1]) - ([0,1]x[1,0])
if(rows == 2 && cols == 2) {
return (temp[0][0] * temp[1][1]) - (temp[0][1] * temp[1][0]);
}
//Otherwise if it is n*n...we do things the long way:
//we will be using a method where we convert the
//matrix into an upper triangle, and then the det
//will simply be the product of all diagonal
//entries
int i, j, k;
double ratio;
for(i = 0; i < rows; i++){
for(j = 0; j < cols; j++){
if(j>i){
ratio = temp[j][i]/temp[i][i];
for(k = 0; k < rows; k++) {
temp[j][k] -= ratio * temp[i][k];
}
}
}
}
double det = 1; //storage for determinant
for(i = 0; i < rows; i++) {
det *= temp[i][i];
}
return det;
}
return 0;
}
And here is my struct for the Matrix (in an external header file):
#ifndef MATRIX_H
#define MATRIX_H
typedef struct Matrix {
int rows;
int cols;
double ** matrix;
} Matrix;
/** Build/Destroy/Display Methods
-------------------------------------------*/
Matrix buildMatrix(int r, int c);
Matrix displayMatrix(Matrix m);
/** Matrix Get and Set
-------------------------------------------*/
void setValue(Matrix m, double val, int r, int c);
double getValue(Matrix m, int r, int c);
/** Matrix Multiplication
-------------------------------------------*/
Matrix multByValue(double val, Matrix m);
Matrix multByMatrix(Matrix mat1, Matrix mat2);
/** Matrix Inversion
-------------------------------------------*/
Matrix invertMatrix(Matrix m);
/** Matrix Determinate
-------------------------------------------*/
double getDeterminate(Matrix m);
/** Matrix Transpose
-------------------------------------------*/
Matrix transpose(Matrix m);
#endif
Currently it is doing exactly what I want it to do, finding the determinant by turning the matrix into an upper triangle. However, when I run the determinant method, it still changes the matrix I provided as a parameter, even though I believe I used temporary variables to do the calculations without messing with the given Matrix.
Example In my main.c I have this:
Matrix mat2 = buildMatrix(3, 3);
setValue(mat2, 12, 0, 0); setValue(mat2, 4, 0, 1); setValue(mat2, 3, 0, 2);
setValue(mat2, 4, 1, 0); setValue(mat2, 13, 1, 1); setValue(mat2, 25, 1, 2);
setValue(mat2, 78, 2, 0); setValue(mat2, 8, 2, 1); setValue(mat2, 9, 2, 2);
displayMatrix(mat2);
printf("Determinate of mat2 is %4.2f", getDeterminate(mat2));
Matrix mat4 = transpose(mat2);
displayMatrix(mat4);
And when I run it I get:
Mat2 Before Transpose
| 12.00 4.00 3.00|
| 4.00 13.00 25.00|
| 78.00 8.00 9.00|
Determinate of mat2 is 3714.00
Mat4 After Transpose of Mat2
| 12.00 0.00 0.00|
| 4.00 11.67 0.00|
| 3.00 24.00 26.53|
As you can see the transpose function is doing its job correctly, but mat2 has been changed by the determinant method. I don't want that to occur, so what have I done to cause this to happen, and what can I do to fix it? I think it has something to do with pointers; I am new to C, so I don't really know what to do.
What you did is called shallow copy.
Here, with double **temp = m.matrix; you just assigned the pointer to the memory that holds your matrix's elements to temp, so when you edit it, you edit the data that your original matrix holds as well. That's because the original elements exist in only one place, and both the temp and m.matrix pointers point to it.
In order to do it the way you want, you have to create a deep copy: allocate new memory for the temp matrix and copy every single element from the original one into it. Don't forget to free() that dynamically allocated memory once you are done with it.
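A minimal sketch of such a deep copy for the Matrix struct above (needs <stdlib.h> for malloc; the helper name copyMatrix is mine, not part of the original code):
/* Deep copy: fresh backing storage plus fresh row pointers, so edits
 * to the copy can never touch the original matrix. */
double **copyMatrix(Matrix m) {
    double *data = malloc(m.rows * m.cols * sizeof(double));
    double **copy = malloc(m.rows * sizeof(double *));
    for (int i = 0; i < m.rows; i++) {
        copy[i] = &data[i * m.cols];
        for (int j = 0; j < m.cols; j++)
            copy[i][j] = m.matrix[i][j];
    }
    return copy;
}
Inside getDeterminate() you would then start with double **temp = copyMatrix(m); and call free(temp[0]); free(temp); before each return.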

Gaussian Blur, Mean Filter, Convolution

I want to implement a convolution function to use in a mean filter and a Gaussian filter, and I need to implement those two filters as well, to apply to PGM files.
I have this struct:
typedef struct _PGM{
int row;
int col;
int max_value;
int **matrix;
}PGM;
and this convolution function:
int convolution(int ** kernel,int ksize, PGM * image, PGM * output){
int i, j, x, y;
int sum;
int data;
int scale =ksize*ksize;
int coeff;
for (x=ksize/2; x<image->row-ksize/2;++x) {
for (y=ksize/2; y<image->col-ksize/2; ++y){
sum = 0;
for (i=-ksize/2; i<=ksize/2; ++i){
for (j=-ksize/2; j<=ksize/2; ++j){
data = image->matrix[x +i][y +j];
coeff = kernel[i+ksize/2][j+ksize/2];
sum += data * coeff;
}
}
output->matrix[x][y] = sum / scale;
}
}
return sum/scale;
}
However, I get an error (actually, the program terminates) in the convolution function, so I could not proceed to the filters.
Can you help me with the implementation?
Thank you.
In your convolution there are two things wrong that probably aren't causing the crash. The first is style: You're using x to iterate over the rows of an image, something I picture more as a y displacement, and vice-versa. The second is that when you're computing the sum, you're not resetting the variable sum = 0 prior to evaluating the kernel (the inner two loops) for each pixel. Instead you accumulate sum over all pixels, probably eventually causing integer overflow. While strictly speaking this is UB and could cause a crash, it's not the issue you're facing.
If you would kindly confirm that the crash occurs on the first pixel (x = ksize/2, y = ksize/2), then since the crash occurs at the first coefficient read from the kernel, I suspect you may have passed the "wrong thing" as the kernel. As presented, the kernel is an int**. For a kernel size of 3x3, this means that to call this function correctly, you must have allocated on the heap or stack an array of int*, where you stored 3 pointers to int arrays with 3 coefficients each. If you instead passed a int[3][3] array, the convolution function will attempt to interpret the first one or two int in the array as a pointer to an int when it is not, and try to dereference it to pull in the coefficient. This will most likely cause a segfault.
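For illustration, here is a minimal sketch of allocating a kernel that matches the int ** signature (needs <stdlib.h>; the helper name make_kernel and the all-ones mean kernel are mine, just examples):
/* Allocate a ksize x ksize kernel as an array of row pointers into one
 * contiguous block, matching convolution()'s int ** parameter. */
int **make_kernel(int ksize, int fill) {
    int *data = malloc(ksize * ksize * sizeof(int));
    int **kernel = malloc(ksize * sizeof(int *));
    for (int i = 0; i < ksize; i++) {
        kernel[i] = &data[i * ksize];
        for (int j = 0; j < ksize; j++)
            kernel[i][j] = fill; /* e.g. all 1s for a mean filter */
    }
    return kernel;
}
/* Usage: int **kernel = make_kernel(3, 1);
 *        convolution(kernel, 3, &image, &output);
 *        free(kernel[0]); free(kernel); */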
I also don't know why you are returning the accumulated sum. This isn't a "traditional" output of convolution, but I surmise you were interested in the average brightness of the output image, which is legitimate; in this case you should use a separate and wider integer accumulator (long or long long) and, at the end, divide it by the number of pixels in the output.
You probably found the PGM data structure on the internet, say, here. Allow me to offer some best-practice advice. In my field (computer vision), the computer vision library of choice, OpenCV, does not express a matrix as an array of row pointers to buffers of col elements. Instead, a large slab of memory is allocated, in this case of size image->row * image->col * sizeof(int) at a minimum, but often image->row * image->step * sizeof(int), where image->step is image->col rounded up to the next multiple of 4 or 16. Then only a single pointer is kept, a pointer to the base of the entire image, although an extra field (the step) has to be kept if images aren't continuous.
I would therefore rework your code thus:
/* Includes */
#include <stdlib.h>
/* Defines */
#define min(a, b) (((a) < (b)) ? (a) : (b))
#define max(a, b) (((a) > (b)) ? (a) : (b))
/* Structure */
/**
* Mat structure.
*
* Stores the number of rows and columns in the matrix, the step size
* (number of elements to jump from one row to the next; must be larger than or
* equal to the number of columns), and a pointer to the first element.
*/
typedef struct Mat{
int rows;
int cols;
int step;
int* data;
} Mat;
/* Functions */
/**
* Allocation. Allocates a matrix big enough to hold rows * cols elements.
*
* If a custom step size is wanted, it can be given. Otherwise, an invalid one
* can be given (such as 0 or -1), and the step size will be chosen
* automatically.
*
* If a pointer to existing data is provided, don't bother allocating fresh
* memory. However, in that case, rows, cols and step must all be provided and
* must be correct.
*
* @param [in] rows The number of rows of the new Mat.
* @param [in] cols The number of columns of the new Mat.
* @param [in] step The step size of the new Mat. For newly-allocated
* images (existingData == NULL), can be <= 0, in
* which case a default step size is chosen; for
* pre-existing data (existingData != NULL), must be
* provided.
* @param [in] existingData A pointer to existing data. If NULL, a fresh buffer
* is allocated; otherwise the given data is used as
* the base pointer.
* @return An allocated Mat structure.
*/
Mat allocMat(int rows, int cols, int step, int* existingData){
Mat M;
M.rows = max(rows, 0);
M.cols = max(cols, 0);
M.step = max(step, M.cols);
if(rows <= 0 || cols <= 0){
M.data = 0;
}else if(existingData == 0){
M.data = malloc(M.rows * M.step * sizeof(*M.data));
}else{
M.data = existingData;
}
return M;
}
/**
* Convolution. Convolves input by the given kernel (centered) and stores
* to output. Does not handle boundaries (i.e., in locations near the border,
* leaves output unchanged).
*
* @param [in] input The input image.
* @param [in] kern The kernel. Both width and height must be odd.
* @param [out] output The output image.
* @return Average brightness of output.
*
* Note: None of the image buffers may overlap with each other.
*/
int convolution(const Mat* input, const Mat* kern, Mat* output){
int i, j, x, y;
int coeff, data;
int sum;
int avg;
long long acc = 0;
/* Short forms of the image dimensions */
const int iw = input ->cols, ih = input ->rows, is = input ->step;
const int kw = kern ->cols, kh = kern ->rows, ks = kern ->step;
const int ow = output->cols, oh = output->rows, os = output->step;
/* Kernel half-sizes and number of elements */
const int kw2 = kw/2, kh2 = kh/2;
const int kelem = kw*kh;
/* Left, right, top and bottom limits */
const int l = kw2,
r = max(min(iw-kw2, ow-kw2), l),
t = kh2,
b = max(min(ih-kh2, oh-kh2), t);
/* Total number of pixels */
const int totalPixels = (r-l)*(b-t);
/* Input, kernel and output base pointers. iPtr and oPtr must start at
   the first processed row (t) so that row y of the input sits under the
   kernel centre; kPtr points at the kernel's central coefficient. */
const int* iPtr = input ->data + is*t;
const int* kPtr = kern ->data + kw2 + ks*kh2;
int* oPtr = output->data + os*t;
/* Iterate over pixels of image */
for(y=t; y<b; y++){
for(x=l; x<r; x++){
sum = 0;
/* Iterate over elements of kernel */
for(i=-kh2; i<=kh2; i++){
for(j=-kw2; j<=kw2; j++){
data = iPtr[j + is*i + x];
coeff = kPtr[j + ks*i ];
sum += data * coeff;
}
}
/* Compute average. Add to accumulator and store as output. */
avg = sum / kelem;
acc += avg;
oPtr[x] = avg;
}
/* Bump pointers by one row step. */
iPtr += is;
oPtr += os;
}
/* Compute average brightness over entire output */
if(totalPixels == 0){
avg = 0;
}else{
avg = acc/totalPixels;
}
/* Return average brightness */
return avg;
}
/**
* Main
*/
int main(int argc, char* argv[]){
/**
* Coefficients of K. Binomial 3x3, separable. Unnormalized (weight = 16).
* Step = 3.
*/
int Kcoeff[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
Mat I = allocMat(1080, 1920, 0, 0);/* Full HD 1080p: 1080 rows x 1920 cols */
Mat O = allocMat(1080, 1920, 0, 0);
Mat K = allocMat( 3, 3, 3, &Kcoeff[0][0]);
/* Fill Mat I with something.... */
/* Convolve with K... */
int avg = convolution(&I, &K, &O);
/* Do something with O... */
/* Return */
return 0;
}
Reference: Years of experience in computer vision.

Segmentation fault after scatterv

/**
* BLOCK_LOW
* Returns the offset of a local array
* with regards to block decomposition
* of a global array.
*
* @param (int) process rank
* @param (int) total number of processes
* @param (int) size of global array
* @return (int) offset of local array in global array
*/
#define BLOCK_LOW(id, p, n) ((id)*(n)/(p))
/**
* BLOCK_HIGH
* Returns the index immediately after the
* end of a local array with regards to
* block decomposition of a global array.
*
* @param (int) process rank
* @param (int) total number of processes
* @param (int) size of global array
* @return (int) offset after end of local array
*/
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id)+1, (p), (n)))
/**
* BLOCK_SIZE
* Returns the size of a local array
* with regards to block decomposition
* of a global array.
*
* @param (int) process rank
* @param (int) total number of processes
* @param (int) size of global array
* @return (int) size of local array
*/
#define BLOCK_SIZE(id, p, n) ((BLOCK_HIGH((id), (p), (n))) - (BLOCK_LOW((id), (p), (n))))
/**
* BLOCK_OWNER
* Returns the rank of the process that
* handles a certain local array with
* regards to block decomposition of a
* global array.
*
* @param (int) index in global array
* @param (int) total number of processes
* @param (int) size of global array
* @return (int) rank of process that handles index
*/
#define BLOCK_OWNER(i, p, n) (((p)*((i)+1)-1)/(n))
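/* Worked example of these macros (for illustration): distributing
 * n = 10 rows over p = 3 processes gives
 *   BLOCK_LOW  -> 0, 3, 6    (for id = 0, 1, 2)
 *   BLOCK_HIGH -> 3, 6, 10
 *   BLOCK_SIZE -> 3, 3, 4
 * so the leftover row ends up with the highest-ranked process. */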
/*Matricefilenames:
small matrix A.bin of dimension 100 × 50
small matrix B.bin of dimension 50 × 100
large matrix A.bin of dimension 1000 × 500
large matrix B.bin of dimension 500 × 1000
An MPI program should be implemented such that it can
• accept two file names at run-time,
• let process 0 read the A and B matrices from the two data files,
• let process 0 distribute the pieces of A and B to all the other processes,
• involve all the processes to carry out the the chosen parallel algorithm
for matrix multiplication C = A * B ,
• let process 0 gather, from all the other processes, the different pieces
of C ,
• let process 0 write out the entire C matrix to a data file.
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "mpi-utils.c"
void read_matrix_binaryformat (char*, double***, int*, int*);
void write_matrix_binaryformat (char*, double**, int, int);
void create_matrix (double***,int,int);
void matrix_multiplication (double ***, double ***, double ***,int,int, int);
int main(int argc, char *argv[]) {
int id,p; // Process rank and total amount of processes
int rowsA, colsA, rowsB, colsB; // Matrix dimensions
double **A; // Matrix A
double **B; // Matrix B
double **C; // Result matrix C : AB
int local_rows; // Local row dimension of the matrix A
double **local_A; // The local A matrix
double **local_C; // The local C matrix
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &id);
MPI_Comm_size (MPI_COMM_WORLD, &p);
if(argc != 3) {
if(id == 0) {
printf("Usage:\n>> %s matrix_A matrix_B\n",argv[0]);
}
MPI_Finalize();
exit(1);
}
if (id == 0) {
read_matrix_binaryformat (argv[1], &A, &rowsA, &colsA);
read_matrix_binaryformat (argv[2], &B, &rowsB, &colsB);
}
if (p == 1) {
create_matrix(&C,rowsA,colsB);
matrix_multiplication (&A,&B,&C,rowsA,colsB,colsA);
char* filename = "matrix_C.bin";
write_matrix_binaryformat (filename, C, rowsA, colsB);
free(A);
free(B);
free(C);
MPI_Finalize();
return 0;
}
// For this assignment we have chosen to bcast the whole matrix B:
MPI_Bcast (&B, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast (&colsA, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&colsB, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&rowsA, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&rowsB, 1, MPI_INT, 0, MPI_COMM_WORLD);
local_rows = BLOCK_SIZE(id, p, rowsA);
/* SCATTER VALUES */
int *proc_elements = (int*)malloc(p*sizeof(int)); // amount of elements for each processor
int *displace = (int*)malloc(p*sizeof(int)); // displacement of elements for each processor
int i;
for (i = 0; i<p; i++) {
proc_elements[i] = BLOCK_SIZE(i, p, rowsA)*colsA;
displace[i] = BLOCK_LOW(i, p, rowsA)*colsA;
}
create_matrix(&local_A,local_rows,colsA);
MPI_Scatterv(&A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,&local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
/* END SCATTER VALUES */
create_matrix (&local_C,local_rows,colsB);
matrix_multiplication (&local_A,&B,&local_C,local_rows,colsB,colsA);
/* GATHER VALUES */
MPI_Gatherv(&local_C[0], rowsA*colsB, MPI_DOUBLE,&C[0],
&proc_elements[0],&displace[0],MPI_DOUBLE,0, MPI_COMM_WORLD);
/* END GATHER VALUES */
char* filename = "matrix_C.bin";
write_matrix_binaryformat (filename, C, rowsA, colsB);
free (proc_elements);
free (displace);
free (local_A);
free (local_C);
free (A);
free (B);
free (C);
MPI_Finalize ();
return 0;
}
void create_matrix (double ***C,int rows,int cols) {
*C = (double**)malloc(rows*sizeof(double*));
(*C)[0] = (double*)malloc(rows*cols*sizeof(double));
int i;
for (i=1; i<rows; i++)
(*C)[i] = (*C)[i-1] + cols;
}
void matrix_multiplication (double ***A, double ***B, double ***C, int rowsC,int colsC,int colsA) {
double sum;
int i,j,k;
for (i = 0; i < rowsC; i++) {
for (j = 0; j < colsC; j++) {
sum = 0.0;
for (k = 0; k < colsA; k++) {
sum = sum + (*A)[i][k]*(*B)[k][j];
}
(*C)[i][j] = sum;
}
}
}
/* Reads a 2D array from a binary file*/
void read_matrix_binaryformat (char* filename, double*** matrix, int* num_rows, int* num_cols) {
int i;
FILE* fp = fopen (filename,"rb");
fread (num_rows, sizeof(int), 1, fp);
fread (num_cols, sizeof(int), 1, fp);
/* storage allocation of the matrix */
*matrix = (double**)malloc((*num_rows)*sizeof(double*));
(*matrix)[0] = (double*)malloc((*num_rows)*(*num_cols)*sizeof(double));
for (i=1; i<(*num_rows); i++)
(*matrix)[i] = (*matrix)[i-1]+(*num_cols);
/* read in the entire matrix */
fread ((*matrix)[0], sizeof(double), (*num_rows)*(*num_cols), fp);
fclose (fp);
}
/* Writes a 2D array in a binary file */
void write_matrix_binaryformat (char* filename, double** matrix, int num_rows, int num_cols) {
FILE *fp = fopen (filename,"wb");
fwrite (&num_rows, sizeof(int), 1, fp);
fwrite (&num_cols, sizeof(int), 1, fp);
fwrite (matrix[0], sizeof(double), num_rows*num_cols, fp);
fclose (fp);
}
My task is to do a parallel matrix multiplication of matrix A and B and gather the results in matrix C.
I am doing this by dividing matrix A into row-wise pieces; each process is going to use its piece to multiply matrix B, and give back its piece of the result. I will then gather all the pieces from the processes and put them together into matrix C.
I already posted a similar question, but this code is improved and I have made progress; however, I am still getting a segmentation fault after the MPI_Scatterv call.
So I see a few problems right away:
MPI_Bcast (&B, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
Here, you're passing not a pointer to doubles, but a pointer to a pointer to a pointer to a double (B is defined as double **B) and you're telling MPI to follow that pointer and send 1 double from there. That is not going to work.
You might think that what you're accomplishing here is sending the pointer to the matrix, from which all tasks can read the array -- that doesn't work. The processes don't share a common memory space (that's why MPI is called distributed memory programming) and the pointer doesn't go anywhere. You're actually going to have to send the contents of the matrix,
MPI_Bcast (&(B[0][0]), rowsB*colsB, MPI_DOUBLE, 0, MPI_COMM_WORLD);
and you're going to have to make sure the other processes have correctly allocated memory for the B matrix ahead of time.
There's similar pointer problems elsewhere:
MPI_Scatterv(&A[0], ..., &local_A[0]
Again, A is a pointer to a pointer to doubles (double **A) as is local_A, and you need to be pointing MPI to pointer to doubles for this to work, something like
MPI_Scatterv(&(A[0][0]), ..., &(local_A[0][0])
that error seems to be present in all the communications routines.
Remember that anything that looks like (buffer, count, TYPE) in MPI means that the MPI routines follow the pointer buffer and send the next count pieces of data of type TYPE from there. MPI can't follow pointers within the buffer you sent because in general it doesn't know they're there. It just takes the next (count * sizeof(TYPE)) bytes from pointer buffer and does whatever communication is appropriate with them. So you have to pass it a pointer to a stream of data of type TYPE.
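For instance, here is a minimal sketch of the corrected pattern for your code (assuming the matrix dimensions are broadcast first, so that non-root ranks can allocate B and local_A contiguously, with create_matrix(), before the collectives):
/* Non-root ranks must allocate B before the broadcast. */
if (id != 0)
    create_matrix(&B, rowsB, colsB);
/* Broadcast the contents of B, not the pointer B itself. */
MPI_Bcast(&(B[0][0]), rowsB * colsB, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/* Scatter rows of A; A is only valid on the root, so guard it. */
MPI_Scatterv(id == 0 ? &(A[0][0]) : NULL, proc_elements, displace,
             MPI_DOUBLE, &(local_A[0][0]), local_rows * colsA,
             MPI_DOUBLE, 0, MPI_COMM_WORLD);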
Having said all that, it would be a lot easier to work with you on this if you had narrowed things down a bit; right now the program you've posted includes a lot of I/O stuff that's irrelevant, and it means that no one can just run your program to see what happens without first figuring out the matrix format and then generating two matrices on their own. When posting a question about source code, you really want to post (a) a small bit of source which (b) reproduces the problem and (c) is completely self-contained.
Consider this an extended comment, as Jonathan Dursi has already given a fairly elaborate answer. Your matrices are represented in a somewhat unusual way, but at least you followed the advice given on your other question and allocate space for them as contiguous blocks and not separately for each row.
Given that, you should replace:
MPI_Scatterv(&A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,&local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
with
MPI_Scatterv(A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
A[0] already points to the beginning of the matrix data and there is no need to make a pointer to it. The same goes for local_A[0] as well as for the parameters to the MPI_Gatherv() call.
It has been said many times already - MPI doesn't do pointer chasing and only works with flat buffers.
I've also noticed another mistake in your code - memory for your matrices is not freed correctly. You are only freeing the array of pointers and not the matrix data itself:
free(A);
should really become
free(A[0]); free(A);
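A small helper mirroring create_matrix() keeps that two-step free in one place (a sketch; destroy_matrix is not a name from the original code):
/* Frees a matrix allocated by create_matrix(): first the contiguous
 * element block, then the array of row pointers. */
void destroy_matrix(double ***M) {
    free((*M)[0]); /* element storage */
    free(*M);      /* row-pointer array */
    *M = NULL;
}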

MPI_Bcast a dynamic 2d array of structs

I followed Jonathan's code from here ( MPI_Bcast a dynamic 2d array ) to MPI_Bcast a dynamically allocated 2d array of structs. The struct is as follows:
typedef struct {
char num[MAX];
int neg;
} bignum;
Since this is not a native datatype, I used MPI_Type_create_struct to commit this datatype.
I omit the code here because it works on another project I was doing.
The method I used to dynamically create the (contiguous?) array is by calling:
bignum **matrix = creatematrix(DIMMAX,DIMMAX);
which is implemented as:
bignum ** creatematrix (int num_rows, int num_cols){
int i;
bignum *one_d = (bignum*)malloc (num_rows * num_cols * sizeof (bignum));
bignum **two_d = (bignum**)malloc (num_rows * sizeof (bignum *));
for (i = 0; i < num_rows; i++)
two_d[i] = &(one_d[i*num_cols]);//one_d + (i * num_cols);
return two_d;
}
Now I'd get input and store it inside matrix and call MPI_Bcast as:
MPI_Bcast(&(matrix[0][0]), DIMMAX*DIMMAX, Bignum, 0, MPI_COMM_WORLD);
There seems to be no segmentation fault, but the problem is that only the first row (matrix[0]) is broadcast. All other ranks besides the root have data only for the first row; the other rows remain untouched.
It looks like there's something more than allocating a contiguous memory block. Is there anything I missed causing broadcast to be unsuccessful?
EDIT:
I stumbled upon a weird workaround, using nested structs. Can anyone explain why this method works while the one above doesn't?
typedef struct {
char num[MAX];
int neg;
} bignum;
typedef struct {
bignum elements[DIMMAX];
} row;
int main(/*int argc,char *argv[]*/){
int dim, my_rank, comm_sz;
int i,j,k; //iterators
bignum global_sum = {{0},0};
row *matrix = (row*)malloc(sizeof(row)*DIMMAX);
row *tempmatrix = (row*)malloc(sizeof(row)*DIMMAX);
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
//CREATE DERIVED DATATYPE Bignum
MPI_Datatype Bignum;
MPI_Datatype type[4] = {MPI_LB, MPI_CHAR, MPI_INT, MPI_UB};
int blocklen[4] = {1,MAX,1,1};
MPI_Aint disp[4];
//get offsets
MPI_Get_address(&global_sum, disp);
MPI_Get_address(&global_sum.num, disp+1);
MPI_Get_address(&global_sum.neg, disp+2);
//MPI_Get_address(&local_sum+1, disp+3);
int base;
base = disp[0];
for (i=0;i<3;i++){
disp[i] -= base;
}
disp[3] = disp[2]+4; //int
MPI_Type_create_struct(4,blocklen,disp,type,&Bignum);
MPI_Type_commit(&Bignum);
//CREATE DERIVED DATATYPE BignumRow
MPI_Datatype BignumRow;
MPI_Datatype type2[3] = {MPI_LB, Bignum, MPI_UB};
int blocklen2[3] = {1,DIMMAX,1};
MPI_Aint disp2[3] = {0,0,DIMMAX*(MAX+4)};
MPI_Type_create_struct(3,blocklen2,disp2,type2,&BignumRow);
MPI_Type_commit(&BignumRow);
MPI_Bcast(&dim, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
MPI_Bcast(matrix, dim, BignumRow, 0, MPI_COMM_WORLD);
This seems to come up a lot: If you pass around multi-dimensional arrays with MPI in C, you sort of have to think of a multi-dimensional array as an array of arrays. I think in your first example you've gotten lucky that matrix[0][0] (due to the consecutive mallocs?) sits at the beginning of a large block of memory (and hence, no seg fault).
In the second (working) example, you've described the memory addresses of not just the beginning of the memory region (matrix[0][0]), but also all the internal pointers as well.
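As a side note, the MPI_LB/MPI_UB markers used above are deprecated (and removed in MPI-3); the usual way to pin a struct type's extent is MPI_Type_create_resized. A sketch for the bignum type above (offsetof requires <stddef.h>):
/* Describe the two fields, then force the extent to sizeof(bignum) so
 * that consecutive array elements line up, trailing padding included. */
MPI_Datatype tmp, Bignum;
int blocklen[2] = { MAX, 1 };
MPI_Aint disp[2] = { offsetof(bignum, num), offsetof(bignum, neg) };
MPI_Datatype type[2] = { MPI_CHAR, MPI_INT };
MPI_Type_create_struct(2, blocklen, disp, type, &tmp);
MPI_Type_create_resized(tmp, 0, sizeof(bignum), &Bignum);
MPI_Type_commit(&Bignum);
MPI_Type_free(&tmp);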
