My assignment is to parallelize some provided sequential code that takes an array of numbers and merges it with another array in order to come up with a sorted list. The first step is to initialize the array on each processor, then fill it with values on the root process. Next, I'm supposed to scatter the array out to the other processors so each processor has a chunk of the data for sorting, then sort these small lists locally and then gather them back up with the root process. My problem here is that no matter what I try the scatter won't actually send any data to the other processors. The array in these processors is always full of 0's, whereas the root processor does contain a list of numbers. Anybody care to take a look at my code and tell me what I'm missing?
The original sequential code
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/**
* Prints a vector without decimal places...
*/
void print_vector(int n, double vector[]) {
int i;
printf("[%.0f", vector[0]);
for (i = 1; i < n; i++) {
printf(", %.0f", vector[i]);
}
printf("]\n");
}
/**
* Just checks sequentially if everything is in ascending order.
*
* Note: we don't care about stability in this sort since we
* have no data attached to the double value, see
* http://en.wikipedia.org/wiki/Sorting_algorithm#Stability
* for more detail.
*/
void test_correctness(int n, double v[]) {
int i;
for (i = 1; i < n; i++) {
if (v[i] < v[i-1]) {
printf("Correctness test found error at %d: %.4f is not < %.4f but appears before it\n", i, v[i-1], v[i]);
}
}
}
/**
* Initialize random vector.
*
* You may not parallelize this (even though it could be done).
*/
void init_random_vector(int n, double v[]) {
int i, j;
for (i = 0; i < n; i++) {
v[i] = rand() % n;
}
}
double *R;
/**
* Merges two arrays, left and right, and leaves result in R
*/
void merge(double *left_array, double *right_array, int leftCount, int rightCount) {
int i,j,k;
// i - to mark the index of left aubarray (left_array)
// j - to mark the index of right sub-raay (right_array)
// k - to mark the index of merged subarray (R)
i = 0; j = 0; k =0;
while (i < leftCount && j < rightCount) {
if(left_array[i] < right_array[j])
R[k++] = left_array[i++];
else
R[k++] = right_array[j++];
}
while (i < leftCount)
R[k++] = left_array[i++];
while (j < rightCount)
R[k++] = right_array[j++];
}
/**
* Recursively merges an array of n values using group sizes of s.
* For example, given an array of 128 values and starting s value of 16
* will result in 8 groups of 16 merging into 4 groups of 32, then recursively
* calling merge_all which merges them into 2 groups of 64, then once more
* recursively into 1 group of 128.
*/
void merge_all(double *v, int n, int s) {
if (s < n) {
int i;
for (i = 0; i < n; i += 2*s) {
merge(v+i, v+i+s, s, s);
// result is in R starting at index 0
memcpy(v+i, R, 2*s*sizeof(double));
}
merge_all(v, n, 2*s);
}
}
void merge_sort(int n, double *v) {
merge_all(v, n, 1);
}
int main(int argc, char *argv[]) {
if (argc < 2) {
// () means optional
printf("usage: %s n (seed)\n", argv[0]);
return 0;
}
int n = atoi(argv[1]);
int seed = 0;
if (argc > 2) {
seed = atoi(argv[2]);
}
srand(seed);
int mpi_p, mpi_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_p);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
// init the temporary array used later for merge
R = malloc(sizeof(double)*n);
// all MPI processes will allocate a vector of length n, but only
// process 0 will initialize the values. You must distribute the
// values to processes. You will also likely need to allocate
// additional memory in your processes, make sure to clean it up
// by adding a correct free at the end of each process.
double *v = malloc(sizeof(double)*n);
if (mpi_rank == 0) {
init_random_vector(n, v);
}
double start = MPI_Wtime();
// do all the work ourselves! (you should make a better algorithm here!)
if (mpi_rank == 0) {
merge_sort(n, v);
}
double end = MPI_Wtime();
if (mpi_rank == 0) {
printf("Total time to solve with %d MPI Processes was %.6f\n", mpi_p, (end-start));
test_correctness(n, v);
}
MPI_Finalize();
free(v);
free(R);
return 0;
}
The parts I edited
double start = MPI_Wtime();
// do all the work ourselves! (you should make a better algorithm here!)
if (mpi_rank == 0)
{
//Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
MPI_Scatter(&v, sizeOf, MPI_DOUBLE, &v, sizeOf, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
if (mpi_rank > 0)
{
//merge_sort(sizeOf, R);
printf("%f :v[0]", v[0]); printf("%s", "\n");
printf("%f :v[1]", v[1]); printf("%s", "\n");
printf("%f :v[2]", v[0]); printf("%s", "\n");
}
if (mpi_rank == 0)
{
//Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
//MPI_Gather(&v, sizeOf, MPI_DOUBLE, test, sizeOf, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//merge(v, test, n, sizeOf);
merge_sort(n, v);
}
double end = MPI_Wtime();
The code should output three values from each processor that isn't the root, but it just gives me 0's. I tried many different parameters inside the scatter function call, to no avail. Almost everything I tried produces the same output found below. Sample output:
0.000000 :v[0]
0.000000 :v[1]
0.000000 :v[2]
Total time to solve with 2 MPI Processes was 0.000052
EDIT: Calling scatter from every process is also something I tried, as it sounds like this is the way you're supposed to call it instead of just calling it from the root. However, then I get a bunch of errors instead of any output at all. Errors look like this:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: (128)
Failing at address: (nil)
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: (128)
Failing at address: (nil)
Something is clearly wrong here. Not sure what I'm doing wrong though.
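For reference, MPI_Scatter is a collective call, so every rank in the communicator has to call it, and the buffer arguments should be the pointers themselves (v, not &v, since &v is the address of the pointer variable rather than of the data). The root also may not pass the same buffer as both send and receive buffer unless it uses MPI_IN_PLACE. The sketch below leaves MPI out and only imitates the data flow described in the question in plain C: split into p chunks, sort each chunk "locally", then fold the sorted chunks together with the same merge logic. The names chunked_sort and merge_runs are hypothetical, and it assumes n is divisible by p.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Merge two sorted runs into out; same logic as merge() in the question,
 * but writing to an explicit buffer instead of the global R. */
static void merge_runs(const double *left, int nl,
                       const double *right, int nr, double *out) {
    int i = 0, j = 0, k = 0;
    while (i < nl && j < nr)
        out[k++] = (left[i] < right[j]) ? left[i++] : right[j++];
    while (i < nl) out[k++] = left[i++];
    while (j < nr) out[k++] = right[j++];
}

/* Hypothetical helper: sorts v[0..n) the way p ranks would.
 * Assumes n is divisible by p. */
void chunked_sort(double *v, int n, int p) {
    assert(p > 0 && n >= 0 && n % p == 0);
    if (n == 0) return;
    int chunk = n / p;
    double *tmp = malloc(sizeof(double) * (size_t)n);
    if (tmp == NULL) abort();

    /* "Local sort": with MPI, rank r would do this on its own chunk after
     * MPI_Scatter(v, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, comm). */
    for (int r = 0; r < p; r++)
        qsort(v + r * chunk, (size_t)chunk, sizeof(double), cmp_double);

    /* "After the gather": the root folds each sorted chunk into the
     * already-sorted prefix of the array. */
    for (int r = 1; r < p; r++) {
        merge_runs(v, r * chunk, v + r * chunk, chunk, tmp);
        memcpy(v, tmp, sizeof(double) * (size_t)((r + 1) * chunk));
    }
    free(tmp);
}
```

In a real MPI version each rank would receive into its own buffer of chunk doubles, and only the final merge loop would run on rank 0.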
Good afternoon, I've developed a 2D FFT in MPI for scientific purposes.
Everything worked until I implemented MPI_Scatterv.
Since then something odd has been happening. As long as I stay below 64 modes I have no problems, but when I go above that I get the message:
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 0 with PID 0 on node MacBook-Pro-di-Mirco
>exited on signal 11 (Segmentation fault: 11).
I can't figure out where the mistake is, but I'm pretty sure it is in MPI_Scatterv.
Could anyone help me, please?
/********************************** Setup factors for scattering **********************************/
// Alloc the arrays
int* displs = (int *)malloc(size*sizeof(int));
int* scounts = (int *)malloc(size*sizeof(int));
int* receive = (int *)malloc(size*sizeof(int));
// Setup matrix
int modes_per_proc[size];
for (int i = 0; i < size; i++){
modes_per_proc[i] = 0;
}
// Set modes per processor
cores_handler( nx*nz, size, modes_per_proc);
// Scattering parameters
for (int i=0; i<size; ++i) {
scounts[i] = modes_per_proc[i]*ny*2;
receive[i] = scounts[i];
displs[i] = displs[i-1] + modes_per_proc[i-1] *ny*2; // *2 to handle complex numbers
if (i == 0 ) displs[0] = 0;
}
/************************************************ Data scattering ***********************************************/
MPI_Scatterv(U, scounts, displs, MPI_DOUBLE, u, receive[rank] , MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
The cores_handler function:
void cores_handler( int modes, int size, int modes_per_proc[size]) {
int rank =0;
int check=0;
for (int i = 0; i < modes; i++) {
modes_per_proc[rank] = modes_per_proc[rank]+1;
rank = rank+1;
if (rank == size ) rank = 0;
}
for (int i = 0; i < size; i++){
//printf("%d modes on rank %d\n", modes_per_proc[i], i);
check = check+modes_per_proc[i];
}
if ( (int)(check - modes) != 0 ) {
printf("[ERROR] check - modes = %d!!\nUnable to scatter modes properly\nAbort... \n", check - modes);
}
}
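One thing that stands out in the scattering loop above is that displs[i-1] and modes_per_proc[i-1] are read when i == 0, i.e. at index -1, before displs[0] is overwritten. That is an out-of-bounds read, although it may or may not be what triggers this particular segfault. A minimal sketch of the same layout computed as a running offset is below. build_scatter_layout is a hypothetical name; it assumes the round-robin split done by cores_handler, where the first (modes % size) ranks end up with one extra mode, and per_mode stands for ny*2.

```c
#include <assert.h>

/* Hypothetical helper: fills scounts[] and displs[] for MPI_Scatterv.
 * modes:    total number of modes (nx*nz in the question)
 * size:     number of ranks
 * per_mode: values per mode (ny*2 in the question, *2 for complex numbers) */
void build_scatter_layout(int modes, int size, int per_mode,
                          int scounts[], int displs[]) {
    assert(modes >= 0 && size > 0 && per_mode >= 0);
    int base = modes / size;   /* modes every rank gets */
    int extra = modes % size;  /* first 'extra' ranks get one more */
    int offset = 0;            /* running displacement, in elements */
    for (int i = 0; i < size; i++) {
        int my_modes = base + (i < extra ? 1 : 0);
        scounts[i] = my_modes * per_mode;
        displs[i] = offset;    /* never touches index i-1 */
        offset += scounts[i];
    }
}
```

The receive count on each rank would then be scounts[rank], and the receive buffer u needs room for that many doubles.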
I'm a beginner in MPI programming. I'm trying to write a program that takes one-dimensional arrays of different sizes (multiples of 100, 1000, 10000, 1000000 and so on) at run time and scatters them to the allotted processor cores. The processor cores calculate the sum of the elements they receive and send the sum back. The root process prints the sum of the elements in the input array.
I used MPI_Scatter() and MPI_Reduce() to solve the problem. However, when the number of processor cores allotted is odd, some of the data gets left out. For example, with an input data size of 100 and 3 processes, only 99 elements are added and the last one is left out.
I searched for alternatives and found that MPI_Scatterv() can be used for uneven distribution of data, but I could not find any material to guide me through its implementation. Can someone help me? I'm posting my code here. Thanks in advance.
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
void readArray(char * fileName, double ** a, int * n);
int Numprocs, MyRank;
int mpi_err;
#define Root = 0
void init_it(int *argc, char ***argv) {
mpi_err = MPI_Init(argc, argv);
mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
}
int main(int argc, char** argv) {
/* .......Variables Initialisation ......*/
int index;
double *InputBuffer, *RecvBuffer, sum=0.0, psum = 0.0;
double ptime = 0.0, Totaltime= 0.0,startwtime = 0.0, endwtime = 0.0;
int Scatter_DataSize;
int DataSize;
FILE *fp;
init_it(&argc,&argv);
if (argc != 2) {
fprintf(stderr, "\n*** Usage: arraySum <inputFile>\n\n");
exit(1);
}
if (MyRank == 0) {
startwtime = MPI_Wtime();
printf("Number of nodes running %d\n",Numprocs);
/*...... Read input....*/
readArray(argv[1], &InputBuffer, &DataSize);
printf("Size of array %d\n", DataSize);
}
if (MyRank!=0) {
MPI_Recv(&DataSize, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, NULL);
}
else {
int i;
for (i=1;i<Numprocs;i++) {
MPI_Send(&DataSize, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
d[i]= i*Numprocs;
}
}
Scatter_DataSize = DataSize / Numprocs;
RecvBuffer = (double *)malloc(Scatter_DataSize * sizeof(double));
MPI_Barrier(MPI_COMM_WORLD);
mpi_err = MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_DOUBLE,
RecvBuffer, Scatter_DataSize, MPI_DOUBLE,
0, MPI_COMM_WORLD);
for (index = 0; index < Scatter_DataSize; index++) {
psum = psum + RecvBuffer[index];
}
//printf("Processor %d computed sum %f\n", MyRank, psum);
mpi_err = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (MyRank == 0) {
endwtime = MPI_Wtime();
Totaltime = endwtime - startwtime;
printf("Total sum %f\n",sum);
printf("Total time %f\n", Totaltime);
}
MPI_Finalize();
return 0;
}
void readArray(char * fileName, double ** a, int * n) {
int count, DataSize;
double * InputBuffer;
FILE * fin;
fin = fopen(fileName, "r");
if (fin == NULL) {
fprintf(stderr, "\n*** Unable to open input file '%s'\n\n",
fileName);
exit(1);
}
fscanf(fin, "%d\n", &DataSize);
InputBuffer = (double *)malloc(DataSize * sizeof(double));
if (InputBuffer == NULL) {
fprintf(stderr, "\n*** Unable to allocate %d-length array", DataSize);
exit(1);
}
for (count = 0; count < DataSize; count++) {
fscanf(fin, "%lf", &InputBuffer[count]);
}
fclose(fin);
*n = DataSize;
*a = InputBuffer;
}
In your case, you can simply work with the sendcount[] array of MPI_Scatterv. A trivial implementation would be to compute the number of elements (say Nelement) of type sendtype that all the processes but one will receive. One of the processes (for instance the last one) will get the remaining data. In that case, sendcount[i] = Nelement for indexes i from 0 to p-2 (p is the number of processes in the communicator, which for you is MPI_COMM_WORLD). Then process p-1 will get sendcount[p-1] = DataSize-Nelement*(p-1). As for the array of displacements displs[], you just have to specify the displacement (in number of elements) from which to take the outgoing data for process i (cf. [1], page 161). For the previous example this would be:
for (i=0; i<p; ++i)
displs[i]=Nelement*i;
If you decide that another process q should take the remaining data instead, remember to set the correct displacements for the processes after q (displs[q+1] onward), with 0 ≤ q < q+1 ≤ p-1.
[1] MPI: A Message-Passing Interface Standard (Version 3.1): http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
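The scheme described above can be sketched as a small helper that needs no MPI to run. counts_last_takes_rest is a hypothetical name; the arrays are the ones you would hand to MPI_Scatterv as its sendcounts and displs arguments.

```c
#include <assert.h>

/* Hypothetical helper: every rank gets Nelement = data_size / p elements,
 * and the last rank additionally takes whatever is left over. */
void counts_last_takes_rest(int data_size, int p,
                            int sendcount[], int displs[]) {
    assert(p > 0 && data_size >= 0);
    int nelement = data_size / p;
    for (int i = 0; i < p; i++) {
        sendcount[i] = nelement;
        displs[i] = nelement * i;  /* offset in elements, not bytes */
    }
    /* Rank p-1 receives the remainder as well. */
    sendcount[p - 1] = data_size - nelement * (p - 1);
}
```

With the variable names from the question, the call would then look like MPI_Scatterv(InputBuffer, sendcount, displs, MPI_DOUBLE, RecvBuffer, sendcount[MyRank], MPI_DOUBLE, 0, MPI_COMM_WORLD), with RecvBuffer allocated for sendcount[MyRank] doubles and the summing loop running to sendcount[MyRank].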
The computation of the Scatter_Datasize:
Scatter_DataSize = DataSize / Numprocs;
is correct only if DataSize is a multiple of Numprocs. When it is not (for example 100 elements on 3 processes, where 100 / 3 = 33 and one element is left over), you should explicitly compute the remainder and assign it to one MPI process; I suggest the last.
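As a variation on this (not what the answer above proposes, which gives the whole remainder to the last process), the leftover elements can also be spread one each over the first DataSize % Numprocs ranks. The sketch below imitates the scatter and the reduce in a plain loop so the arithmetic can be checked without MPI; chunk_count, chunk_start and sum_by_chunks are hypothetical names.

```c
#include <assert.h>

/* Number of elements owned by 'rank' when the remainder is spread
 * one element each over the first (data_size % p) ranks. */
int chunk_count(int data_size, int p, int rank) {
    assert(p > 0 && rank >= 0 && rank < p);
    return data_size / p + (rank < data_size % p ? 1 : 0);
}

/* Index of the first element owned by 'rank' under the same scheme. */
int chunk_start(int data_size, int p, int rank) {
    assert(p > 0 && rank >= 0 && rank < p);
    int base = data_size / p, extra = data_size % p;
    return rank * base + (rank < extra ? rank : extra);
}

/* Imitates scatter + per-rank partial sum + reduce in one process. */
double sum_by_chunks(const double *data, int data_size, int p) {
    double total = 0.0;
    for (int rank = 0; rank < p; rank++) {
        int start = chunk_start(data_size, p, rank);
        int count = chunk_count(data_size, p, rank);
        double psum = 0.0;                 /* what one rank would compute */
        for (int i = 0; i < count; i++)
            psum += data[start + i];
        total += psum;                     /* what MPI_Reduce would add up */
    }
    return total;
}
```

For 100 elements on 3 processes this gives chunks of 34, 33 and 33 elements, so no element is dropped.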
I'm trying to create a multithreaded application in C for Linux with the pthreads library that approximates pi using an infinite series with N+1 terms. The variables N and T are passed on the command line. I am using the Nilakantha approximation formula for pi. N is the upper limit of the number sequence to sum and T is the number of child threads that calculate that sum. For example, if I run the command "./pie 100 4", the parent thread will create 4 child threads indexed 0 to 3. I have a global variable called vsum, a double array allocated dynamically with malloc, to hold the values. So with 4 threads and 100 as the upper bound, my program should compute:
Thread 0 computes the partial sum for i going from 0 to 24 stored to an element vsum[0]
Thread 1 computes the partial sum for i going from 25 to 49 stored to an element vsum[1]
Thread 2 computes the partial sum for i going from 50 to 74 stored to an element vsum[2]
Thread 3 computes the partial sum for i going from 75 to 99 stored to an element vsum[3]
After each thread has finished its calculation, the main thread computes the total by adding together all the values from vsum[0] to vsum[T-1].
I'm just starting to learn about threads and processes. Any help or advice would be appreciated. Thank you.
Code I wrote so far:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
double *vsum;
int N, T;
void *PI(void *sum) //takes param sum and gets close to pi
{
int upper = (int)sum;
double pi = 0;
int k = 1;
for (int i = (N/T)*upper; i <= (N/T)*(upper+1)-1; i++)
{
pi += k*4/((2*i)*(2*i+1)*(2*i+2));
if(i = (N/T)*(upper+1)-1)
{
vsum[upper] = pi;
}
k++;
}
pthread_exit(0);
}
int main(int argc, char*argv[])
{
T = atoi(argv[2]);
N = atoi(argv[1]);
if (N<T)
{
fprintf(stderr, "Upper bound(N) < # of threads(T)\n");
return -1;
}
int pie = 0;
pthread_t tid[T]; //thread identifier
pthread_attr_t attr; //thread attributes
vsum = (double *)malloc(sizeof(double));//creates dyn arr
//Initialize vsum to [0,0...0]
for (int i = 0; i < T; i++){
{
vsum[i] = 0;
}
if(argc!=2) //command line does not give proper # of values
{
fprintf(stderr, "usage: commandline error <integer values>\n");
return -1;
}
if (atoi(argv[1]) <0) //if its is negative/sum error
{
fprintf(stderr, "%d must be >=0\n", atoi(argv[1]));
return -1;
}
//CREATE A LOOP THAT MAKES PARAM N #OF THREADS
pthread_attr_init(&attr);
for(int j =0; j < T;j++)
{
int from = (N/T)*j;
int to = (N/T)*(j+1)-1;
//CREATE ARRAY VSUM TO HOLD VALUES FOR PI APPROX.
pthread_create(&tid[j],&attr,PI,(void *)j);
printf("Thread %d computes the partial sum for i going from %d to %d stored to an element vsum[%d]\n", j, from, to, j);
}
//WAITS FOR THREADS TO FINISH
for(int j =0; j <T; i++)
{
pthread_join(tid[j], NULL);
}
//LOOP TO ADD ALL THE vsum array values to get pi approximation
for(int i = 0; i < T; i++)
{
pie += vsum[i];
}
pie = pie +3;
printf("pi computed with %d terms in %d threads is %d\n",N,T,pie);
vsum = realloc(vsum, 0);
pthread_exit(NULL);
return 0;
}
Here is the compiler error I get and can't track down. What am I missing here?
^
pie.c:102:1: error: expected declaration or statement at end of input
}
When I try to run my program I get the following:
./pie.c: line 6: double: command not found
./pie.c: line 7: int: command not found
./pie.c: line 8: int: command not found
./pie.c: line 10: syntax error near unexpected token `('
./pie.c: line 10: `void *PI(void *sum) //takes param sum and gets close to pi'
I haven't looked at the logic of your code, but I see the following programming errors.
Change
pthread_create(&tid[j],&attr,PI,j);
to
pthread_create(&tid[j],&attr,PI,(void *)j);
pthread_create() takes a void * as its 4th parameter, which is passed on to the thread function.
Also fix your thread function PI to use the passed parameter as an int, like
void *PI(void *sum) //takes param sum and gets close to pi
{
int upper = (int)sum; //don't use `atoi` as passed param is int.
...
//your existing code
}
The 3rd error is in the line
realloc(vsum, 0);
By passing 0 to realloc, you are effectively just freeing vsum, so you can simply use free(vsum). If you do want to reallocate, you should keep the newly allocated memory returned by the function, something like vsum = realloc(vsum, 0);
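Putting these fixes together, a minimal sketch of the partial-sum scheme from the question could look like the following. It is only one possible arrangement, under a few assumptions: it sums N terms of the Nilakantha series starting at the 4/(2*3*4) term, the last thread picks up any leftover terms when N is not a multiple of T, and nilakantha_pi and partial_sum are hypothetical names. Compared with the posted code it allocates T doubles for vsum (not one), does the division in floating point, and passes the thread index through intptr_t so the cast is well defined on 64-bit systems.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static double *vsum;  /* one partial sum per thread */
static int N, T;      /* number of terms, number of threads */

/* Thread function: sums this thread's slice of the series into vsum[id]. */
static void *partial_sum(void *arg) {
    int id = (int)(intptr_t)arg;
    int chunk = N / T;
    int first = chunk * id;
    int last = (id == T - 1) ? N : first + chunk;  /* last thread takes leftovers */
    double sum = 0.0;
    for (int i = first; i < last; i++) {
        double k = 2.0 * (i + 1);                  /* 2, 4, 6, ... */
        double term = 4.0 / (k * (k + 1.0) * (k + 2.0));
        sum += (i % 2 == 0) ? term : -term;        /* alternating signs */
    }
    vsum[id] = sum;
    return NULL;
}

/* Hypothetical driver: returns 3 + the sum of all partial sums. */
double nilakantha_pi(int n, int t) {
    assert(t > 0 && n >= t);
    N = n;
    T = t;
    vsum = calloc((size_t)T, sizeof(double));      /* T slots, zero-initialised */
    pthread_t *tid = malloc((size_t)T * sizeof(pthread_t));
    if (vsum == NULL || tid == NULL) abort();
    for (int j = 0; j < T; j++)
        if (pthread_create(&tid[j], NULL, partial_sum, (void *)(intptr_t)j) != 0)
            abort();
    double pi = 3.0;
    for (int j = 0; j < T; j++) {
        pthread_join(tid[j], NULL);                /* wait, then accumulate */
        pi += vsum[j];
    }
    free(tid);
    free(vsum);
    return pi;
}
```

Two side notes on the messages in the question: the "expected declaration or statement at end of input" error is consistent with the doubled opening brace after the for loop that initialises vsum, and the "command not found" lines come from the shell trying to execute pie.c as a script. The source has to be compiled first, for example with gcc -pthread pie.c -o pie, and then run as ./pie 100 4.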
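The signature of pthread_create is
pthread_create(threadId, threadAttribute, startRoutine, argument passed to startRoutine);
Ex:
#include <pthread.h>
#include <stdio.h>
/* Thread start routines must take a void * and return a void *. */
void *printLetter(void *p)
{
int i=0;
char c=*(char *)p; /* p points at the char, so dereference it */
while (i<10000)
{
printf("%c",c);
i++; /* without this the loop never terminates */
}
return NULL;
}
int main()
{
pthread_t thread_id;
char c='x';
pthread_create (&thread_id, NULL, printLetter, &c);
pthread_join (thread_id, NULL);
return 0;
}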
/*
Matricefilenames:
small matrix A.bin of dimension 100 × 50
small matrix B.bin of dimension 50 × 100
large matrix A.bin of dimension 1000 × 500
large matrix B.bin of dimension 500 × 1000
An MPI program should be implemented such that it can
• accept two file names at run-time,
• let process 0 read the A and B matrices from the two data files,
• let process 0 distribute the pieces of A and B to all the other processes,
• involve all the processes to carry out the the chosen parallel algorithm
for matrix multiplication C = A * B ,
• let process 0 gather, from all the other processes, the different pieces
of C ,
• let process 0 write out the entire C matrix to a data file.
*/
int main(int argc, char *argv[]) {
printf("Oblig 2 \n");
double **matrixa;
double **matrixb;
int ma,na,my_ma,my_na;
int mb,nb,my_mb,my_nb;
int i,j,k;
int myrank,numprocs;
int konstanta,konstantb;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
if(myrank==0) {
read_matrix_binaryformat ("small_matrix_A.bin", &matrixa, &ma, &na);
read_matrix_binaryformat ("small_matrix_B.bin", &matrixb, &mb, &nb);
}
//mpi broadcast
MPI_Bcast(&ma,1,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&mb,1,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&na,1,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&nb,1,MPI_INT,0,MPI_COMM_WORLD);
fflush(stdout);
int resta = ma % numprocs;//remainder: number of processes that get the larger row count
//int restb = mb % numprocs;
if (myrank == 0) {
printf("ma : %d",ma);
fflush(stdout);
printf("mb : %d",mb);
fflush(stdout);
}
MPI_Barrier(MPI_COMM_WORLD);
if (resta == 0) {
my_ma = ma / numprocs;
printf("null rest\n ");
fflush(stdout);
} else {
if (myrank < resta) {
my_ma = ma / numprocs + 1;//remember the + 1
} else {
my_ma = ma / numprocs; //integer division gives the lower value!
}
}
my_na = na;
my_nb = nb;
double **myblock = malloc(my_ma*sizeof(double*));
for(i=0;i<na;i++) {
myblock[i] = malloc(my_na*sizeof(double));
}
//send_cnt for scatterv
//________________________________________________________________________________________________________________________________________________
int* send_cnta = (int*)malloc(numprocs*sizeof(int));//array with the number of elements sent to each process: array[i] = number of elements, i is the process
int tot_elemsa = my_ma*my_na;
MPI_Allgather(&tot_elemsa,1,MPI_INT,&send_cnta[0],1,MPI_INT,MPI_COMM_WORLD);//arrays in C must be passed as &array[0]
//send_disp for scatterv
//__________________________________________________________________________________
int* send_dispa = (int*)malloc(numprocs*sizeof(int)); //why is disp needed
// int* send_dispb = (int*)malloc(numprocs*sizeof(int));
//disp: where in imagechars the first element for each process goes
fflush(stdout);
if(resta==0) {
send_dispa[myrank]=myrank*my_ma*my_na;
} else if(myrank<=resta) {
if(myrank<resta) {
send_dispa[myrank]=myrank*my_ma*my_na;
} else {//my_rank == rest
send_dispa[myrank]=myrank*(my_ma+1)*my_na;
konstanta=myrank*(my_ma+1)*my_na;
}
}
MPI_Bcast(&konstanta,1,MPI_INT,resta,MPI_COMM_WORLD);
if (myrank>resta){
send_dispa[myrank]=((myrank-resta)*(my_ma*my_na))+konstanta;
}
MPI_Allgather(&send_dispa[myrank],1,MPI_INT,&send_dispa[0],1,MPI_INT,MPI_COMM_WORLD);
//___________________________________________________________________________________
printf("print2: %d" , myrank);
fflush(stdout);
//recv_buffer for scatterv
double *recv_buffera=malloc((my_ma*my_na)*sizeof(double));
MPI_Scatterv(&matrixa[0], &send_cnta[0], &send_dispa[0], MPI_UNSIGNED_CHAR, &recv_buffera[0], my_ma*my_na, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
for(i=0; i<my_ma; i++) {
for(j=0; j<my_na; j++) {
myblock[i][j]=recv_buffera[i*my_na + j];
}
}
MPI_Finalize();
return 0;
}
OLD: I get three types of errors: a scatterv count error, segmentation fault 11, or the processes just get stuck. Which error I get seems to be random. I run the code with 2 procs each time. When it gets stuck, it gets stuck before the printf("print2: %d" , myrank);. When my friend runs the code on his own computer, also with two processes, he does not get past the first MPI_Bcast. Nothing is printed out when he runs it. Here is a link to the errors I get: http://justpaste.it/zs0
UPDATED PROBLEM: Now I only get a segmentation fault after " printf("print2: %d" , myrank); ", before the scatterv call. Even if I remove all the code after the printf statement I still get the segmentation fault, but only if I run the code with more than two procs.
I'm having a little difficulty tracing what you were trying to do. I think you're making the scatterv call more complicated than it needs to be though. Here's a snippet I had from a similar assignment this year. Hopefully it's a clearer example of how scatterv works.
/*********************************************************************
* Scatter A to All Processes
* - Using Scatterv for versatility.
*********************************************************************/
int *send_counts; // Send Counts
int *displacements; // Send Offsets
int chunk; // Number of Rows per Process (- Root)
int chunk_size; // Number of Doubles per Chunk
int remainder; // Number of Rows for Root Process
double * rbuffer; // Receive Buffer
// Do Some Math
chunk = m / (p - 1);
remainder = m % (p - 1);
chunk_size = chunk * n;
// Setup Send Counts
send_counts = malloc(p * sizeof(int));
send_counts[0] = remainder * n;
for (i = 1; i < p; i++)
send_counts[i] = chunk_size;
// Setup Displacements
displacements = malloc(p * sizeof(int));
displacements[0] = 0;
for (i = 1; i < p; i++)
displacements[i] = (remainder * n) + ((i - 1) * chunk_size);
// Allocate Receive Buffer
rbuffer = malloc(send_counts[my_rank] * sizeof(double));
// Scatter A Over All Processes!
MPI_Scatterv(A, // A
send_counts, // Array of counts [int]
displacements, // Array of displacements [int]
MPI_DOUBLE, // Sent Data Type
rbuffer, // Receive Buffer
send_counts[my_rank], // Receive Count - Per Process
MPI_DOUBLE, // Received Data Type
root, // Root
comm); // Comm World
MPI_Barrier(comm);
Also, this causes a segfault on my machine, even without MPI. Pretty sure it's the way myblock is being allocated. You should do what @Hristo suggested in the comments: allocate both matrices and the resultant matrix as flat arrays. That would eliminate the use of double pointers and make your life a whole lot simpler.
#include <stdio.h>
#include <stdlib.h>
void main ()
{
int na = 5;
int my_ma = 5;
int my_na = 5;
int i;
int j;
double **myblock = malloc(my_ma*sizeof(double*));
for(i=0;i<na;i++) {
myblock = malloc(my_na*sizeof(double));
}
unsigned char *recv_buffera=malloc((my_ma*my_na)*sizeof(unsigned char));
for(i=0; i<my_ma; i++) {
for(j=0; j<my_na; j++) {
myblock[i][j]=(float)recv_buffera[i*my_na + j];
}
}
}
Try allocating more like this:
// Allocate A, b, and y. Generate random A and b
double *buff=0;
if (my_rank==0)
{
int A_size = m*n, b_size = n, y_size = m;
int size = (A_size+b_size+y_size)*sizeof(double);
buff = (double*)malloc(size);
if (buff==NULL)
{
printf("Process %d failed to allocate %d bytes\n", my_rank, size);
MPI_Abort(comm,-1);
return 1;
}
// Set pointers
A = buff; b = A+m*n; y = b+n;
// Generate matrix and vector
genMatrix(m, n, A);
genVector(n, b);
}
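To make the flat-array suggestion concrete, here is a minimal sketch without MPI. It stores each matrix as one contiguous row-major buffer, so element (i, j) of a matrix with cols columns lives at index i*cols + j. The names alloc_matrix, matmul_flat and rows_for_rank are hypothetical. Note that in the question's code the row-pointer loop runs to na while only my_ma pointers were allocated, and the scatter uses MPI_UNSIGNED_CHAR on double data starting at &matrixa[0]; a flat double buffer sent as MPI_DOUBLE with counts of rows*cols elements avoids both.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a zero-initialised rows x cols matrix as one contiguous block. */
double *alloc_matrix(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    return calloc((size_t)rows * (size_t)cols, sizeof(double));
}

/* C = A * B, where A is m x k, B is k x n and C is m x n,
 * all stored row-major in flat arrays. */
void matmul_flat(const double *a, const double *b, double *c,
                 int m, int k, int n) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += a[i * k + l] * b[l * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Rows of A owned by 'rank' when the first (m % p) ranks get one extra
 * row, which is the split the question uses for my_ma. */
int rows_for_rank(int m, int p, int rank) {
    assert(p > 0 && rank >= 0 && rank < p);
    return m / p + (rank < m % p ? 1 : 0);
}
```

With this layout each rank's share of A is rows_for_rank(ma, numprocs, rank) * na consecutive doubles, which maps directly onto the counts and displacements that MPI_Scatterv expects.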
I have a 2D array which is distributed across a MPI process grid (3 x 2 processes in this example). The values of the array are generated within the process which that chunk of the array is distributed to, and I want to gather all of those chunks together at the root process to display them.
So far, I have the code below. This generates a cartesian communicator, finds out the co-ordinates of the MPI process and works out how much of the array it should get based on that (as the array size need not be a multiple of the cartesian grid size). I then create a new MPI derived datatype which will send the whole of that process's subarray as one item (that is, the stride, blocklength and count are different for each process, as each process has a differently sized array). However, when I come to gather the data together with MPI_Gather, I get a segmentation fault.
I think this is because I shouldn't be using the same datatype for sending and receiving in the MPI_Gather call. The data type is fine for sending the data, as it has the right count, stride and blocklength, but when it gets to the other end it'll need a very different derived datatype. I'm not sure how to calculate the parameters for this datatype - does anyone have any ideas?
Also, if I'm approaching this from completely the wrong angle then please let me know!
#include<stdio.h>
#include<array_alloc.h>
#include<math.h>
#include<mpi.h>
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
int A = 2;
int B = 3;
MPI_Comm cart_comm;
MPI_Datatype block_type;
int coords[2];
float **array;
float **whole_array;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int x_start, x_finish;
int y_start, y_finish;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
/* Create the cartesian communicator */
dim_size[0] = B;
dim_size[1] = A;
periods[0] = 1;
periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_size, periods, 1, &cart_comm);
/* Get our co-ordinates within that communicator */
MPI_Cart_coords(cart_comm, rank, 2, coords);
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (coords[0] == (B - 1))
{
/* We're at the far end of a row */
cols_per_core = n - (cols_per_core * (B - 1));
}
if (coords[1] == (A - 1))
{
/* We're at the bottom of a col */
rows_per_core = n - (rows_per_core * (A - 1));
}
printf("X: %d, Y: %d, RpC: %d, CpC: %d\n", coords[0], coords[1], rows_per_core, cols_per_core);
MPI_Type_vector(rows_per_core, cols_per_core, cols_per_core + 1, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
array = alloc_2d_float(rows_per_core, cols_per_core);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
return 1;
}
for (j = 0; j < rows_per_core; j++)
{
for (i = 0; i < cols_per_core; i++)
{
array[j][i] = (float) (i + 1);
}
}
MPI_Barrier(MPI_COMM_WORLD);
MPI_Gather(array, 1, block_type, whole_array, 1, block_type, 0, MPI_COMM_WORLD);
/*
if (rank == 0)
{
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
*/
/* Close down the MPI environment */
MPI_Finalize();
}
The 2D array allocation routine I have used above is implemented as:
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
This is a tricky one. You're on the right track, and yes, you will need different types for sending and receiving.
The sending part is easy -- if you're sending the whole subarray array, then you don't even need the vector type; you can send the entire (rows_per_core)*(cols_per_core) contiguous floats starting at &(array[0][0]) (or array[0], if you prefer).
It's the receiving that's the tricky part, as you've gathered. Let's start with the simplest case -- assuming that everything divides evenly so all the blocks have the same size. Then you can use the very helpful MPI_Type_create_subarray (you could always cobble this together with vector types, but for higher-dimensional arrays this becomes tedious, as you need to create one intermediate type for each dimension of the array except the last).
Also, rather than hardcoding the decomposition, you can use the also-helpful MPI_Dims_create to create an as-square-as-possible decomposition of your ranks. Note that this doesn't necessarily have anything to do with MPI_Cart_create, although you can use it for the requested dimensions. I'm going to skip the cart_create stuff here, not because it's not useful, but because I want to focus on the gather stuff.
So if everyone has the same size of array, then root is receiving the same data type from everyone, and one can use a very simple subarray type to get their data:
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
where sub_array_size[] = {rows_per_core, cols_per_core}, whole_array_size[] = {n,n}, and, for now, starts[]={0,0} -- i.e., we'll just assume that every block starts at the origin of the array.
The reason for this is that we can then use Gatherv to explicitly set the displacements into the array:
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
So now everyone sends their data in one chunk, and it's received with that type into the right part of the array. For this to work, I've resized the type so that its extent is just one float, so the displacements can be calculated in that unit:
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
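To see what those displacements mean, here is a small sketch in plain C, with no MPI involved, that computes the same disps values and imitates what the receive side does with them: a contiguous block of rows x cols floats is laid down in the n x n array starting at the displacement, advancing by n floats per row. block_disp and place_block are hypothetical names, and the rank-to-(row, col) mapping is the one from the loop above.

```c
#include <assert.h>

/* Displacement, in floats, of a rank's block inside the n x n
 * row-major array (same formula as the disps[] loop above). */
int block_disp(int rank, int A, int rows_per_core, int cols_per_core, int n) {
    assert(A > 0 && rank >= 0);
    int row = rank % A;
    int col = rank / A;
    return col * cols_per_core + row * rows_per_core * n;
}

/* Imitates the receive with the resized subarray type: copies a
 * contiguous rows x cols block into 'whole' starting at 'disp',
 * stepping n floats between consecutive rows. */
void place_block(float *whole, int n, const float *block,
                 int rows, int cols, int disp) {
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            whole[disp + r * n + c] = block[r * cols + c];
}
```

For a 4 x 4 array split over a 2 x 2 grid, ranks 0 to 3 land at displacements 0, 8, 2 and 10, i.e. top-left, bottom-left, top-right and bottom-right.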
The whole code is below:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<mpi.h>
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
void free_2d_float( float **array ) {
if (array != NULL) {
free(array[0]);
free(array);
}
return;
}
void init_array2d(float **array, int ndim1, int ndim2, float data) {
for (int i=0; i<ndim1; i++)
for (int j=0; j<ndim2; j++)
array[i][j] = data;
return;
}
void print_array2d(float **array, int ndim1, int ndim2) {
for (int i=0; i<ndim1; i++) {
for (int j=0; j<ndim2; j++) {
printf("%6.2f ", array[i][j]);
}
printf("\n");
}
return;
}
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
MPI_Datatype block_type, resized_type;
float **array;
float **whole_array;
float *recvptr;
int *counts, *disps;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int whole_array_size[2];
int sub_array_size[2];
int starts[2];
int A, B;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
recvptr = &(whole_array[0][0]);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
print_array2d(whole_array, n, n);
puts("\n\n");
}
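/* Choose the dimensions of the process grid. The entries must be
 * zeroed first so that MPI_Dims_create is free to pick them. */
dim_size[0] = 0;
dim_size[1] = 0;
MPI_Dims_create(size, 2, dim_size);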
A = dim_size[1];
B = dim_size[0];
periods[0] = 1;
periods[1] = 1;
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (rows_per_core*A != n) {
if (rank == 0) fprintf(stderr,"Aborting: rows %d don't divide by %d evenly\n", n, A);
MPI_Abort(MPI_COMM_WORLD,1);
}
if (cols_per_core*B != n) {
if (rank == 0) fprintf(stderr,"Aborting: cols %d don't divide by %d evenly\n", n, B);
MPI_Abort(MPI_COMM_WORLD,2);
}
array = alloc_2d_float(rows_per_core, cols_per_core);
printf("%d, RpC: %d, CpC: %d\n", rank, rows_per_core, cols_per_core);
whole_array_size[0] = n;
sub_array_size [0] = rows_per_core;
whole_array_size[1] = n;
sub_array_size [1] = cols_per_core;
starts[0] = 0; starts[1] = 0;
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
MPI_Abort(MPI_COMM_WORLD,3);
}
init_array2d(array,rows_per_core,cols_per_core,(float)rank);
counts = (int *)malloc(size * sizeof(int));
disps = (int *)malloc(size * sizeof(int));
/* note -- we're just using MPI_COMM_WORLD rank here to
* determine location, not the cart_comm for now... */
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
free_2d_float(array);
free(counts);
free(disps);
if (rank == 0) print_array2d(whole_array, n, n);
if (rank == 0) free_2d_float(whole_array);
MPI_Finalize();
}
Minor thing -- you don't need the barrier before the gather. In fact, you hardly ever really need a barrier; they're expensive operations for a few reasons, and they can hide problems. My rule of thumb is to never, ever, use barriers unless you know exactly why the rule needs to be broken in this case. Here in particular, the collective gather routine already provides all the synchronization this code needs, so just use that.
Now, moving onto the harder stuff. If things don't divide evenly, you have a few options. The simplest, though not necessarily the best, is just to pad the array so that it does divide evenly, even if just for this operation.
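For the padding route, the padded size is just the dimension rounded up to the next multiple of the number of processes along it. A minimal sketch, with made-up helper names and assuming `n >= 0` and `parts > 0`:

```c
/* Hypothetical helper: smallest multiple of `parts` that is >= n.
 * Assumes n >= 0 and parts > 0. */
int padded_extent(int n, int parts) {
    return ((n + parts - 1) / parts) * parts;
}

/* Number of rows (or columns) each process holds once the array is padded. */
int padded_block(int n, int parts) {
    return padded_extent(n, parts) / parts;
}
```

For example, n = 10 split three ways pads to 12, so each process holds a block of 4 and the last two rows (or columns) are filler.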
If you can arrange it so that the number of columns does divide evenly, even if the number of rows doesn't, then you can still use Gatherv: create a vector type for each part of a row, and gather the appropriate number of those from each processor. That would work fine.
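The "appropriate number of rows" has to be worked out per process row. One common convention (an assumption here, not something the code above prescribes) gives the first `n % A` process rows one extra row each. The helper names below are hypothetical:

```c
/* Hypothetical helpers for splitting n rows over A process rows when
 * A does not divide n. Both assume 0 <= p < A. */

/* Number of rows owned by process row p. */
int rows_owned(int p, int n, int A) {
    return n / A + (p < n % A ? 1 : 0);
}

/* Index of the first row owned by process row p. */
int first_row(int p, int n, int A) {
    int extra = n % A;
    return p * (n / A) + (p < extra ? p : extra);
}
```

With n = 10 and A = 3 the process rows own 4, 3 and 3 rows, starting at rows 0, 4 and 7. Values like these would feed the counts and displacements passed to Gatherv.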
If you definitely have the case where neither can be counted on to divide, and you can't pad data for sending, then there are three sub-options I can see:
As susterpatt suggests, do point-to-point. For small numbers of tasks, this is fine, but as it gets larger, this will be significantly less efficient than the collective operations.
Create a communicator consisting of all the processors not on the outer edges, and use exactly the code above to gather their data; and then point-to-point the edge tasks' data.
Don't gather to process 0 at all; use the Distributed array type to describe the layout of the array, and use MPI-IO to write all the data to a file; once that's done, you can have process zero display the data in some way if you like.
It looks like the first argument to your MPI_Gather call should probably be array[0], and not array.
Also, if you need to get different amounts of data from each rank, you might be better off using MPI_Gatherv.
Finally, note that gathering all your data in one place to do output is not scalable in many circumstances. As the amount of data grows, eventually, it will exceed the memory available to rank 0. You might be much better off distributing the output work (if you are writing to a file, using MPI IO or other library calls) or doing point-to-point sends to rank 0 one at a time, to limit the total memory consumption.
On the other hand, I would not recommend coordinating each of your ranks printing to standard output, one after another, because some major MPI implementations don't guarantee that standard output will be produced in order. Cray's MPI, in particular, jumbles up standard output pretty thoroughly if multiple ranks print.
According to this (emphasis mine):
The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Distinct type maps between sender and receiver are still allowed.
Sounds to me like you have two options:
1. Pad smaller submatrices so that all processes send the same amount of data, then crop the matrix back to its original size after the Gather. If you're feeling adventurous, you might try defining the receiving typemap so that paddings are automatically overwritten during the Gather operation, thus eliminating the need for the crop afterwards. This could get a bit complicated though.
2. Fall back to point-to-point communication. Much more straightforward, but possibly higher communication costs.
Personally, I'd go with option 2.
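If you do try the padding route, the crop after the Gather is a row-by-row copy. A minimal sketch in plain C, assuming the padding sits only at the right and bottom edges of a row-major matrix; `crop_padded` is a made-up name:

```c
#include <string.h>

/* Hypothetical helper: copy the top-left rows x cols corner of a padded,
 * row-major matrix (padded_cols floats per row) into a tightly packed
 * rows x cols matrix. Assumes cols <= padded_cols and that the two
 * buffers do not overlap. */
void crop_padded(const float *padded, int padded_cols,
                 float *out, int rows, int cols) {
    for (int r = 0; r < rows; r++)
        memcpy(out + (size_t)r * cols,
               padded + (size_t)r * padded_cols,
               (size_t)cols * sizeof(float));
}
```

Only rank 0 needs to run this, on the buffer it received, so the extra cost is one pass over the gathered data.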