FFTW+MPI, wrong results with the fftw_mpi_plan_dft_r2c_2d planner - c

I am trying to compute a Fourier transform with the planner fftw_mpi_plan_dft_r2c_2d of FFTW 3.3. Unfortunately, I cannot make it work. The result is correct when N0 is equal to the number of MPI processes (nb_proc) but is wrong when N0 != nb_proc.
An example showing my problem:
#include <stdio.h>
#include <complex.h>
#include <fftw3-mpi.h>
int main(int argc, char **argv)
{
/* if N0 (=ny) is equal to nb_proc, results are OK */
/* if N0 is not equal to nb_proc => bug */
const ptrdiff_t N0 = 4, N1 = 4;
int coef_norm = N0*N1;
fftw_plan plan_forward;
double *carrayX;
fftw_complex *carrayK;
ptrdiff_t n_alloc_local, i, j;
ptrdiff_t nX0loc, iX0loc_start, nK0loc, nK1loc;
/* X and K denote physical and Fourier spaces. */
int rank, nb_proc, irank;
MPI_Init(&argc, &argv);
fftw_mpi_init();
/*DETERMINE RANK OF THIS PROCESSOR*/
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/*DETERMINE TOTAL NUMBER OF PROCESSORS*/
MPI_Comm_size(MPI_COMM_WORLD, &nb_proc);
if (rank==0) printf("program test_fftw3_2Dmpi_simple\n");
printf("I'm rank (processor number) %i of size %i\n", rank, nb_proc);
n_alloc_local = fftw_mpi_local_size_2d(N0, N1/2+1, MPI_COMM_WORLD,
&nX0loc, &iX0loc_start);
carrayX = fftw_alloc_real(2 * n_alloc_local);
carrayK = fftw_alloc_complex(n_alloc_local);
/* create plan for out-of-place r2c DFT */
plan_forward = fftw_mpi_plan_dft_r2c_2d(N0, N1,
carrayX, carrayK,
MPI_COMM_WORLD,
FFTW_MEASURE);
nK0loc = nX0loc;
nK1loc = N1/2+1;
/* initialize carrayX to a constant */
for (i = 0; i < nX0loc; ++i) for (j = 0; j < N1; ++j)
carrayX[i*N1 + j] = 1.;
/* compute forward transform and normalize */
fftw_execute(plan_forward);
for (i = 0; i < nK0loc; ++i) for (j = 0; j < nK1loc; ++j)
carrayK[i*nK1loc + j] = carrayK[i*nK1loc + j]/coef_norm;
/* print carrayK; only the first value on rank 0 should be 1, the rest 0 */
for (irank = 0; irank<nb_proc; irank++)
{
MPI_Barrier(MPI_COMM_WORLD);
if (rank == irank)
{
for (i = 0; i < nK0loc; ++i) for (j = 0; j < nK1loc; ++j)
{
printf("rank = %i, carrayK[%ti*nK1loc + %ti] = (%6.4f, %6.4f)\n",
rank, i, j,
creal(carrayK[i*nK1loc + j]),
cimag(carrayK[i*nK1loc + j]));
}
printf("\n");
}
}
MPI_Barrier(MPI_COMM_WORLD);
fftw_destroy_plan(plan_forward);
MPI_Finalize();
}
There is something wrong in this example but I don't understand what.
For this case (N0 = 4, N1 = 4), the results are correct with
mpirun -np 4 ./test_fftw3_2Dmpi_simple
but not with
mpirun -np 2 ./test_fftw3_2Dmpi_simple
PS: the same thing happens with the flag FFTW_MPI_TRANSPOSED_OUT.
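A note on the likely cause (hedged; based on the FFTW MPI documentation rather than on a confirmed answer): in the MPI interface the real array of an r2c transform is stored with its last dimension padded to 2*(N1/2+1) elements, even for out-of-place transforms, which is why carrayX is allocated with 2 * n_alloc_local reals. The initialization loop above uses a row stride of N1 instead of the padded stride, which only happens to be harmless when each rank owns a single row (nX0loc == 1, i.e. N0 == nb_proc). A minimal sketch of the corrected loop, reusing the variables from the code above:
/* row stride of the padded real array in the FFTW MPI r2c layout */
ptrdiff_t stride = 2 * (N1/2 + 1);
/* initialize carrayX to a constant, skipping the padding elements */
for (i = 0; i < nX0loc; ++i)
    for (j = 0; j < N1; ++j)
        carrayX[i*stride + j] = 1.;
With that layout, the normalized carrayK should be 1 in its first element on rank 0 and 0 everywhere else, independent of the number of processes.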

Related

What causes MPI_Gatherv() to segfault?

I perform a computation on a subset of parallel processes, but when I join the results in the master process with the command MPI_Gatherv(a_per_process, mylen_per_process, MPI_LONG_DOUBLE, a, recvcounts, displs, MPI_LONG_DOUBLE, 0, MPI_COMM_WORLD);, I get a segmentation fault.
/*****************************************************************************
* DESCRIPTION:
* This program increments every element of the array by two.
* It extracts the average execution time for different numbers of threads;
* we do this in order to compare the performance of each routine.
* Compile:
* $mpicc mpic.c -o mpic -fopenmp -lm -Ofast
* Run:
* $mpirun -np <maxthreads> ./mpic
******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h> /* for omp_get_wtime() */
int main(int argc, char *argv[])
{
int j = 0;
long long int len_per_process = 0;
long long int remainder = 0;
long long int mylen_per_process = 0;
int size = 0;
int rank = 0;
int *recvcounts, *displs;
long double *a, *a_per_process;
double start_comp = 0;
double start_comm = 0;
double end_comp = 0;
double end_comm = 0;
double maxtime_comp = 0;
double maxtime_comm = 0;
int i = 0;
long nSamples = 10;
long long int length = 1.0;
int maxthreads = 0;
int testnumber = 0;
long long int minlength = 1;
long long int maxlength = 1;
int cycles = 0;
long longlength = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/*Whole array allocation in master process*/
if (rank == 0)
{
a = (long double *)malloc(length * sizeof(long double));
}
for (length = minlength; length <= maxlength; length = length * 10)
{
for (i = 1; i <= size; i = i * 2)
{
/*Data distribution to processes*/
len_per_process = length / i;
remainder = length % i;
mylen_per_process = (rank < remainder) ? (len_per_process + 1) : (len_per_process);
recvcounts = (int *)malloc(size * sizeof(int));
displs = (int *)malloc(size * sizeof(int));
MPI_Allgather(&mylen_per_process, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);
displs[0] = 0;
for (j = 1; j < size; j++)
{
displs[j] = displs[j - 1] + recvcounts[j - 1];
}
/*Sub-Arrays Allocation and Initialisation at each process*/
a_per_process = (long double *)malloc(mylen_per_process * sizeof(long double));
for (j = 0; j < mylen_per_process; j++)
{
a_per_process[j] = 0.0;
}
if (rank <= i)
{
/*Increment elements by 2*/
start_comp = omp_get_wtime();
for (j = 0; j < nSamples; j++)
{
for (int k = 0; k < mylen_per_process; k++)
{
a_per_process[k] = a_per_process[k] + 2.0;
}
}
end_comp = omp_get_wtime() - start_comp;
start_comm = omp_get_wtime();
end_comm = omp_get_wtime() - start_comm;
}
// The following line causes a segfault:
MPI_Gatherv(a_per_process, mylen_per_process, MPI_LONG_DOUBLE, a, recvcounts, displs, MPI_LONG_DOUBLE, 0, MPI_COMM_WORLD);
// Get the maximum computation and communication time
MPI_Reduce(&end_comp, &maxtime_comp, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&end_comm, &maxtime_comm, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
free(a_per_process);
free(recvcounts);
free(displs);
}
}
if (rank == 0)
{
free(a);
}
MPI_Finalize();
return 0;
}
I tried both double and long double for my variables a and a_per_process, i.e. MPI_LONG_DOUBLE and MPI_DOUBLE in the MPI_Gatherv call. The code runs when I comment that line out, meaning it doesn't abort or segfault.
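No answer is attached here, but one thing worth checking (a hedged guess from MPI_Gatherv's requirements, not a confirmed diagnosis): the root's receive buffer must be able to hold the sum of the recvcounts[] elements, and in the code above a is allocated before the loops while length is still 1, so the gather can write past the end of it. (The MPI_Allgather of mylen_per_process, a long long int, with type MPI_INT is also worth making consistent.) A minimal sketch that sizes the buffer from the gathered counts, reusing the variables above:
/* MPI_Gatherv needs the root buffer to hold sum(recvcounts) elements */
long long int total = 0;
for (j = 0; j < size; j++)
    total += recvcounts[j];
if (rank == 0)
{
    free(a);
    a = (long double *)malloc(total * sizeof(long double));
}
MPI_Gatherv(a_per_process, (int)mylen_per_process, MPI_LONG_DOUBLE,
            a, recvcounts, displs, MPI_LONG_DOUBLE, 0, MPI_COMM_WORLD);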

Hybrid approach with OpenMP and MPI does not use the same number of threads in a cluster with different numbers of hosts

I'm testing a hybrid approach by parallelizing the friendly-numbers program (CAPBenchmark) with MPI and OpenMP.
My cluster has 8 machines, each with a 4-core processor.
The code:
/*
* Copyright(C) 2014 Pedro H. Penna <pedrohenriquepenna#gmail.com>
*
* friendly-numbers.c - Friendly numbers kernel.
*/
#include <global.h>
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <util.h>
#include "fn.h"
/*
* Computes the Greatest Common Divisor of two numbers.
*/
static int gcd(int a, int b)
{
int c;
/* Compute greatest common divisor. */
while (a != 0)
{
c = a;
a = b%a;
b = c;
}
return (b);
}
/*
* Sum of divisors.
*/
static int sumdiv(int n)
{
int sum; /* Sum of divisors. */
int factor; /* Working factor. */
sum = 1 + n;
/* Compute sum of divisors. */
for (factor = 2; factor < n; factor++)
{
/* Divisor found. */
if ((n%factor) == 0)
sum += factor;
}
return (sum);
}
/*
* Computes friendly numbers.
*/
int friendly_numbers(int start, int end)
{
int n; /* Divisor. */
int *num; /* Numerator. */
int *den; /* Denominator. */
int *totalnum;
int *totalden;
int rcv_friends;
int range; /* Range of numbers. */
int i, j; /* Loop indexes. */
int nfriends; /* Number of friendly numbers. */
int slice;
range = end - start + 1;
slice = range / nthreads;
if (rank == 0) {
num = smalloc(sizeof(int)*range);
den = smalloc(sizeof(int)*range);
totalnum = smalloc(sizeof(int)*range);
totalden = smalloc(sizeof(int)*range);
} else {
num = smalloc(sizeof(int) * slice);
den = smalloc(sizeof(int) * slice);
totalnum = smalloc(sizeof(int)*range);
totalden = smalloc(sizeof(int)*range);
}
j = 0;
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j, n) default(shared)
for (i = start + rank * slice; i < start + (rank + 1) * slice; i++) {
j = i - (start + rank * slice);
num[j] = sumdiv(i);
den[j] = i;
n = gcd(num[j], den[j]);
num[j] /= n;
den[j] /= n;
}
if (rank != 0) {
MPI_Send(num, slice, MPI_INT, 0, 0, MPI_COMM_WORLD);
MPI_Send(den, slice, MPI_INT, 0, 1, MPI_COMM_WORLD);
} else {
for (i = 1; i < nthreads; i++) {
MPI_Recv(num + (i * (slice)), slice, MPI_INT, i, 0, MPI_COMM_WORLD, 0);
MPI_Recv(den + (i * (slice)), slice, MPI_INT, i, 1, MPI_COMM_WORLD, 0);
}
}
if (rank == 0) {
for (i = 1; i < nthreads; i++) {
MPI_Send(num, range, MPI_INT, i, 2, MPI_COMM_WORLD);
MPI_Send(den, range, MPI_INT, i, 3, MPI_COMM_WORLD);
}
} else {
MPI_Recv(totalnum, range, MPI_INT, 0, 2, MPI_COMM_WORLD,0);
MPI_Recv(totalden, range, MPI_INT, 0, 3, MPI_COMM_WORLD,0);
}
/* Check friendly numbers. */
nfriends = 0;
if (rank == 0) {
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j) default(shared) reduction(+:nfriends)
for (i = rank; i < range; i += nthreads) {
for (j = 0; j < i; j++) {
/* Friends. */
if ((num[i] == num[j]) && (den[i] == den[j]))
nfriends++;
}
}
} else {
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j) default(shared) reduction(+:nfriends)
for (i = rank; i < range; i += nthreads) {
for (j = 0; j < i; j++) {
/* Friends. */
if ((totalnum[i] == totalnum[j]) && (totalden[i] == totalden[j]))
nfriends++;
}
}
}
if (rank == 0) {
for (i = 1; i < nthreads; i++) {
MPI_Recv(&rcv_friends, 1, MPI_INT, i, 4, MPI_COMM_WORLD, 0);
nfriends += rcv_friends;
}
} else {
MPI_Send(&nfriends, 1, MPI_INT, 0, 4, MPI_COMM_WORLD);
}
free(num);
free(den);
return (nfriends);
}
During the executions I observed the following behavior:
When I run mpirun with 4 or 8 hosts, each host uses 4 threads for processing, as expected.
However, when running with only 2 hosts, only 1 thread is used on each machine.
What could cause this behavior? Is there any way to "force" the use of the 4 threads in the 2-host case?
I assume you are using Open MPI.
The default binding policy is to bind to socket or NUMA domain (depending on your version). I assume your nodes are single socket, which means one MPI task is bound to 4 cores, and the OpenMP runtime will then likely start 4 OpenMP threads.
A special case is when you start only 2 MPI tasks. In this case, the binding policy is to bind to core, which means one MPI task is bound to only one core, and hence the OpenMP runtime starts only one OpenMP thread.
In order to achieve the desired behavior, you can
mpirun --bind-to numa -np 2 ...
If that fails, you can fall back to
mpirun --bind-to socket -np 2 ...
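To confirm what each rank actually gets, a small standalone check (my own sketch, not part of the benchmark) can print the rank, the host name, and omp_get_max_threads(); with OMP_NUM_THREADS unset, most OpenMP runtimes size the default team from the cores the MPI task is bound to:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    /* default team size usually follows the task's core binding */
    printf("rank %d on %s: omp_get_max_threads() = %d\n",
           rank, host, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
Compile it with mpicc -fopenmp and launch it with the same mpirun options as the benchmark to see the effect of --bind-to.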

Using MPI_Sendrecv for Conway's Game of Life, but the program can't exchange data at the borders

I'm trying to write an implementation of Conway's Game of Life; the pattern is Rabbits. I used a 2D Cartesian topology to build the process group, and the communication between processes uses MPI_Sendrecv. But this code doesn't work: it just hangs there without responding when I run it. It has taken me a long time to find the problem, but I have made no progress. Could you please help me figure it out? I would be very grateful!
#include <stdio.h>
#include "mpi.h"
#include <math.h>
#include <stdlib.h>
#define array 20
#define arrayhalf (array/2)
main(int argc, char *argv[])
{
int ndims = 2, ierr;
int p, my_rank, my_cart_rank;
MPI_Comm comm2d;
MPI_Datatype newtype;
int dims[ndims], coord[ndims];
int wrap_around[ndims];
int reorder, nrows, ncols;
int x[arrayhalf+2][arrayhalf+2], x2[arrayhalf+2][arrayhalf+2], x_rev[array+4][array+4];
int left, right, down, top;
MPI_Status status;
int tag_up = 20, tag_down =21, tag_left = 22, tag_right = 23;
long start, stop;
/*** start up initial MPI environment ***/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* hardcode 2 processes in each dimension */
nrows = ncols = (int) sqrt(p);
dims[0] = dims[1] = 2;
/* create cartesian topology for processes */
MPI_Dims_create(p, ndims, dims);
/*if (my_rank == 0)
printf("PW[%d]/[%d%]: PEdims = [%d x %d] \n", my_rank, p, dims[0], dims[1]);*/
/* create cartesian mapping, and check it is created either correct or wrong */
wrap_around[0] = wrap_around[1] = 1; /*set periodicity to be true */
reorder = 0;
ierr = 0;
ierr = MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, wrap_around, reorder, &comm2d);
if (ierr != 0)
printf("ERROR[%d] creating CART\n", ierr);
MPI_Type_vector( arrayhalf, 1, arrayhalf+2, MPI_INT, &newtype);
MPI_Type_commit( &newtype );
/* get the neighbour processes, which is needed for the halo exchange */
int SHIFT_ROW = 0;
int SHIFT_COL = 1;
int DISP = 1;
/*** load pattern ***/
/* initialize the data array */
int i, j ;
for (i = 0; i < arrayhalf + 2 ; i++)
for (j = 0; j < arrayhalf + 2; j++)
{
x[i][j] = 0;
x2[i][j] = 0;
}
if (my_rank == 0)
{
int r,c;
r = arrayhalf / 2;
c = arrayhalf / 2;
/* rabbits pattern
1 1 1 1
1 1 1 1
1
*/
x[r][c] = 1;
x[r][c+4] = 1;
x[r][c+5] = 1;
x[r][c+6] = 1;
x[r+1][c] = 1;
x[r+1][c+1] = 1;
x[r+1][c+2] = 1;
x[r+1][c+5] = 1;
x[r+2][c+1] = 1;
}
/*** calculate the next generation ***/
int row, col;
int steps;
steps = atoi(argv[1]); /* get the generation number from command line */
start = MPI_Wtime();
int soc;
int destination;
for (i = 1; i <= steps; i++)
{
/*** halo exchange for the boundary elements ***/
int * send_buffer = (int*) malloc((arrayhalf)*sizeof(int));
int * recv_buffer = (int*) malloc((arrayhalf)*sizeof(int));
/*int * send_buffer = (int *) calloc(arrayhalf,sizeof(int));
int * recv_buffer = (int *) calloc(arrayhalf,sizeof(int));
*/
/* to up */
MPI_Cart_shift(comm2d, 1, 1, &soc,&destination);
MPI_Sendrecv( &x[1][1], arrayhalf, MPI_INT, destination, tag_up,& x[arrayhalf + 1][1], arrayhalf, MPI_INT, soc, tag_up, comm2d, &status );
/* to down */
MPI_Cart_shift(comm2d, 1, 1, &destination,&soc);
MPI_Sendrecv( &x[arrayhalf][1], arrayhalf, MPI_INT, destination, tag_down,& x[0][1], arrayhalf, MPI_INT, soc, tag_down, comm2d, &status);
/* to left */
MPI_Cart_shift(comm2d, 0, 1, &destination,&soc);
MPI_Sendrecv( &x[1][1], 1,newtype, destination, tag_left,& x[1][arrayhalf+1], 1, newtype, soc, tag_left, comm2d, &status );
/*for (j=0;j<arrayhalf;j++) {
send_buffer[j]=x[j+1][1];
}
MPI_Sendrecv( send_buffer, arrayhalf,MPI_INT, destination, tag_left,recv_buffer, arrayhalf, MPI_INT, soc, tag_left, comm2d, &status );
for (j=0;j<arrayhalf;j++) {
x[j+1][arrayhalf+1]=recv_buffer[j];
}
*/
/* to right */
MPI_Cart_shift(comm2d, 0, 1, &soc,&destination);
MPI_Sendrecv( &x[1][arrayhalf], 1, newtype, destination, tag_right, &x[1][0], 1, newtype, soc, tag_right, comm2d, &status );
/*for (j=0;j<arrayhalf;j++) {
send_buffer[j]=x[j+1][arrayhalf];
}
MPI_Sendrecv( send_buffer, arrayhalf,MPI_INT, destination, tag_right,recv_buffer, arrayhalf, MPI_INT, soc, tag_right, comm2d, &status );
for (j=0;j<arrayhalf;j++) {
x[j+1][1]=recv_buffer[j];
}
*/
/*** sum the neighbour values and get the next generation ***/
for (row = 1; row < arrayhalf; row++)
{
for (col = 1; col < arrayhalf; col++)
{
int neighbor;
neighbor = x[row - 1][col - 1] + x[row - 1][col] + x[row - 1][col + 1] + x[row][col - 1] +
x[row][col + 1] +
x[row + 1][col - 1] + x[row + 1][col] + x[row + 1][col + 1];
if (neighbor == 3)
{
x2[row][col] = 1;
}
else if (x[row][col] == 1 && neighbor == 2)
{
x2[row][col] = 1;
}
else
{
x2[row][col] = 0;
}
}
}
/* used to be swap */
for (row = 1; row < arrayhalf; row++)
{
for (col = 1; col < arrayhalf; col++)
{
x[row][col] = x2[row][col];
}
}
free(send_buffer);
free(recv_buffer);
}
/*** print the final generation ***/
int population = 0;
int* A;
int process_num = dims[0]*dims[1];
int row_indx;
int col_indx;
int k;
if(my_rank == 0)
{
A = (int*) malloc((arrayhalf+2)*(arrayhalf+2)*sizeof(int));
for (k= 1; k< process_num; k++)
{
MPI_Recv(A,(arrayhalf+2)*(arrayhalf+2), MPI_INT,k, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for (i = 0; i<arrayhalf+2; i++)
{
for (j = 0; j<arrayhalf+2; j++)
{
row_indx = (k%dims[1])*(arrayhalf+2)+i;
col_indx = (k/dims[0]*(arrayhalf+2))+j;
x_rev[row_indx][col_indx] = A[i*(arrayhalf+2)+j];
}
}
}
for (i = 0; i<arrayhalf+2; i++)
{
for (j = 0; j<arrayhalf+2; j++)
{
x_rev[i][j] = x[i][j];
}
}
for (row = 0; row < array+4; row++) {
for (col = 0; col < array+4; col++)
{
printf("%2d",x_rev[row][col]);
if(x_rev[row][col]==1)
{
population = population + 1;
}
}
printf("\n");
}
stop = MPI_Wtime();
printf("Running Time: %f\n ",stop-start);
printf("Population: %d\n",population);
printf("Generation: %d\n",steps);
}
else{
A = (int*) malloc((array+4)*(array+4)*sizeof(int));
for (i=0; i< arrayhalf +2; i++)
{
for(j = 0; j<arrayhalf+2; j++)
{
A[i*(arrayhalf+2)+j] = x[i][j];
}
}
MPI_Send(A,(arrayhalf+2)*(arrayhalf+2),MPI_INT,0,0,MPI_COMM_WORLD);
}
MPI_Comm_free( &comm2d );
MPI_Type_free( &newtype );
free(A);
MPI_Finalize();
}
I think I found the error.
It is in line 176.
Rank 0 is trying to receive a message from rank 0, but rank 0 is not sending a message to itself. You should start the loop from 1, not 0.
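In other words, the final gather should follow the usual pattern where the root loops over every other rank exactly once and each non-root rank sends exactly once. A sketch of that pattern (A and block_size are placeholder names standing in for the buffers in the question):
/* rank 0 collects one block from every other rank, never from itself */
if (my_rank == 0) {
    for (k = 1; k < process_num; k++)   /* start at 1, not 0 */
        MPI_Recv(A, block_size, MPI_INT, k, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
} else {
    MPI_Send(A, block_size, MPI_INT, 0, 0, MPI_COMM_WORLD);
}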

MPI_Scatter and MPI_Reduce

I'm trying to find the max of randomly generated numbers. Any thoughts on this...
I am using MPI_Scatter to split the randomly generated numbers equally among the processes and MPI_Reduce to get the maximum from each process.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <mpi.h>
#define atmost 1000
int find(int* partial_max, int from, int to){
int i, max;
printf("%d----%d\n", from, to);
max = partial_max[from];
for (i = from + 1; i <= to; i++)
if (partial_max[i] > max)
max = partial_max[i];
return max;
}
int main(){
int i, j,n, comm_sz, biggest, b, my_rank, q,result;
//1. Declare array of size 1000
int a[atmost];
//2. generate random integer of 0 to 999
srand((unsigned)time(NULL));
n = rand() % atmost;
//n = 10;
for (i = 0; i <= n; i++){
a[i] = rand() % atmost;
printf("My Numbers: %d\n", a[i]);
//a[i] = i;
}
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
//j is the size we will split each segment into
j = (n / (comm_sz-1));
int partial_max[j];
int receive_vector[j];
//Send random numbers equally to each process
MPI_Scatter(a, j, MPI_INT, receive_vector,
j, MPI_INT, 0, MPI_COMM_WORLD);
int localmax;
localmax = -1;
for (i = 0; i <= comm_sz-1; i++)
if (receive_vector[i] > localmax)
localmax = receive_vector[i];
// Get Max from each process
//MPI_Reduce(receive_vector, partial_max, j, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&localmax, &result, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0)
{
/*
biggest = -1;
for (i = 0; i < comm_sz - 1; i++){
if (i == comm_sz - 2)
b = find(partial_max, i * j, n - 1);
else
b = find(partial_max, i * j, (i + 1) * j - 1);
if (b > biggest)
biggest = b;
}*/
printf("-------------------\n");
printf("The biggest is: %d\n", result);
printf("The n is: %d\n", n);
}
MPI_Finalize();
return 0;
}
You have a few bugs there:
You select (a different value of) n in each process. It is better to select it within rank 0 and bcast it to the rest of the processes.
When calculating j you divide by comm_sz-1 instead of comm_sz.
You assume n is divisible by comm_sz and that each process receives exactly the same amount of numbers to process.
You loop with i going up to comm_sz-1 instead of going up to j.
This is what I could find at a quick glance.
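A minimal sketch that folds the points above together (my own variable names; padding with -1 assumes the generated values are non-negative, which they are here since they come from rand() % atmost):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define atmost 1000
int main(void)
{
    int i, n = 0, comm_sz, my_rank, result, localmax = -1;
    int a[2 * atmost];                       /* room for the padding */
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {                      /* pick n and the data once, on rank 0 */
        srand((unsigned)time(NULL));
        n = 1 + rand() % atmost;
        for (i = 0; i < n; i++)
            a[i] = rand() % atmost;
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* same n everywhere */
    int j = (n + comm_sz - 1) / comm_sz;     /* divide by comm_sz, round up */
    if (my_rank == 0)
        for (i = n; i < j * comm_sz; i++)
            a[i] = -1;                       /* pad so every rank gets j items */
    int receive_vector[j];
    MPI_Scatter(a, j, MPI_INT, receive_vector, j, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < j; i++)                  /* loop up to j, not comm_sz */
        if (receive_vector[i] > localmax)
            localmax = receive_vector[i];
    MPI_Reduce(&localmax, &result, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("The biggest is: %d (n = %d)\n", result, n);
    MPI_Finalize();
    return 0;
}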

MPI on C, Segmentation fault: 11

I have Mac OS X Yosemite 10.10.1 (14B25).
I have some problems with compiling the code. Here it is:
#include <stdio.h>
#include <mpi.h>
#define n 3
#define repeats 1
double abs(double item)
{
return (item > 0) ? item : -item;
}
int swap_raws (double **a, int p, int q)
{
if (p >= 0 && p < n && q >= 0 && q < n)
{
if (p == q)
return 0;
for (int i = 0; i < n; i++)
{
double temp = a[p][i];
a[p][i] = a[q][i];
a[q][i] = temp;
}
return 0;
}
else
return -1;
}
double f_column (int rank, int size, double *least)
{
double t1, t2, tbeg, tend, each_least = 1, least0;
int map[n];
double **a = malloc (sizeof (*a) * n);
int i, j, k;
for (i = 0; i < n; i++)
a[i] = malloc (sizeof (*a[i]) * n);
if (rank == 0)
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
a[i][j] = 1.0 / (i + j + 1);
MPI_Bcast (a, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
for (i = 0; i < n; i++)
map[i] = i % size;
MPI_Barrier (MPI_COMM_WORLD);
t1 = MPI_Wtime ();
for (k = 0; k < n - 1; k++)
{
double max = abs (a[k][k]);
int column = k;
for (j = k + 1; j < n; j++)
{
double absv = abs (a[k][j]);
if (absv > max)
{
max = absv;
column = j;
}
}
if (map[k] == rank && column != k && swap_raws (a, k, column))
{
printf("ERROR SWAPPING %d and %d columns\n", k, column);
return -1;
}
MPI_Bcast (&a[k], n, MPI_DOUBLE, map[k], MPI_COMM_WORLD);
MPI_Bcast (&a[column], n, MPI_DOUBLE, map[k], MPI_COMM_WORLD);
if (map[k] == rank)
for (i = k + 1; i < n; i++)
a[k][i] /= a[k][k];
MPI_Barrier (MPI_COMM_WORLD);
MPI_Bcast (&a[k][k+1], n - k - 1, MPI_DOUBLE, map[k], MPI_COMM_WORLD);
for (i = k + 1; i < n; i++)
if (map[i] == rank)
for (j = k + 1; j < n; j++)
a[j][i] -= a[j][k] * a[i][j];
}
t2 = MPI_Wtime ();
for (i = 0; i < n; i++)
if (map[i] == rank)
for (j = 0; j < n; j++)
{
double absv = abs (a[i][j]);
if (each_least > absv)
each_least = absv;
//printf ("a[%d][%d] = %lg\n", j, i, a[i][j]);
}
MPI_Reduce (&each_least, &least0, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce (&t1, &tbeg, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce (&t2, &tend, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
for (i = 0; i < n; i++)
free (a[i]);
free (a);
if (rank == 0)
{
*least = least0;
return (tend - tbeg);
}
}
int main (int argc, char *argv[])
{
int rank, size;
double min, max, aver, least;
if (n == 0)
return 0;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
// It works!
//double try = f_column_non_parallel (rank, size, &least);
double try = f_column (rank, size, &least);
aver = max = min = try;
for (int i = 1; i < repeats; i++)
{
//double try = f_column_non_parallel (rank, size, &least);
double try = f_column (rank, size, &least);
if (try < min)
min = try;
else if (try > max)
max = try;
aver += try;
}
aver /= repeats;
MPI_Finalize ();
if (rank == 0)
printf("N: %d\nMIN: %f\nMAX: %f\nAVER: %f\nLEAST: %lg\n", size, min, max, aver, least);
return 0;
}
I have the Hilbert matrix: a[i][j] = 1 / (i + j + 1) for i, j from 0 to n-1.
This code should find the LU decomposition using MPI, in order to do it in a parallel way.
The first process initialises the array and then broadcasts it to the other processes.
Then I find the maximum in the row and swap those columns. Then I would like to broadcast that data to every process, i.e. using MPI_Barrier (MPI_COMM_WORLD); but it fails with Segmentation fault: 11.
So I don't know what happened or how I can fix that problem. The same algorithm works in the non-parallel version of the program, but it doesn't work here.
If you find the solution, the example should work like this (I calculated it by hand; you can check it too, and I believe it is correct). The matrix (here j and i run vertically and horizontally respectively, which is not the most convenient layout for people, but take it as it is):
1 1/2 1/3 1 1/2 1/3 1 1/2 1/3 |1 1/2 1/3 |
1/2 1/3 1/4 -> 1/2 1/12 1/12 -> 1/2 1/12 1 -> |1/2 1/12 1/12 | <- answer
1/3 1/4 1/5 1/3 1/12 4/45 1/3 1/12 1/180 |1/3 1 1/180|
So the source matrix is:
|1 0 0| |1 1/2 1/3 | |1 1/2 1/3|
A = |1/2 1 0| * |0 1/12 1/12 | = |1/2 1/3 1/4|
|1/3 1 1| |0 0 1/180| |1/3 1/4 1/5|
Can you help me find the mistake I made? Thank you in advance :)
Your program has a bug in the following part of the code:
double **a = malloc (sizeof (*a) * n);
[...snip...]
MPI_Bcast (a, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
You are allocating 'n' pointers in "a", not an 'n * n' array. So when you do an 'n * n' sized MPI_Bcast of "a", you are asking MPI to transfer from garbage memory locations that are not allocated. This is causing MPI to segfault.
You can change "a" to simply "double *" instead of "double **" and allocate 'n * n' doubles in there to fix this issue.
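A short sketch of that change (my own code, keeping the rest of the program's conventions, so a[i][j] becomes a[i*n + j]):
/* one contiguous block of n*n doubles, so an n*n MPI_Bcast touches
   only memory that actually exists */
double *a = malloc (sizeof (*a) * n * n);
if (rank == 0)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i*n + j] = 1.0 / (i + j + 1);
MPI_Bcast (a, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/* likewise, the per-row broadcasts later become e.g.
   MPI_Bcast (&a[k*n], n, MPI_DOUBLE, map[k], MPI_COMM_WORLD); */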
What grieves me the most is that f_column() is supposed to return a double, but the return value is undefined when rank != 0.
This comment caught my attention:
// It works!
//double try = f_column_non_parallel (rank, size, &least);
double try = f_column (rank, size, &least);
It suggests that the previous version of f_column() was working, and that you ran into troubles when attempting to parallelize it (I'm guessing that's what you're doing now).
How this could lead to a segfault is not immediately apparent to me though. I'd expect a floating point exception.
A couple of other points:
I'm not too comfortable with your memory allocation code (I'd probably use calloc() instead of malloc(), and sizeof() on explicit data types, etc...); it just freaks me out to see things like a[i] = malloc(sizeof (*a[i]) * n);, but it's just a matter of style, really.
You appear to have proper bound checking (indices over a are always positive and < n).
Oh, and you're redefining abs(), which is probably not a good idea.
Try to compile your code in debug mode, and run it with gdb; also run it through valgrind if you can, MacOS X should be supported by now.
You should probably take a closer look at your compiler warnings ;-)
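On the abs() point: math.h already declares fabs() for doubles (abs() from stdlib.h is integer-only), so the hand-rolled version can simply be dropped:
#include <math.h>          /* declares fabs() for doubles */
/* then, instead of the custom abs(): */
double max  = fabs (a[k][k]);
double absv = fabs (a[k][j]);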
