I want to transfer a struct between processes, and for that I am trying to create an MPI struct datatype. The code is for an Ant Colony Optimization (ACO) algorithm.
The header file with the C struct contains:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>
#include <mpi.h>
/* Constants */
#define NUM_CITIES 100 // Number of cities
//among others
typedef struct {
int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
double tour_distance;
} ACO_Ant;
I tried to build my code as suggested in this thread.
Program code:
int main(int argc, char *argv[])
{
MPI_Datatype MPI_TABU, MPI_PATH, MPI_ANT;
// Initialize MPI
MPI_Init(&argc, &argv);
//Determines the size (&procs) of the group associated with a communicator (MPI_COMM_WORLD)
MPI_Comm_size(MPI_COMM_WORLD, &procs);
//Determines the rank (&rank) of the calling process in the communicator (MPI_COMM_WORLD)
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_contiguous(NUM_CITIES, MPI_INT, &MPI_TABU);
MPI_Type_contiguous(NUM_CITIES, MPI_INT, &MPI_PATH);
MPI_Type_commit(&MPI_TABU);
MPI_Type_commit(&MPI_PATH);
// Create ant struct
//int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
//double tour_distance;
int blocklengths[6] = {1,1, NUM_CITIES, NUM_CITIES, 1, 1};
MPI_Datatype types[6] = {MPI_INT, MPI_INT, MPI_TABU, MPI_PATH, MPI_INT, MPI_DOUBLE};
MPI_Aint offsets[6] = { offsetof( ACO_Ant, city ), offsetof( ACO_Ant, next_city), offsetof( ACO_Ant, tabu), offsetof( ACO_Ant, path ), offsetof( ACO_Ant, path_index ), offsetof( ACO_Ant, tour_distance )};
MPI_Datatype tmp_type;
MPI_Aint lb, extent;
MPI_Type_create_struct(6, blocklengths, offsets, types, &tmp_type);
MPI_Type_get_extent( tmp_type, &lb, &extent );
//Tried all of these
MPI_Type_create_resized( tmp_type, lb, extent, &MPI_ANT );
//MPI_Type_create_resized( tmp_type, 0, sizeof(MPI_ANT), &MPI_ANT );
//MPI_Type_create_resized( tmp_type, 0, sizeof(ant), &MPI_ANT );
MPI_Type_commit(&MPI_ANT);
printf("Return: %d\n" , MPI_Bcast(ant, NUM_ANTS, MPI_ANT, 0, MPI_COMM_WORLD));
}
But once the program reaches the MPI_Bcast call, it crashes with error code 11, which I initially presumed was MPI_ERR_TOPOLOGY as per this manual, but which is in fact a segfault (signal 11).
I am also unsure about some of the original author's code. Can someone explain why they create
MPI_Aint displacements[3];
MPI_Datatype typelist[3];
of size 3, but
int block_lengths[2];
of size 2, when the struct has only two members?
Code:
void ACO_Build_best(ACO_Best_tour *tour, MPI_Datatype *mpi_type /*out*/)
{
int block_lengths[2];
MPI_Aint displacements[3];
MPI_Datatype typelist[3];
MPI_Aint start_address;
MPI_Aint address;
block_lengths[0] = 1;
block_lengths[1] = NUM_CITIES;
typelist[0] = MPI_DOUBLE;
typelist[1] = MPI_INT;
displacements[0] = 0;
MPI_Address(&(tour->distance), &start_address);
MPI_Address(tour->path, &address);
displacements[1] = address - start_address;
MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
MPI_Type_commit(mpi_type);
}
All and any help will be appreciated.
Edit: I'm looking for help with solving the problem, not marginally useful StackOverflow jargon.
This part is wrong:
int blocklengths[6] = {1,1, NUM_CITIES, NUM_CITIES, 1, 1};
MPI_Datatype types[6] = {MPI_INT, MPI_INT, MPI_TABU, MPI_PATH, MPI_INT, MPI_DOUBLE};
MPI_Aint offsets[6] = { offsetof( ACO_Ant, city ), offsetof( ACO_Ant, next_city), offsetof( ACO_Ant, tabu), offsetof( ACO_Ant, path ), offsetof( ACO_Ant, path_index ), offsetof( ACO_Ant, tour_distance )};
The MPI_TABU and MPI_PATH datatypes already cover NUM_CITIES elements. When you specify the corresponding block size to also be NUM_CITIES, the resultant datatype will try to access NUM_CITIES * NUM_CITIES elements, likely resulting in a segfault (signal 11).
Either set all elements of blocklengths to 1 or replace MPI_TABU and MPI_PATH in the types array with MPI_INT.
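For example, a sketch of the second option (offsets unchanged; MPI_TABU and MPI_PATH are then not needed at all):
int blocklengths[6] = {1, 1, NUM_CITIES, NUM_CITIES, 1, 1};
MPI_Datatype types[6] = {MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_DOUBLE};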
This part is also wrong:
MPI_Type_create_struct(6, blocklengths, offsets, types, &tmp_type);
MPI_Type_get_extent( tmp_type, &lb, &extent );
//Tried all of these
MPI_Type_create_resized( tmp_type, lb, extent, &MPI_ANT );
//MPI_Type_create_resized( tmp_type, 0, sizeof(MPI_ANT), &MPI_ANT );
//MPI_Type_create_resized( tmp_type, 0, sizeof(ant), &MPI_ANT );
MPI_Type_commit(&MPI_ANT);
Calling MPI_Type_create_resized with the values returned by MPI_Type_get_extent is meaningless since it just duplicates the type without actually resizing it. Using sizeof(MPI_ANT) is wrong since MPI_ANT is not a C type but an MPI handle, which is either an integer index or a pointer (implementation-dependent). It will work with sizeof(ant) if ant is of type ACO_Ant, but given you call MPI_Bcast(ant, NUM_ANTS, ...), then ant is either a pointer, in which case sizeof(ant) is just the pointer size, or it is an array, in which case sizeof(ant) is NUM_ANTS times larger than it must be. The correct call is:
MPI_Type_create_resized(tmp_type, 0, sizeof(ACO_Ant), &ant_type);
MPI_Type_commit(&ant_type);
And please, never use MPI_ as prefix in your own variable or function names. This makes the code unreadable and is very misleading ("is that a predefined MPI datatype or a user-defined one?")
As for the last question, the author might have had a different structure in mind. Nothing stops you from using larger arrays as long as you call MPI_Type_create_struct with the correct number of significant elements.
Note: You don't have to commit MPI datatypes that are never used directly in communication calls. I.e., those two lines are unnecessary:
MPI_Type_commit(&MPI_TABU);
MPI_Type_commit(&MPI_PATH);
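Putting both fixes together, the tail end of the type construction would then look roughly like this, assuming ant is declared as ACO_Ant ant[NUM_ANTS] (the declaration is not shown in your excerpt):
MPI_Datatype tmp_type, ant_type;
MPI_Type_create_struct(6, blocklengths, offsets, types, &tmp_type);
/* Resize to the true C struct size so consecutive array elements line up */
MPI_Type_create_resized(tmp_type, 0, sizeof(ACO_Ant), &ant_type);
MPI_Type_commit(&ant_type);
MPI_Type_free(&tmp_type);
MPI_Bcast(ant, NUM_ANTS, ant_type, 0, MPI_COMM_WORLD);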
Related
Trying to send an array of my data structure, I receive a strange error which, however, does not stop my execution, and the data sent seems to arrive correctly.
Running various tests, I noticed that the offset calculated by offsetof changes according to the size of the word array, but not uniformly.
My struct
#define WORD_SIZE 40
typedef struct Word{
char word[WORD_SIZE];
int frequecy;
}Word;
How I commit it:
MPI_Datatype types[2] = {MPI_CHAR, MPI_INT};
int blocklengths[2] = {WORD_SIZE,1};
MPI_Aint offsets[2] = {
offsetof(Word, word),
offsetof(Word, frequecy),
};
printf("%ld %ld\n",offsetof(Word, word),offsetof(Word, frequecy));
MPI_Datatype MPI_MY_WORD;
MPI_Type_create_struct(2, blocklengths, offsets, types, &MPI_MY_WORD);
MPI_Type_commit(&MPI_MY_WORD);
Running with WORD_SIZE 40 I get this error:
[a1e112a4a20e:02386] Read -1, expected 114180, errno = 1
and the printf prints 0 40.
With WORD_SIZE set to 30 instead, I get no error and the printf prints 0 32.
I do not understand why the offset values do not change uniformly, and why I only get errors from MPI for some values of WORD_SIZE.
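For reference, here is a tiny standalone test of the offsets (plain C, no MPI); I assume the jump from 30 to 32 comes from the 4-byte alignment of int:
#include <stdio.h>
#include <stddef.h>
#define WORD_SIZE 30
typedef struct Word{
    char word[WORD_SIZE]; /* bytes 0 .. WORD_SIZE-1 */
    int frequecy;         /* must start at the next multiple of 4, so padding may be inserted */
}Word;
int main(void){
    /* Assuming 4-byte int alignment: with WORD_SIZE 30 this prints 32 36, with WORD_SIZE 40 it prints 40 44 */
    printf("%zu %zu\n", offsetof(Word, frequecy), sizeof(Word));
    return 0;
}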
Here is the part of the code where I send and receive the data.
if(rank == 1){
...
to_send_array[size];
...
MPI_Send(to_send_array, size, MPI_MY_WORD, 0, 1, MPI_COMM_WORLD);
}else if(rank == 0){
MPI_Status status;
Word buff[size];
MPI_Recv(buff, size, MPI_MY_WORD, 1, 1, MPI_COMM_WORLD, &status);
printf("P(%d) first elemeent %s %d\n",rank,buff[0].word,buff[0].frequecy);
}
I am using Open MPI 2.1.1.
Can anyone help me?
I would like to have an example showing how to use MPI_Type_create_subarray to build a 2D cyclic distribution for a large matrix.
I know that MPI_Type_create_darray will give me a 2D cyclic distribution, but it is not compatible with the SCALAPACK process grid.
I would like to do a 2D block-cyclic distribution using MPI_Type_create_subarray and pass the matrices to SCALAPACK routines.
Could I have an example showing this?
There are at least two parts to your question. The following sections address these two component pieces but leave their integration to you. The example code in both sections, along with the explanations provided in the ScaLapack link below, should provide some guidance...
From DeinoMPI:
The following sample code illustrates MPI_Type_create_subarray.
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
int myrank;
MPI_Status status;
MPI_Datatype subarray;
int array[9] = { -1, 1, 2, 3, -2, -3, -4, -5, -6 };
int array_size[] = {9};
int array_subsize[] = {3};
int array_start[] = {1};
int i;
MPI_Init(&argc, &argv);
/* Create a subarray datatype */
MPI_Type_create_subarray(1, array_size, array_subsize, array_start, MPI_ORDER_C, MPI_INT, &subarray);
MPI_Type_commit(&subarray);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
MPI_Send(array, 1, subarray, 1, 123, MPI_COMM_WORLD);
}
else if (myrank == 1)
{
for (i=0; i<9; i++)
array[i] = 0;
MPI_Recv(array, 1, subarray, 0, 123, MPI_COMM_WORLD, &status);
for (i=0; i<9; i++)
printf("array[%d] = %d\n", i, array[i]);
fflush(stdout);
}
MPI_Finalize();
return 0;
}
And from ScaLapack in C essentials:
Unfortunately, there is no C interface for ScaLAPACK or PBLAS. All parameters should be passed into routines and functions by reference; you can also define constants (i_one for 1, i_negone for -1, d_two for 2.0E+0, etc.) to pass into routines. Matrices should be stored as 1D arrays (A[i + lda*j], not A[i][j]).
To invoke ScaLAPACK routines in your program, you should first initialize the grid via BLACS routines (BLACS is enough). Second, you should distribute your matrix over the process grid (block cyclic 2D distribution). You can do this by means of the pdgeadd_ PBLAS routine. This routine computes the sum of two matrices A and B: B := alpha*A + beta*B. Matrices can have different distributions; in particular, matrix A can be owned by only one process, so by setting alpha=1, beta=0 you can simply copy your non-distributed matrix A into the distributed matrix B.
Third, call pdgeqrf_ for matrix B. At the end of the ScaLAPACK part of the code, you can collect the results on one process (just copy the distributed matrix into a local one via pdgeadd_). Finally, close the grid via blacs_gridexit_ and blacs_exit_.
After all that, a ScaLAPACK-using program should contain the following:
void main(){
// Useful constants
const int i_one = 1, i_negone = -1, i_zero = 0;
const double zero=0.0E+0, one=1.0E+0;
... (See the rest of code in linked location above...)
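To make the steps above concrete, here is a rough sketch of just the grid setup and teardown, assuming the usual Fortran-style BLACS entry points are available (names and signatures assumed; check your ScaLAPACK installation); see the linked page for the complete working code with pdgeadd_ and pdgeqrf_:
/* Fortran-style BLACS routines: all arguments are passed by reference */
extern void blacs_pinfo_(int *mypnum, int *nprocs);
extern void blacs_get_(int *ctxt, int *what, int *val);
extern void blacs_gridinit_(int *ctxt, char *order, int *nprow, int *npcol);
extern void blacs_gridinfo_(int *ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void blacs_gridexit_(int *ctxt);
extern void blacs_exit_(int *cont);

int main(void){
    int i_zero = 0, i_negone = -1;
    int mypnum, nprocs, ctxt, nprow = 2, npcol = 2, myrow, mycol;
    blacs_pinfo_(&mypnum, &nprocs);                /* my process number and total process count */
    blacs_get_(&i_negone, &i_zero, &ctxt);         /* obtain the default system context */
    blacs_gridinit_(&ctxt, "Row", &nprow, &npcol); /* create a nprow x npcol process grid */
    blacs_gridinfo_(&ctxt, &nprow, &npcol, &myrow, &mycol);
    /* ... distribute the matrix (pdgeadd_), call pdgeqrf_, collect the result ... */
    blacs_gridexit_(&ctxt);                        /* release the grid */
    blacs_exit_(&i_zero);                          /* 0 = completely done, no further MPI use */
    return 0;
}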
I would like to gather data from arrays of double and organize them at the same time. Say we have 2 MPI ranks:
if(rank == 0)
P = {0,1,4,5,8,9};
else
P = {2,3,6,7,10,11};
How could I gather the information in P and place it in order, i.e. P on the master should contain P = [0 1 2 ... 9 10 11]?
I could gather P as it is and then reorganize it in the root, but this approach would not be very efficient as P grows. I have tried creating an MPI_Type_vector, but I have not managed to get it right yet. Any ideas?
It depends a little bit on what you mean by "in order". If you mean that, as in the above example, each vector is made up of blocks of data and you want those blocks interleaved in a fixed known order, yes, you can certainly do this. (The question could also be read to be asking if you can do a sort as part of the gather; that's rather harder.)
You have the right approach; you want to send the data as is, but receive the data into specified chunks broken up by processor. Here, the data type you want to receive into looks like this:
MPI_Datatype vectype;
MPI_Type_vector(NBLOCKS, BLOCKSIZE, size*BLOCKSIZE, MPI_CHAR, &vectype);
That is, for a given processor's input, you're going to receive it into NBLOCKS blocks of size BLOCKSIZE, each separated by however many processors there are times the blocksize. As it is, you could receive into that type; to gather into that type, however, you need to set the extents so that the data from each processor is gathered into the right place:
MPI_Datatype gathertype;
MPI_Type_create_resized(vectype, 0, BLOCKSIZE*sizeof(char), &gathertype);
MPI_Type_commit(&gathertype);
The reason for that resizing is given in, for instance, this answer, and likely elsewhere on this site as well.
Putting this together into sample code gives us the following:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
const int BLOCKSIZE=2; /* each block of data is 2 items */
const int NBLOCKS =3; /* each task has 3 such blocks */
char locdata[NBLOCKS*BLOCKSIZE];
for (int i=0; i<NBLOCKS*BLOCKSIZE; i++)
locdata[i] = 'A' + (char)rank; /* rank 0 = 'AAA..A'; rank 1 = 'BBB..B', etc */
MPI_Datatype vectype, gathertype;
MPI_Type_vector(NBLOCKS, BLOCKSIZE, size*BLOCKSIZE, MPI_CHAR, &vectype);
MPI_Type_create_resized(vectype, 0, BLOCKSIZE*sizeof(char), &gathertype);
MPI_Type_commit(&gathertype);
char *globaldata = NULL;
if (rank == 0) globaldata = malloc((NBLOCKS*BLOCKSIZE*size+1)*sizeof(char));
MPI_Gather(locdata, BLOCKSIZE*NBLOCKS, MPI_CHAR,
globaldata, 1, gathertype,
0, MPI_COMM_WORLD);
if (rank == 0) {
globaldata[NBLOCKS*BLOCKSIZE*size] = '\0';
printf("Assembled data:\n");
printf("<%s>\n", globaldata);
free(globaldata);
}
MPI_Type_free(&gathertype);
MPI_Finalize();
return 0;
}
Running gives:
$ mpirun -np 3 ./vector
Assembled data:
<AABBCCAABBCCAABBCC>
$ mpirun -np 7 ./vector
Assembled data:
<AABBCCDDEEFFGGAABBCCDDEEFFGGAABBCCDDEEFFGG>
I am trying to convert a program I wrote to OpenCL, but I am not familiar enough with it yet. Still, I am having trouble with one of my (three) kernels. It is basically a complex matrix-vector multiplication, but I am writing it to fit my needs better.
The problem is, I can't get the kernel to work on the GPU. I have simplified it as much as possible (2 lines) and debugged it on the CPU, where it works perfectly. But on the GPU everything goes wrong. I'm working on a MacBook Pro; on the NVIDIA GeForce 650M I get one result, while on the integrated Intel HD 4000 I get another. The kernel is
__kernel void Chmv_(__global float2 *H, const float alpha, __global float2 *vec,
const int off/*in number of elements*/,
__local float2 *vw,
__global float2 *vout)
{
int gidx=get_global_id(0);
int gidy=get_global_id(1);
int gs=get_global_size(0);
vout[gidx].x += alpha*(H[gidx+gidy*gs].x*vec[gidy].x-H[gidx+gidy*gs].y*vec[gidy].y);
vout[gidx].y += alpha*(H[gidx+gidy*gs].y*vec[gidy].x+H[gidx+gidy*gs].x*vec[gidy].y);
}
For tests, I let the matrix H be a 4x4 matrix filled with (1.0f, 0.0f), while the input vector vec has x components (0.0, 1.0, 2.0, 3.0) and y components 0. alpha is set to 2.0f. So I should get (12, 12, 12, 12) as the x output, and I do if I use the CPU. NVIDIA gives me 6.0, while Intel gives me 4.0.
Now, closer inspection showed me that if the input vector is (0,1,2,0), NVIDIA gives me 0 as the answer, and if it is (0,1,0,3), Intel gives 0 as well. By the way, changing vec[gidy] to vec[gidx] just gives me the vector doubled. From this, it seems to me that the kernel is executing correctly only in one dimension, x, while seeing only one value of get_global_id(1), which is clearly not ok.
I will add below the test function which calls this kernel. Now, does anyone have any idea of what can be going on?
void _test_(){
cl_mem mat,vec, out;
size_t gs[2]={4,4};
size_t ls[2]={1,4};
size_t cpuws[2]={1,1};
cl_float2 *A=(cl_float2*)calloc(gs[0]*gs[0], sizeof(cl_float2));
cl_float2 *v=(cl_float2*)calloc(gs[0], sizeof(cl_float2));
cl_float2 *w=(cl_float2*)calloc(gs[0], sizeof(cl_float2));
int i;
for (i=0; i<gs[0]; i++) {
A[i*gs[0]].x=1.0;
A[i*gs[0]+1].x= 1.0;//(i<ls-1)? 1.0f:0.0f;
A[i*gs[0]+2].x=1.0;
A[i*gs[0]+3].x=1.0;
v[i].x= (float)i;
printf("%d %f %f %f %f\n%v2f\n",i, A[i*gs[0]].x, A[i*gs[0]+1].x, A[i*gs[0]+2].x, A[i*gs[0]+3].x, v[i]);
}
v[2].x=0.0f; //<--- set individually for debug
mat = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0]*gs[0]*sizeof(cl_float2), NULL, NULL);
vec = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0]*sizeof(cl_float2), NULL, NULL);
out = clCreateBuffer(context, CL_MEM_READ_WRITE, gs[0]*sizeof(cl_float2), NULL, NULL);
error = clEnqueueWriteBuffer(queue, mat, CL_TRUE, 0, gs[0]*gs[0]*sizeof(cl_float2), A, 0, NULL, NULL);
error = clEnqueueWriteBuffer(queue, vec, CL_TRUE, 0, gs[0]*sizeof(cl_float2), v, 0, NULL, NULL);
error = clEnqueueWriteBuffer(queue, out, CL_TRUE, 0, gs[0]*sizeof(cl_float2), w, 0, NULL, NULL);
int offset=0;
float alpha=2.0;
error = clSetKernelArg(Chmv_, 0, sizeof(cl_mem),&mat);
error |= clSetKernelArg(Chmv_, 1, sizeof(float), &alpha);
error |= clSetKernelArg(Chmv_, 2, sizeof(cl_mem),&vec);
error |= clSetKernelArg(Chmv_, 3, sizeof(int), &offset);
error |= clSetKernelArg(Chmv_, 4, gs[0]*sizeof(cl_float2), NULL);
error |= clSetKernelArg(Chmv_, 5, sizeof(cl_mem), &out);
assert(error == CL_SUCCESS);
error = clEnqueueNDRangeKernel(queue, Chmv_, 2, NULL, gs, NULL, 0, NULL, &event);
error = clEnqueueReadBuffer(queue, out, CL_TRUE, 0, gs[0]*sizeof(cl_float2), w, 0, NULL, NULL);
clFinish(queue);
for (i=0; i<gs[0]; i++) {
printf("%f %f\n", w[i].x, w[i].y);
}
clReleaseMemObject(mat);
clReleaseMemObject(vec);
clReleaseMemObject(out);
}
You are experiencing a typical problem of multithreaded, unsafe access to a common memory zone (vout).
You have to keep in mind that all of the work-items may run concurrently. This means they will read and write memory in any order.
When you execute on the CPU, the problem does not show up since the execution is serialized by the hardware.
On the GPU, however, some work-items read vout, increment it and write it back, while others read vout before the new value has been written by the previous work-items.
Since your kernel is small, probably all your work-items run in parallel, which is why you only see one of them contributing to the final result.
This is a typical parallel reduction problem. You can google it for more details. What you need is to synchronize all the threads when accessing vout, either with atomic_add() (slow) or with a proper reduction (harder to code). You can check this guide; it is for CUDA but the basic idea is more or less the same: Reduction Guide
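For instance, a minimal race-free sketch (not your original kernel, and not a full work-group reduction) gives each output element to exactly one work-item and lets it loop over the columns itself, launched as a 1D NDRange of n work-items:
// One work-item per row of the result; each accumulates its own sum,
// so no two work-items ever write the same vout element.
__kernel void Chmv_rowsum(__global const float2 *H, const float alpha,
                          __global const float2 *vec, const int n,
                          __global float2 *vout)
{
    int row = get_global_id(0);
    float2 acc = (float2)(0.0f, 0.0f);
    for (int col = 0; col < n; ++col) {
        float2 h = H[row + col * n];   // same column-major indexing as your kernel
        float2 v = vec[col];
        acc.x += alpha * (h.x * v.x - h.y * v.y);
        acc.y += alpha * (h.y * v.x + h.x * v.y);
    }
    vout[row] += acc;
}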
I'm looking at someone else's MPI code, and in a number of places variables are declared in main() and used in other functions (some of them MPI-specific). I am new to MPI, but in my programming experience this is normally not supposed to be done. Basically, it is difficult for me to determine whether it is safe to do this (no errors are thrown).
The entire code is quite long so I will just give a simplified version below:
int main(int argc, char** argv) {
// ...unrelated code
int num_procs, local_rank, name_len;
MPI_Comm comm_new;
MPI_Init(&argc, &argv);
MPI_Get_processor_name(proc_name, &name_len);
create_ring_topology(&comm_new, &local_rank, &num_procs);
// ...unrelated code
MPI_Comm_free(&comm_new);
MPI_Finalize();
}
void create_ring_topology(MPI_Comm* comm_new, int* local_rank, int* num_procs) {
MPI_Comm_size(MPI_COMM_WORLD, num_procs);
int dims[1], periods[1];
int dimension = 1;
dims[0] = *num_procs;
periods[0] = 1;
int* local_coords = malloc(sizeof(int)*dimension);
MPI_Cart_create(MPI_COMM_WORLD, dimension, dims, periods, 0, comm_new);
MPI_Comm_rank(*comm_new, local_rank);
MPI_Comm_size(*comm_new, num_procs);
MPI_Cart_coords(*comm_new, *local_rank, dimension, local_coords);
sprintf(s_local_coords, "[%d]", local_coords[0]);
}
That's just regular pointer usage. Nothing wrong with that.
The variables are declared in main and remain in-scope until main returns, i.e. almost for the duration of the program.
Note that MPI does not actually add anything to C; it is just an extra library. It does not extend the language.