Sending a C struct via MPI fails partially

I am sending a (particle) struct using the MPI_Type_create_struct() as done e.g. here, or explained in detail here.
I'm collecting all particles which are going to a specific proc, memcpy() them into the send buffer and MPI_Isend() them.
So far, so good. MPI_Iprobe()'ing for the message gives me the right count of particles sent.
So I MPI_Recv() the buffer and extract the data (now even by copying the struct one by one). No matter how many particles I send, only the first particle's data is correct.
There are three possible mistakes:
The MPI_Type_create_struct() doesn't create a proper map of my struct, due to my usage of offsetof() like in the first link. Maybe my struct contains invisible padding, as explained in the second link.
I'm making a simple mistake while copying particles into the send buffer and back from the receive buffer (I do print the send buffer, and it works, but maybe I'm overlooking something).
Something totally different.
(Sorry for the really ugly presentation of the code, I could not manage to present it in a decent way. You'll find the code here, with the line already marked, on GitHub, too!)
Here is the construction of the MPI datatype,
typedef struct {
int ID;
double x[DIM];
} pchase_particle_t;
const int items = 2;
int block_lengths[2] = {1, DIM};
MPI_Datatype mpi_types[2] = {MPI_INT, MPI_DOUBLE};
MPI_Aint offsets[2];
offsets[0] = offsetof(pchase_particle_t, ID);
offsets[1] = offsetof(pchase_particle_t, x);
MPI_Type_create_struct(items, block_lengths, offsets, mpi_types, &W->MPI_Particle);
MPI_Type_commit(&W->MPI_Particle);
the sending
/* handle all mpi send/recv status data */
MPI_Request *send_request = P4EST_ALLOC(MPI_Request, W->p4est->mpisize);
MPI_Status *recv_status = P4EST_ALLOC(MPI_Status, W->p4est->mpisize);
/* setup send/recv buffers */
pchase_particle_t **recv_buf = P4EST_ALLOC(pchase_particle_t *, num_senders);
pchase_particle_t **send_buf = P4EST_ALLOC(pchase_particle_t *, num_receivers);
int recv_count = 0, recv_length, flag, j;
/* send all particles to their belonging procs */
for (i = 0; i < num_receivers; i++) {
/* resolve particle list for proc i */
sc_list_t *tmpList = *((sc_list_t **) sc_array_index(W->particles_to, receivers[i]));
pchase_particle_t * tmpParticle;
int send_count = 0;
/* get space for the particles to be sent */
send_buf[i] = P4EST_ALLOC(pchase_particle_t, tmpList->elem_count);
/* copy all particles into the send buffer and remove them from this proc */
while(tmpList->first != NULL){
tmpParticle = sc_list_pop(tmpList);
memcpy(send_buf[i] + send_count * sizeof(pchase_particle_t), tmpParticle, sizeof(pchase_particle_t));
/* free particle */
P4EST_FREE(tmpParticle);
/* update particle counter */
send_count++;
}
/* print send buffer */
for (j = 0; j < send_count; j++) {
pchase_particle_t *tmpParticle = send_buf[i] + j * sizeof(pchase_particle_t);
printf("[pchase %i sending] particle[%i](%lf,%lf)\n", W->p4est->mpirank, tmpParticle->ID, tmpParticle->x[0], tmpParticle->x[1]);
}
printf("[pchase %i sending] particle count: %i\n", W->p4est->mpirank, send_count);
/* send particles to right owner */
mpiret = MPI_Isend(send_buf[i], send_count, W->MPI_Particle, receivers[i], 13, W->p4est->mpicomm, &send_request[i]);
SC_CHECK_MPI(mpiret);
}
and the receiving.
recv_count = 0;
/* check for messages until all arrived */
while (recv_count < num_senders) {
/* probe if any of the sender has already sent his message */
for (i = 0; i < num_senders; i++) {
MPI_Iprobe(senders[i], MPI_ANY_TAG, W->p4est->mpicomm,
&flag, &recv_status[i]);
if (flag) {
/* resolve number of particles receiving */
MPI_Get_count(&recv_status[i], W->MPI_Particle, &recv_length);
printf("[pchase %i receiving message] %i particles arrived from sender %i with tag %i\n",
W->p4est->mpirank, recv_length, recv_status[i].MPI_SOURCE, recv_status[i].MPI_TAG);
/* get space for the particles to be sent */
recv_buf[recv_count] = P4EST_ALLOC(pchase_particle_t, recv_length);
/* receive a list with recv_length particles */
mpiret = MPI_Recv(recv_buf[recv_count], recv_length, W->MPI_Particle, recv_status[i].MPI_SOURCE,
recv_status[i].MPI_TAG, W->p4est->mpicomm, &recv_status[i]);
SC_CHECK_MPI(mpiret);
/*
* insert all received particles into the
* push list
*/
pchase_particle_t *tmpParticle;
for (j = 0; j < recv_length; j++) {
/*
* retrieve all particle details from
* recv_buf
*/
tmpParticle = recv_buf[recv_count] + j * sizeof(pchase_particle_t);
pchase_particle_t *addParticle = P4EST_ALLOC(pchase_particle_t,1);
addParticle->ID=tmpParticle->ID;
addParticle->x[0] = tmpParticle->x[0];
addParticle->x[1] = tmpParticle->x[1];
printf("[pchase %i receiving] particle[%i](%lf,%lf)\n",
W->p4est->mpirank, addParticle->ID, addParticle->x[0], addParticle->x[1]);
/* push received particle to push list and update world counter */
sc_list_append(W->particle_push_list, addParticle);
W->n_particles++;
}
/* we received another particle list */
recv_count++;
}
}
}
edit: reindented..
edit: "Only the first particle's data is correct" means that all its properties (ID and coordinates) are identical to those of the sent particle. The others, however, are initialized with zeros, i.e. ID=0, x[0]=0.0, x[1]=0.0. Maybe that's a hint for the solution.

There is an error in your pointer arithmetic. send_buf[i] is already of type pchase_particle_t *, so send_buf[i] + j * sizeof(pchase_particle_t) does not point to the j-th element of the i-th buffer but to the (j * sizeof(pchase_particle_t))-th element. Thus your particles are not stored contiguously in memory but are separated by sizeof(pchase_particle_t) - 1 empty array elements. These get sent instead of the correct particles because the MPI_Isend call accesses buffer memory contiguously. The same applies to the code of the receiver.
You do not see the error in the sender code because your debug print uses the same incorrect pointer arithmetic and hence accesses memory with the same stride. I guess your send counts are small and the memory is allocated on the data segment heap; otherwise you would have received SIGSEGV for out-of-bounds array access very early in the data packing process (e.g. in the memcpy part).
Resolution: do not multiply the array index by sizeof(pchase_particle_t). Use send_buf[i] + send_count in the memcpy, and send_buf[i] + j (or recv_buf[recv_count] + j on the receiver) when reading elements back.
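A minimal sketch of the difference between element indexing and byte offsets. The struct particle_t, the value DIM = 2 and the helper names are assumptions made for illustration; they only stand in for the asker's pchase_particle_t code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIM 2  /* assumed dimension, for illustration only */

typedef struct {
    int ID;
    double x[DIM];
} particle_t;  /* hypothetical stand-in for pchase_particle_t */

/* Correct packing: buf + k already advances by k whole structs,
   so no multiplication by sizeof is needed. */
void pack_particle(particle_t *buf, size_t k, const particle_t *p)
{
    assert(buf != NULL && p != NULL);
    memcpy(buf + k, p, sizeof *p);
}

/* Element index that the buggy expression
   buf + j * sizeof(particle_t) actually refers to. */
size_t buggy_slot(size_t j)
{
    return j * sizeof(particle_t);
}

/* Number of untouched elements between two consecutive slots
   written by the buggy code. */
size_t buggy_gap(void)
{
    return buggy_slot(1) - buggy_slot(0) - 1;
}
```

On a platform where sizeof(particle_t) is 24, the buggy code would put particle 1 into slot 24 instead of slot 1, which matches the observation that only the first received particle is correct.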

Related

A lot of 0's received when using cudaMemcpy()

I've just started to learn CUDA and I wanted to fill an array (a 2D array represented as a 1D array) with random numbers. I followed other posts in order to generate random numbers, but I don't know if there is a problem with the generation of the numbers, with copying the memory back from the device, or with anything else. The problem is that, even though I have tried to fill each cell of the array with the id of the thread that is attending it in order to see the results after copying into host memory, I receive an array that is filled with 0 in every position after retrieving the data with cudaMemcpy().
I'm programming in Visual Studio 2013, with CUDA 7.5, on an i5 2500k processor and a GTX 960 graphics card.
Here is the main and the method where I try to fill it. I'll include the cuRand initialization too. If you need to see something else, just tell me.
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
__global__ void poblar(int * adn, curandState * state){
curandState localState = state[threadIdx.x];
int random = curand(&localState);
adn[threadIdx.x] = random;
// It doesn't mind if i use the following instruction, the result is a lot of 0's
//adn[threadIdx.x] = threadIdx.x;
}
int main()
{
const int adnLength = NUMCROMOSOMAS * SIZECROMOSOMAS; // 256 * 128 (32.768)
const size_t adnSize = adnLength * sizeof(int);
int adnCPU[adnLength];
int * adnDevice;
cudaError_t error = cudaSetDevice(0);
if (error != cudaSuccess)
exit(-EXIT_FAILURE);
curandState * randState;
error = cudaMalloc(&randState, adnLength * sizeof(curandState));
if (error != cudaSuccess){
cudaFree(randState);
exit(-EXIT_FAILURE);
}
//Here is initialized cuRand
setup_cuRand <<<1, adnLength >>> (randState, unsigned(time(NULL)));
error = cudaMalloc((void **)&adnDevice, adnSize);
if (error == cudaErrorMemoryAllocation){// cudaSuccess){
cudaFree(adnDevice);
cudaFree(randState);
printf("\n error");
exit(-EXIT_FAILURE);
}
poblar <<<1, adnLength >>> (adnDevice, randState);
error = cudaMemcpy(adnCPU, adnDevice, adnSize, cudaMemcpyDeviceToHost);
//After here, for any i, adnCPU[i] is 0 and i cannot figure what is wrong
if (error == cudaSuccess){
for (int i = 0; i < NUMCROMOSOMAS; i++){
for (int j = 0; j < SIZECROMOSOMAS; j++){
printf("%i,", adnCPU[(i*SIZECROMOSOMAS) + j]);
}
printf("\n");
}
}
return 0;
}
EDIT after the answer solved it: one detail beyond the answer given is that I needed a lower number of threads (half of that quantity worked for me) in order to seed the random numbers correctly with cuRand. For some reason, I could create the threads perfectly but I couldn't seed the pseudo-random generator.
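The maximum number of threads per block is 1024 on your hardware, so you cannot launch a kernel with adnLength threads in a single block when adnLength is larger than 1024 (here it is 32768).
The error you are having is most probably a launch configuration error. It is returned by cudaPeekAtLastError (or cudaGetLastError) right after the triple-angle-bracket call, because it occurs before any GPU work starts. cudaMemcpy may not return it, even though it does return errors from previous asynchronous calls.
For a block that is too large, the error is cudaErrorInvalidConfiguration. cudaErrorLaunchOutOfResources may occur instead when the block size is within the limit but the kernel needs more registers than a block of that size can get, which might explain why curand_init only worked with fewer threads.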
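One way to stay under the per-block limit is to spread the threads over several blocks. The arithmetic can be sketched in plain C. The function names and the block size of 256 are assumptions for illustration; inside a kernel the index would be computed as blockIdx.x * blockDim.x + threadIdx.x and checked against the array length.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements when each block
   runs threads_per_block threads (ceiling division). */
unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index handled by one thread; mirrors
   blockIdx.x * blockDim.x + threadIdx.x inside a kernel. */
unsigned int global_index(unsigned int block_idx, unsigned int block_dim,
                          unsigned int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

For adnLength = 32768 and 256 threads per block this gives 128 blocks, so a launch would look something like poblar<<<128, 256>>>(adnDevice, randState), with both kernels indexing by the global index rather than by threadIdx.x alone.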

How is this tcp socket code handling the rx buffer?

I came across this tcp server example, provided with the Altera Nios II processor, and I'm not getting the section on handling the rx_buffer.
server.h
typedef struct SSS_SOCKET {
enum {
READY, COMPLETE, CLOSE
} state;
int fd;
int close;
INT8U rx_buffer[SSS_RX_BUF_SIZE];
INT8U *rx_rd_pos; /* position we've read up to */
INT8U *rx_wr_pos; /* position we've written up to */
} SSSConn;
server.c
int data_used = 0, rx_code = 0;
INT8U *lf_addr;
conn->rx_rd_pos = conn->rx_buffer;
conn->rx_wr_pos = conn->rx_buffer;
printf("[sss_handle_receive] processing RX data\n");
while (conn->state != CLOSE) {
/* Find the Carriage return which marks the end of the header */
lf_addr = strchr(conn->rx_buffer, '\n');
if (lf_addr) {
/* go off and do whatever the user wanted us to do */
sss_exec_command(conn);
}
/* No newline received? Then ask the socket for data */
else {
rx_code = recv(conn->fd, conn->rx_wr_pos,
SSS_RX_BUF_SIZE - (conn->rx_wr_pos - conn->rx_buffer) -1, 0);
if (rx_code > 0) {
conn->rx_wr_pos += rx_code;
/* Zero terminate so we can use string functions */
*(conn->rx_wr_pos + 1) = 0;
}
}
/*
* When the quit command is received, update our connection state so that
* we can exit the while() loop and close the connection
*/
conn->state = conn->close ? CLOSE : READY;
/* Manage buffer */
data_used = conn->rx_rd_pos - conn->rx_buffer;
memmove(conn->rx_buffer, conn->rx_rd_pos,
conn->rx_wr_pos - conn->rx_rd_pos);
conn->rx_rd_pos = conn->rx_buffer;
conn->rx_wr_pos -= data_used;
memset(conn->rx_wr_pos, 0, data_used);
}
Specifically, I don't see the purpose of the data_used variable. rx_rd_pos is pointing to rx_buffer and there doesn't appear to be an operation on either, so how will they be different? In fact, the only thing that seems to happen under Manage buffer is the copying of data into rx_buffer. I'm sure I'm missing something simple, but I can't seem to see it.
Thanks for any help in advance.
Edit: Here's the sss_exec_command() function.
void sss_exec_command(SSSConn* conn) {
int bytes_to_process = conn->rx_wr_pos - conn->rx_rd_pos;
INT8U tx_buf[SSS_TX_BUF_SIZE];
INT8U *tx_wr_pos = tx_buf;
INT8U error_code;
/*
* "SSSCommand" is declared static so that the data will reside
* in the BSS segment. This is done because a pointer to the data in
* SSSCommand
* will be passed via SSSLedCommandQ to the LEDManagementTask.
* Therefore SSSCommand cannot be placed on the stack of the
* SSSSimpleSocketServerTask, since the LEDManagementTask does not
* have access to the stack of the SSSSimpleSocketServerTask.
*/
static INT32U SSSCommand;
SSSCommand = CMD_LEDS_BIT_0_TOGGLE;
while (bytes_to_process--) {
SSSCommand = toupper(*(conn->rx_rd_pos++));
if (SSSCommand >= ' ' && SSSCommand <= '~') {
tx_wr_pos += sprintf(tx_wr_pos,
"--> Simple Socket Server Command %c.\n",
(char) SSSCommand);
if (SSSCommand == CMD_QUIT) {
tx_wr_pos += sprintf(tx_wr_pos,
"Terminating connection.\n\n\r");
conn->close = 1;
} else {
error_code = OSQPost(SSSLEDCommandQ, (void *) SSSCommand);
alt_SSSErrorHandler(error_code, 0);
}
}
}
send(conn->fd, tx_buf, tx_wr_pos - tx_buf, 0);
return;
}
Answers below are correct. I missed the pointer arithmetic on rx_rd in the command function :P
That section removes data from the buffer once it has been processed. The code you posted never uses the data stored in the buffer, but the sss_exec_command function will, after a newline is received. That function is passed the connection, so it can increment the read position by however much it uses.
After data is used, the buffer management section reclaims the space. The amount of data left in the buffer is the difference between the write and read positions. This much data is moved from the read position to the start of the buffer, then the read and write pointers are updated to their new positions. The read position is set to the start of the buffer, and the write position is decremented by data_used, which is the original difference between the start of the buffer and the read pointer, i.e. the amount of data used.
Assuming the code actually works, data_used = conn->rx_rd_pos - conn->rx_buffer implies that rx_rd_pos changes somewhere. It would change when the code consumes the data written into the buffer (data is written at rx_wr_pos and consumed from rx_rd_pos). This would imply that sss_exec_command(conn) is adjusting conn. Is that the case?
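The buffer management step can be isolated into a small sketch. The type rx_state, the size RX_BUF_SIZE and the function compact are hypothetical names standing in for the rx fields of SSSConn; the logic follows the "Manage buffer" section of the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RX_BUF_SIZE 16  /* assumed size, for illustration only */

typedef struct {
    unsigned char buf[RX_BUF_SIZE];
    unsigned char *rd;  /* position read up to */
    unsigned char *wr;  /* position written up to */
} rx_state;  /* hypothetical stand-in for the rx fields of SSSConn */

/* Move the unread bytes to the front of the buffer and return how many
   bytes were reclaimed (the data_used of the original code). */
size_t compact(rx_state *s)
{
    size_t used, left;
    assert(s != NULL && s->buf <= s->rd && s->rd <= s->wr);
    used = (size_t)(s->rd - s->buf);   /* bytes already consumed */
    left = (size_t)(s->wr - s->rd);    /* bytes still unread */
    memmove(s->buf, s->rd, left);      /* regions may overlap, hence memmove */
    s->rd = s->buf;
    s->wr -= used;
    memset(s->wr, 0, used);            /* clear the freed tail */
    return used;
}
```

If the consumer never advances rd, used is 0 and the call changes nothing, which is why the section looks like a no-op until you notice that sss_exec_command increments rx_rd_pos.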

C, Open MPI: segmentation fault from call to MPI_Finalize(). Segfault does not always happen, especially with low numbers of processes

I am writing a simple code to learn how to define an MPI_Datatype and use it in conjunction with MPI_Gatherv. I wanted to make sure I could combine variable length, dynamically allocated arrays of structured data on a process, which seems to be working fine, up until my call to MPI_Finalize(). I have confirmed that this is where the problem starts to manifest itself by using print statements and the Eclipse PTP debugger (backend is gdb-mi). My main question is, how can I get rid of the segmentation fault?
The segfault does not occur every time I run the code. For instance, it hasn't happened for 2 or 3 processes, but tends to happen regularly when I run with about 4 or more processes.
Also, when I run this code with valgrind, the segmentation fault does not occur. However, I do get error messages from valgrind, though the output is difficult for me to understand when I use MPI functions, even with a large number of targeted suppressions. I am also concerned that if I use more suppressions, I will silence a useful error message.
I compile the normal code using these flags, so I am using the C99 standard in both cases:
-ansi -pedantic -Wall -O2 -march=barcelona -fomit-frame-pointer -std=c99
and the debugged code with:
-ansi -pedantic -std=c99 -Wall -g
Both use the gcc 4.4 mpicc compiler, and are run on a cluster using Red Hat Linux with Open MPI v1.4.5. Please let me know if I have left out other important bits of information. Here is the code, and thanks in advance:
//#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//#include <limits.h>
#include "mpi.h"
#define FULL_PROGRAM 1
struct CD{
int int_ID;
double dbl_ID;
};
int main(int argc, char *argv[]) {
int numprocs, myid, ERRORCODE;
#if FULL_PROGRAM
struct CD *myData=NULL; //Each process contributes an array of data, comprised of 'struct CD' elements
struct CD *allData=NULL; //root will dynamically allocate this array to store all the data from rest of the processes
int *p_lens=NULL, *p_disp=NULL; //p_lens stores the number of elements in each process' array, p_disp stores the displacements in bytes
int MPI_CD_size; //stores the size of the MPI_Datatype that is defined to allow communication operations using 'struct CD' elements
int mylen, total_len=0; //mylen should be the length of each process' array
//MAXlen is the maximum allowable array length
//total_len will be the sum of mylen across all processes
// ============ variables related to defining new MPI_Datatype at runtime ====================================================
struct CD sampleCD = {.int_ID=0, .dbl_ID=0.0};
int blocklengths[2]; //this describes how many blocks of identical data types will be in the new MPI_Datatype
MPI_Aint offsets[2]; //this stores the offsets, in bytes(bits?), of the blocks from the 'start' of the datatype
MPI_Datatype block_types[2]; //this stores which built-in data types the blocks are comprised of
MPI_Datatype myMPI_CD; //just the name of the new datatype
MPI_Aint myStruct_address, int_ID_address, dbl_ID_address, int_offset, dbl_offset; //useful place holders for filling the arrays above
// ===========================================================================================================================
#endif
// =================== Initializing MPI functionality ============================
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
// ===============================================================================
#if FULL_PROGRAM
// ================== This part actually formally defines the MPI datatype ===============================================
MPI_Get_address(&sampleCD, &myStruct_address); //starting point of struct CD
MPI_Get_address(&sampleCD.int_ID, &int_ID_address); //starting point of first entry in CD
MPI_Get_address(&sampleCD.dbl_ID, &dbl_ID_address); //starting point of second entry in CD
int_offset = int_ID_address - myStruct_address; //offset from start of first to start of CD
dbl_offset = dbl_ID_address - myStruct_address; //offset from start of second to start of CD
blocklengths[0]=1; blocklengths[1]=1; //array telling it how many blocks of identical data types there are, and the number of entries in each block
//This says there are two blocks of identical data-types, and both blocks have only one variable in them
offsets[0]=int_offset; offsets[1]=dbl_offset; //the first block starts at int_offset, the second block starts at dbl_offset (from 'myData_address'
block_types[0]=MPI_INT; block_types[1]=MPI_DOUBLE; //the first block contains MPI_INT, the second contains MPI_DOUBLE
MPI_Type_create_struct(2, blocklengths, offsets, block_types, &myMPI_CD); //this uses the above arrays to define the MPI_Datatype...an MPI-2 function
MPI_Type_commit(&myMPI_CD); //this is the final step to defining/reserving the data type
// ========================================================================================================================
mylen = myid*2; //each process is told how long its array should be...I used to define that randomly but that just makes things messier
p_lens = (int*) calloc((size_t)numprocs, sizeof(int)); //allocate memory for the number of elements (p_lens) and offsets from the start of the recv buffer(d_disp)
p_disp = (int*) calloc((size_t)numprocs, sizeof(int));
myData = (struct CD*) calloc((size_t)mylen, sizeof(struct CD)); //allocate memory for each process' array
//if mylen==0, 'a unique pointer to the heap is returned'
if(!p_lens) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
if(!p_disp) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
if(!myData) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
for (int k=0; k<numprocs; ++k) {
if(myid==k) {
//printf("\t ID %d has %d entries: { ", myid, mylen);
for(int i=0; i<mylen; ++i) {
myData[i]= (struct CD) {.int_ID=myid*(i+1), .dbl_ID=myid*(i+1)}; //fills data elements with simple pattern
//printf("%d: (%d,%lg) ", i, myData[i].int_ID, myData[i].dbl_ID);
}
//printf("}\n");
}
}
for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
MPI_Gather(&mylen, 1, MPI_INT, p_lens, 1, MPI_INT, 0, MPI_COMM_WORLD); //Each process sends root the length of the vector they'll be sending
#if 1
MPI_Type_size(myMPI_CD, &MPI_CD_size); //gets the size of the MPI_Datatype for p_disp
#else
MPI_CD_size = sizeof(struct CD); //using this doesn't change things too much...
#endif
for(int j=0;j<numprocs;++j) {
total_len += p_lens[j];
if (j==0) { p_disp[j] = 0; }
else { p_disp[j] = p_disp[j-1] + p_lens[j]*MPI_CD_size; }
}
if (myid==0) {
allData = (struct CD*) calloc((size_t)total_len, sizeof(struct CD)); //allocate array
if(!allData) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
}
MPI_Gatherv(myData, mylen, myMPI_CD, allData, p_lens, p_disp, myMPI_CD, 0, MPI_COMM_WORLD); //each array sends root process their array, which is stored in 'allData'
// ============================== OUTPUT CONFIRMING THAT COMMUNICATIONS WERE SUCCESSFUL=========================================
if(myid==0) {
for(int i=0;i<numprocs;++i) {
printf("\n\tElements from %d on MASTER are: { ",i);
for(int k=0;k<p_lens[i];++k) { printf("%d: (%d,%lg) ", k, (allData+p_disp[i]+k)->int_ID, (allData+p_disp[i]+k)->dbl_ID); }
if(p_lens[i]==0) printf("NOTHING ");
printf("}\n");
}
printf("\n"); //each data element should appear as two identical numbers, counting upward by the process ID
}
// ==========================================================================================================
if (p_lens) { free(p_lens); p_lens=NULL; } //adding this in didn't get rid of the MPI_Finalize seg-fault
if (p_disp) { free(p_disp); p_disp=NULL; }
if (myData) { free(myData); myData=NULL; }
if (allData){ free(allData); allData=NULL; } //the if statement ensures that processes not allocating memory for this pointer don't free anything
for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
printf("ID %d: I have reached the end...before MPI_Type_free!\n", myid);
// ====================== CLEAN UP ================================================================================
ERRORCODE = MPI_Type_free(&myMPI_CD); //this frees the data type...not always necessary, but a good practice
for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
if(ERRORCODE!=MPI_SUCCESS) { printf("ID %d...MPI_Type_free was not successful\n", myid); MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
else { printf("ID %d...MPI_Type_free was successful, entering MPI_Finalize...\n", myid); }
#endif
ERRORCODE=MPI_Finalize();
for(double temp=0.0;temp<1e7;++temp) temp += exp(-10.0); //NO MPI_Barrier AFTER MPI_Finalize!
if(ERRORCODE!=MPI_SUCCESS) { printf("ID %d...MPI_Finalize was not successful\n", myid); MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
else { printf("ID %d...MPI_Finalize was successful\n", myid); }
return EXIT_SUCCESS;
}
The outer loop on k is bogus, but is not technically wrong -- it's just useless.
The real issue is that your displacements to MPI_GATHERV are wrong. If you run through valgrind, you'll see something like this:
==28749== Invalid write of size 2
==28749== at 0x4A086F4: memcpy (mc_replace_strmem.c:838)
==28749== by 0x4C69614: unpack_predefined_data (datatype_unpack.h:41)
==28749== by 0x4C6B336: ompi_generic_simple_unpack (datatype_unpack.c:418)
==28749== by 0x4C7288F: ompi_convertor_unpack (convertor.c:314)
==28749== by 0x8B295C7: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:216)
==28749== by 0x935723C: mca_btl_sm_component_progress (btl_sm_component.c:426)
==28749== by 0x51D4F79: opal_progress (opal_progress.c:207)
==28749== by 0x8B225CA: opal_condition_wait (condition.h:99)
==28749== by 0x8B22718: ompi_request_wait_completion (request.h:375)
==28749== by 0x8B231E1: mca_pml_ob1_recv (pml_ob1_irecv.c:104)
==28749== by 0x955E7A7: mca_coll_basic_gatherv_intra (coll_basic_gatherv.c:85)
==28749== by 0x9F7CBFA: mca_coll_sync_gatherv (coll_sync_gatherv.c:46)
==28749== Address 0x7b1d630 is not stack'd, malloc'd or (recently) free'd
Indicating that MPI_GATHERV was given bad information somehow.
(there are other valgrind warnings that come from libltdl inside Open MPI which are unfortunately unavoidable -- it's a bug in libltdl, and another from PLPA, which is also unfortunately unavoidable because it's intentionally doing that [for reasons that aren't interesting to discuss here])
Looking at your displacements computation, I see
total_len += p_lens[j];
if (j == 0) {
    p_disp[j] = 0;
} else {
    p_disp[j] = p_disp[j - 1] + p_lens[j] * MPI_CD_size;
}
But MPI_Gatherv displacements are in units of the receive datatype (multiples of its extent), not bytes. So it really should be:
p_disp[j] = total_len;
total_len += p_lens[j];
Making this change made the MPI_GATHERV valgrind warning go away for me.
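The corrected displacement computation can be checked without MPI. This is a minimal sketch; the function name fill_displacements is an assumption for illustration, and the arrays play the role of p_lens and p_disp from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Fill displs[] with element-based displacements suitable for an
   MPI_Gatherv-style call and return the total element count.
   counts[j] is the number of elements contributed by rank j. */
int fill_displacements(const int *counts, int *displs, int nprocs)
{
    int j, total = 0;
    assert(counts != NULL && displs != NULL && nprocs >= 0);
    for (j = 0; j < nprocs; ++j) {
        displs[j] = total;   /* rank j starts right after all earlier ranks */
        total += counts[j];
    }
    return total;
}
```

With mylen = myid*2 on four processes the counts are 0, 2, 4, 6, which gives displacements 0, 0, 2, 6 and a total of 12 elements for allData.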
This outer loop on 'k' is just bogus. Its body is only executed for k == myid (which is constant within every running process). k is never referenced inside the loop (except in the comparison with myid).
Also, the line with mylen = myid*2; is frowned upon. I suggest you change it to a constant.
for (int k=0; k<numprocs; ++k) {
if(myid==k) {
//printf("\t ID %d has %d entries: { ", myid, mylen);
for(int i=0; i<mylen; ++i) {
myData[i]= (struct CD) {.int_ID=myid*(i+1), .dbl_ID=myid*(i+1)}; //fills data elements with simple pattern
//printf("%d: (%d,%lg) ", i, myData[i].int_ID, myData[i].dbl_ID);
}
//printf("}\n");
}
}
So (given that myid is between 0 and numprocs - 1) this whole silly construct can be reduced to:
for(int i=0; i<mylen; ++i) {
    myData[i].int_ID=myid*(i+1);
    myData[i].dbl_ID=myid*(i+1);
}

How do I find the cache line size, and how much data is fetched when writing into an array?

I want to implement an optimized queue between threads. To increase performance, I want to use a pipelining technique by splitting the queue into parts.
I have a large queue for communication between two threads, one called the producer and the other called the consumer. With the queue split, once the producer has written one part of the queue, the consumer can read that part. And while the consumer is reading one part of the queue, the producer can write into the other part.
But I think that when the cache loads part of the array (the queue is backed by an array), the amount loaded may not be the same as the cache line size.
So I want to know how much data the cache brings in when array data is read or written.
If you're running on Linux, this information is sometimes listed in /proc/cpuinfo as cache_alignment.
You could also find this information indirectly by stepping through an array, adjusting your stride, and timing the loop. When accesses aren't block aligned you'll see the performance drop, so you can get a pretty good idea of what your block size is. Here's a quick and dirty version to basically do this, I think it'll give you a good idea:
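#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main () {
    int i, STEP_SIZE = 8; /* stride in ints; change this between runs */
    int * a;
    struct timeval t1, t2;
    double el;
    a = (int*)malloc(1024*1024*64*sizeof(int));
    if (a == NULL)
        return 1;
    /* touch every element first so the pages are actually mapped */
    for (i = 0; i < 1024*1024*64; i++)
        a[i] = 0;
    gettimeofday(&t1, NULL);
    for (i = 0; i < 1024*1024*64; i += STEP_SIZE)
        a[i] += 10;
    gettimeofday(&t2, NULL);
    /* elapsed time in milliseconds */
    el = (t2.tv_sec - t1.tv_sec) * 1000.0;
    el += (t2.tv_usec - t1.tv_usec) / 1000.0;
    printf("%d %3.2f\n", STEP_SIZE, el);
    free(a);
    return 0;
}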
Basically you would want to vary STEP_SIZE and compare the timings between runs.
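On Linux with glibc the line size can also be queried at run time, and the result can be used to size the queue segments so that producer and consumer never share a line. This is only a sketch under assumptions: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension that may be missing or report 0 on some systems, the fallback of 64 bytes is a guess that fits many current CPUs, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define FALLBACK_LINE_SIZE 64  /* assumption: used when the system reports nothing */

/* Return the L1 data cache line size in bytes, or the fallback
   if the system does not report a usable value. */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (n > 0)
        return (size_t)n;
#endif
    return FALLBACK_LINE_SIZE;
}

/* Round n up to the next multiple of line, e.g. to make each queue
   segment occupy whole cache lines. */
size_t round_up_to_line(size_t n, size_t line)
{
    assert(line > 0);
    return (n + line - 1) / line * line;
}
```

A segment size computed as round_up_to_line(segment_bytes, cache_line_size()) only helps if the array itself also starts on a line boundary, for example by allocating it with posix_memalign.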

How to read back a CUDA Texture for testing?

Ok, so far, I can create an array on the host computer (of type float), and copy it to the gpu, then bring it back to the host as another array (to test if the copy was successful by comparing to the original).
I then create a CUDA array from the array on the GPU. Then I bind that array to a CUDA texture.
I now want to read that texture back and compare with the original array (again to test that it copied correctly). I saw some sample code that uses the readTexels() function shown below. It doesn't seem to work for me... (basically everything works except for the section in the bindToTexture(float* deviceArray) function starting at the readTexels(SIZE, testArrayDevice) line).
Any suggestions of a different way to do this? Or are there some obvious problems I missed in my code?
Thanks for the help guys!
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#define SIZE 20;
//Create a channel description to use.
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
//Create a texture to use.
texture<float, 2, cudaReadModeElementType> cudaTexture;
//cudaTexture.filterMode = cudaFilterModeLinear;
//cudaTexture.normalized = false;
__global__ void readTexels(int amount, float *Array)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < amount)
{
float x = tex1D(cudaTexture, float(index));
Array[index] = x;
}
}
float* copyToGPU(float* hostArray, int size)
{
//Create pointers, one for the array to be on the device, and one for bringing it back to the host for testing.
float* deviceArray;
float* testArray;
//Allocate some memory for the two arrays so they don't get overwritten.
testArray = (float *)malloc(sizeof(float)*size);
//Allocate some memory for the array to be put onto the GPU device.
cudaMalloc((void **)&deviceArray, sizeof(float)*size);
//Actually copy the array from hostArray to deviceArray.
cudaMemcpy(deviceArray, hostArray, sizeof(float)*size, cudaMemcpyHostToDevice);
//Copy the deviceArray back to testArray in host memory for testing.
cudaMemcpy(testArray, deviceArray, sizeof(float)*size, cudaMemcpyDeviceToHost);
//Make sure contents of testArray match the original contents in hostArray.
for (int i = 0; i < size; i++)
{
if (hostArray[i] != testArray[i])
{
printf("Location [%d] does not match in hostArray and testArray.\n", i);
}
}
//Don't forget free these arrays after you're done!
free(testArray);
return deviceArray; //TODO: FREE THE DEVICE ARRAY VIA cudaFree(deviceArray);
}
cudaArray* bindToTexture(float* deviceArray)
{
//Create a CUDA array to translate deviceArray into.
cudaArray* cuArray;
//Allocate memory for the CUDA array.
cudaMallocArray(&cuArray, &cudaTexture.channelDesc, SIZE, 1);
//Copy the deviceArray into the CUDA array.
cudaMemcpyToArray(cuArray, 0, 0, deviceArray, sizeof(float)*SIZE, cudaMemcpyHostToDevice);
//Release the deviceArray
cudaFree(deviceArray);
//Bind the CUDA array to the texture.
cudaBindTextureToArray(cudaTexture, cuArray);
//Make a test array on the device and on the host to verify that the texture has been saved correctly.
float* testArrayDevice;
float* testArrayHost;
//Allocate memory for the two test arrays.
cudaMalloc((void **)&testArray, sizeof(float)*SIZE);
testArrayHost = (float *)malloc(sizeof(float)*SIZE);
//Read the texels of the texture to the test array in the device.
readTexels(SIZE, testArrayDevice);
//Copy the device test array to the host test array.
cudaMemcpy(testArrayHost, testArrayDevice, sizeof(float)*SIZE, cudaMemcpyDeviceToHost);
//Print contents of the array out.
for (int i = 0; i < SIZE; i++)
{
printf("%f\n", testArrayHost[i]);
}
//Free the memory for the test arrays.
free(testArrayHost);
cudaFree(testArrayDevice);
return cuArray; //TODO: UNBIND THE CUDA TEXTURE VIA cudaUnbindTexture(cudaTexture);
//TODO: FREE THE CUDA ARRAY VIA cudaFree(cuArray);
}
int main(void)
{
float* hostArray;
hostArray = (float *)malloc(sizeof(float)*SIZE);
for (int i = 0; i < SIZE; i++)
{
hostArray[i] = 10.f + i;
}
float* deviceAddy = copyToGPU(hostArray, SIZE);
free(hostArray);
return 0;
}
Briefly:
------------- in your main.cu -------------
1. Define the texture as a global variable
texture<unsigned int, 2> refTex; // global variable!
// meaning: address the texture with (x,y) (2D) and get an unsigned int
In the main function:
2. Use arrays combined with the texture
cudaArray* myArray; // declaration
// ask for memory
cudaMallocArray(&myArray,
&refTex.channelDesc, /* with this you don't need to fill a channel descriptor */
width,
height);
3. Copy data from CPU to GPU (to the array)
cudaMemcpyToArray(myArray, // destination: the array
0, 0, // offsets
sourceData, // pointer uint*
width*height*sizeof(uint), // total amount of bytes to be copied
cudaMemcpyHostToDevice);
4. Bind texture and array
cudaBindTextureToArray(refTex, myArray);
5. Change some parameters in the texture
refTex.normalized = false; // address texels with unnormalized coordinates instead of [0,1[
refTex.addressMode[0] = cudaAddressModeClamp; // if my indexing is out of bounds: automatically use a valid index (0 if the index is negative, the last one if it is too large)
refTex.addressMode[1] = cudaAddressModeClamp;
---------- in the kernel ----------
// find out the indexes (f,c) to be processed by this thread
uint f = (blockIdx.x * blockDim.x) + threadIdx.x;
uint c = (blockIdx.y * blockDim.y) + threadIdx.y;
// this is curious and necessary: indexes for reading from a texture
// are floats! Even if you are certain you want to access (4,5) you have
// to hit the "center" of the texel, that is (4.5, 5.5)
uint read = tex2D(refTex, c+0.5f, f+0.5f); // refTex is a global variable
Now you process the value read and write the results to another region of device global
memory, not to the texture itself!
readTexels() is a kernel (__global__) function, i.e. it runs on the GPU. Therefore you need to use the correct syntax to launch a kernel.
Take a look through the CUDA Programming Guide and some of the SDK samples, both available via the NVIDIA CUDA site to see how to launch a kernel.
Hint: It'll end up something like readTexels<<<grid,block>>>(...)
