I've got a problem with CUDA. I want to write a small program that counts occurrences of a letter in an array of char.
I read the letters from a file and store the number of letters read in an int variable called N. After that I malloc:
char *b_h, *b_d;
size_t size_char = N * sizeof(char);
b_h = (char *)malloc(size_char);
After the malloc I read the file again and assign each letter to an element of the char array (this works):
int j=0;
while(fscanf(file,"%c",&l)!=EOF)
{
b_h[j]=l;
j++;
}
After that I create an int variable (a_h) as counter.
int *a_h, *a_d;
size_t size_count = 1*sizeof(int);
a_h = (int *)malloc(size_count);
Ok, go with CUDA:
cudaMalloc((void **) &a_d, size_count);
cudaMalloc((void **) &b_d, size_char);
Copy from host to device:
cudaMemcpy(a_d, a_h, size_count, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size_char, cudaMemcpyHostToDevice);
Set blocks and call CUDA function:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d,b_d,c_d, N);
Receive from function:
cudaMemcpy(a_h, a_d, size_count, cudaMemcpyDeviceToHost);
cudaMemcpy(b_h, d_d, size_char, cudaMemcpyDeviceToHost);
And print count:
printf("\Count: %d\n", a_h[0]);
And it doesn't work. The char array holds the sentence "Super testSuper test"; I'm looking for the letter 'e', which occurs four times, but I get a_h[0] = 1.
Where is the problem?
CUDA function:
__global__ void square_array(int *a, char *b, int *c, int N)
{
const char* letter = "e";
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N)
{
if(b[idx] == *letter)
{
a[0]++;
}
}
}
Please, help me.
I'm guessing that N is small enough that your GPU is able to launch all your threads in parallel. So, you start a thread for each character in your array. The threads, all running simultaneously, don't see the output from each other. Instead, each thread reads the value of a[0] (which is 0), and increases it by 1 and stores the resulting value (1). If this is homework, that would have been the basic lesson that the professor wanted to impart.
When multiple threads store a value in the same location simultaneously, it is undefined which thread will get its value stored. In your case, that doesn't matter, because every thread that stores a value stores the same value, 1.
A typical solution would be to have each thread store a value of 0 or 1 in a separate location (depending on whether there is a match or not), and then add up the values in a separate step.
You can also use an atomic increment, such as atomicAdd(&a[0], 1).
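The two-step approach described above can be sketched in plain C, with an ordinary loop standing in for the GPU threads. This is only an illustration under that assumption; the names mark_matches and sum_flags are hypothetical and not from the original code. On the device, each iteration of the first loop would be one thread, and the second step would be a reduction (or would be replaced by an atomic add).

```c
#include <stddef.h>

/* Step 1: every "thread" idx writes only to its own slot flags[idx],
 * so no two writers ever touch the same location. */
void mark_matches(const char *text, size_t n, char letter, int *flags)
{
    for (size_t idx = 0; idx < n; ++idx)   /* one iteration == one thread */
        flags[idx] = (text[idx] == letter) ? 1 : 0;
}

/* Step 2: add up the per-thread results in a separate pass. */
int sum_flags(const int *flags, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i)
        total += flags[i];
    return total;
}
```

For the 20-character string "Super testSuper test" and the letter 'e', calling mark_matches and then sum_flags yields 4.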
I store my data in a char array, and I need to read float and int variables from there.
This code works fine on CPU:
global float *p;
p = (global float*)get_pointer_to_the_field(char_array, index);
*p += 10;
But on GPU I get the error -5: CL_OUT_OF_RESOURCES. The reading itself works, but doing something with the value (adding 10 in this case) causes the error. How could I fix it?
Update:
This works on GPU:
float f = *p;
f += 10;
However, I still can't write this value back to the array.
Here is the kernel:
global void write_value(global char *data, int tuple_pos, global char *field_value,
int which_field, global int offsets[], global int *num_of_attributes) {
int tuple_size = offsets[*num_of_attributes];
global char *offset = data + tuple_pos * tuple_size;
offset += offsets[which_field];
memcpy(offset, field_value, (offsets[which_field+1] - offsets[which_field]));
}
global char *read_value(global char *data, int tuple_pos,
int which_field, global int offsets[], global int *num_of_attributes) {
int tuple_size = offsets[*num_of_attributes];
global char *offset = data + tuple_pos * tuple_size;
offset += offsets[which_field];
return offset;
}
kernel void update_single_value(global char* input_data, global int* pos, global int offsets[],
global int *num_of_attributes, global char* types) {
int g_id = get_global_id(1);
int attr_id = get_global_id(0);
int index = pos[g_id];
if (types[attr_id] == 'f') { // if float
global float *p;
p = (global float*)read_value(input_data, index, attr_id, offsets, num_of_attributes);
float f = *p;
f += 10;
//*p += 10; // not working on GPU
}
else if (types[attr_id] == 'i') { // if int
global int *p;
p = (global int*)read_value(input_data, index, attr_id, offsets, num_of_attributes);
int i = *p;
i += 10;
//*p += 10;
}
else { // if char
write_value(input_data, index, read_value(input_data, index, attr_id, offsets, num_of_attributes), attr_id, offsets, num_of_attributes);
}
}
It updates the values of a table's tuples: int and float fields are increased by 10, and char fields are just replaced with the same content.
Are you enabling the byte_addressable_store extension? As far as I'm aware, bytewise writes to global memory aren't well-defined in OpenCL unless you enable this. (You'll need to check if the extension is supported by your implementation.)
You might also want to consider using the "correct" type in the kernel argument - this might help the compiler produce more efficient code. If the type can vary dynamically, you could perhaps try using a union type (or union fields in a struct type), although I haven't tested this with OpenCL myself.
It turned out that the problem occurs because the int and float values in the char array aren't 4-byte aligned. When I write to addresses like
offset = data + tuple_pos*4; // or 8, 16 etc
everything works fine. However, the following causes the error:
offset = data + tuple_pos*3; // or any other number not divisible by 4
This means that I should either change the whole design and store the values some other way, or add "empty" padding bytes to the char array so that the int and float values are 4-byte aligned (which isn't a really good solution).
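One way to keep the packed layout is to avoid dereferencing a misaligned float pointer at all and to move the bytes through an aligned local variable instead. The sketch below shows the idea in plain C using memcpy; the helper names add_to_packed_float and align_up are made up for this illustration. OpenCL C has no memcpy, so a kernel would copy the four bytes in a small loop instead, and whether that is fast enough on a given device is something to measure.

```c
#include <stddef.h>
#include <string.h>

/* Add delta to a float stored at an arbitrary (possibly misaligned)
 * byte offset. The bytes are copied into an aligned local, updated,
 * and copied back, so no misaligned float access ever happens. */
float add_to_packed_float(char *data, size_t offset, float delta)
{
    float f;
    memcpy(&f, data + offset, sizeof f);   /* byte-wise read  */
    f += delta;
    memcpy(data + offset, &f, sizeof f);   /* byte-wise write */
    return f;
}

/* Alternative: pad field offsets up to the next multiple of
 * alignment (a power of two, e.g. 4) when building the table. */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

With align_up, a field that would start at byte 3 is moved to byte 4, while a field already at byte 8 stays where it is.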
I'm trying to write the values from a string read from stdin directly into the array, but I get a segmentation fault. Since the array is declared after I read N and M, the memory should already be allocated, right?
int main()
{
long long N;
long long M;
scanf("%lld%lld",&N,&M);
char line[M];
long long map[N][M];
for (long long i=0; i<M; i++)
{
scanf("%s", &line);
buildMap(&map, i, &line);
}
for (long long i=0; i<N; i++)
for (long long j=0; j<M; j++)
printf(&map);
}
void buildMap(long long **map, long long i, char * line)
{
for (long long j=0; j<strlen(line); j++)
{
map[i][j] = line[j]-'0';
}
I have read your code, and I assume you are trying to build a 2D map from user input, where each line is a string (named "line" in your code) that should contain only the digits 0 to 9. The digits 0 to 9 may represent different elements of the map. Am I guessing right?
I copied and modified your code, and finally managed to get a result like this:
[program screenshot]
If I am guessing right, let me first explain why your code cannot be compiled successfully.
long long M; char line[M];
Here you use a variable to declare the size of an array. This is a variable-length array (VLA), which standard C has supported only since C99 and which is optional since C11; some compilers, such as MSVC, reject it. Such a compiler must know at compile time exactly how much stack space to allocate for each function (main() in your case). Since it does not know how large the array is when it compiles your code, compilation fails. Even where VLAs are accepted, the array lives on the stack, so a large N or M can overflow it at run time.
One common solution is to store the array on the heap instead of the stack, because heap memory is allocated and released dynamically while the program is running. In other words, you can decide how much memory to allocate after you get the user input. The functions malloc() and free() are used for this kind of operation.
Another problem is the use of "long long **map". It may not cause a compile failure, but it won't give you the expected result either, because a 2D array is not an array of pointers. When the array width M is a known constant, we usually prefer "long long map[][M]" as the parameter. In your case, with M unknown at compile time, a common solution is to compute the target location manually, since the elements of an array are always stored linearly in memory, regardless of the array's dimensions.
I have fixed the two problems above, and I am pasting the modified source code below, which compiles successfully:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void buildMap(int *map, int i, char * line);
int main()
{
int N;
int M;
scanf("%d%d", &N, &M);
/*Since M (available memory space for "Line") is set by user, we need to build
"szSafeFormat" to restrict the user's input when typing the "Line". Assuming M
is set to 8, then "szSafeFormat" will look like "%7s". With the help of
"szSafeFormat", the scanf function will be scanf("%7s", Line), ignoring
characters after offset 7.*/
char szSafeFormat[256] = { 0 };
sprintf(szSafeFormat, "%%%ds", M - 1);
//char line[M];
char *Line = (char *)malloc(sizeof(char) * M); //raw user input
char *pszValidInput = (char *)malloc(sizeof(char) * M); //pure numbers
//long long map[N][M];
int *pnMap = (int *)malloc(sizeof(int) * M * N);
memset(pnMap, 0xFF, M * N * sizeof(int)); //initialize the Map with 0xFF
for (int i = 0; i < /*M*/N; i++)
{
scanf(szSafeFormat, Line); //get raw user input
sscanf(Line, "%[0-9]", pszValidInput); //only accept the numbers
int ch;
while ((ch = getchar()) != '\n' && ch != EOF); //empty the stdin buffer, stopping at EOF
buildMap((int *)(pnMap + i * M), i, pszValidInput);
}
printf("\r\n\r\n");
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
//if the memory content is not 0xFF (means it's a valid value), then print
if (*(pnMap + i * M + j) != 0xFFFFFFFF)
{
printf("%d", *(pnMap + i * M + j));
}
}
printf("\r\n");
}
free(Line);
free(pszValidInput);
free(pnMap);
return 0;
}
void buildMap(int *map, int i, char * line)
{
for (int j = 0; j < strlen(line); j++)
{
map[j] = line[j] - '0'; //map already points at the start of row i
}
}
I used type "int" instead of "long long", but there should not be any problem if you prefer to keep using "long long". In that case, the condition used when printing the array values should be changed from:
if (*(pnMap + i * M + j) != 0xFFFFFFFF)
to
if (*(pnMap + i * M + j) != 0xFFFFFFFFFFFFFFFF)
There are also some other modifications for user input validation, which I have explained in additional comments in the code.
Remember that C supports variable-length arrays (something which you already use). That means you can actually pass the dimensions as arguments to the function and use them in the declaration of the array argument. Perhaps something like
void buildMap(const size_t N, const size_t M, long long map[N][M], long long i, char * line) { ... }
Call like
buildMap(N, M, map, i, line);
Note that I have changed the type of N and M to size_t, which is the correct type to use for variable-length array dimensions. You should update the variable declarations accordingly, as well as use "%zu" in the scanf format string.
Note that in the call to buildMap I don't use the address-of operator for the arrays. That's because arrays naturally decay to pointers to their first element. Passing e.g. &line is semantically incorrect, as it would pass something of type char (*)[M] to the function, not a char *.
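A minimal sketch of that variable-length-array version of buildMap, assuming a C99 (or later) compiler with VLA support. The bounds checks and the size_t type for i are additions made for this illustration and are not part of the original code.

```c
#include <stddef.h>
#include <string.h>

/* N and M come before map so they can be used in its declaration.
 * Digits from line fill row i; characters beyond column M are ignored. */
void buildMap(size_t N, size_t M, long long map[N][M], size_t i, const char *line)
{
    size_t len = strlen(line);
    if (i >= N)
        return;                      /* row index out of range */
    for (size_t j = 0; j < len && j < M; j++)
        map[i][j] = line[j] - '0';   /* convert digit character to value */
}
```

The call site then passes the arrays without the address-of operator, as in buildMap(N, M, map, i, line);.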
I just learned GPU programming and now I have a task to find the minimum value of a 100x100 matrix in parallel with CUDA. I have tried this code, but instead of the answer it prints my initial value, hmin = 9999999. Can anyone give me the right code? The code is in C.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define size (100*100)
//Kernel Functions & Variable
__global__ void FindMin(int* mat[100][100],int* kmin){
int b=blockIdx.x+threadIdx.x*blockDim.x;
int k=blockIdx.y+threadIdx.y*blockDim.y;
if(mat[b][k] < kmin){
kmin = mat[b][k];
}
}
int main(int argc, char *argv[]) {
//Declare Variabel
int i,j,hmaks=0,hmin=9999999,hsumin,hsumax; //Host Variable
int *da[100][100],*dmin,*dmaks,*dsumin,*dsumax; // Device Variable
FILE *baca; //for opening txt file
char buf[4]; //used for fscanf
int ha[100][100],b; //matrix shall be filled by "b"
//1: Read txt File
baca=fopen("MatrixTubes1.txt","r");
if (!baca){
printf("Hey, it's not even exist"); //Checking File, is it there?
}
i=0;j=0; //Matrix index initialization
if(!feof(baca)){ //if not end of file then do
for(i = 0; i < 100; i++){
for(j = 0; j < 100; j++){
fscanf(baca,"%s",buf); //read max 4 char
b=atoi(buf); //parsing from string to integer
ha[i][j]=b; //save it to my matrix
}
}
}
fclose(baca);
//all file has been read
//time to close the file
//Sesi 2: Allocation data di GPU
cudaMalloc((void **)&da, size*sizeof(int));
cudaMalloc((void **)&dmin, sizeof(int));
cudaMalloc((void **)&dmaks, sizeof(int));
cudaMalloc((void **)&dsumin, sizeof(int));
cudaMalloc((void **)&dsumax, sizeof(int));
//Sesi 3: Copy data to Device
cudaMemcpy(da, &ha, size*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dmin, &hmin, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dmaks, &hmaks, sizeof(int), cudaMemcpyHostToDevice);
//Sesi 4: Call Kernel
FindMin<<<100,100,1>>>(da,dmin);
//5: Copy from Device to Host
cudaMemcpy(&hmin, dmin, sizeof(int), cudaMemcpyDeviceToHost);
//6: Print that value
printf("Minimum Value = %i \n",hmin);
system("pause"); return 0;
}
this is my result
Minimum Value = 9999999
Press any key to continue . . .
I saw a few issues in your code.
As mentioned in the comments from MayurK, you got the indexing wrong.
Also as MayurK said, you are comparing two pointers and not the values they point to.
Your kernel launch, FindMin<<<100,100,1>>>, asks for a one-dimensional grid of 100 blocks with 100 threads each; the third launch parameter is the dynamic shared memory size in bytes, not a third dimension. Because the launch is one-dimensional, blockIdx.y and threadIdx.y are always zero, so k is always 0, while b = blockIdx.x + threadIdx.x*blockDim.x runs up to 9999 and indexes far past the end of the matrix. To get 2D indices, pass dim3 values for the grid and block sizes.
Finally, all threads run in parallel, resulting in a race condition in kmin = mat[b][k] (which should be *kmin, by the way). Once you fix the indexing problem, many threads will write to the same location in global memory at the same time. You should use atomicMin() or a parallel reduction to find the minimum value in parallel.
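The parallel-reduction idea can be sketched in plain C, with the inner loop standing in for the threads that would run concurrently in one step on the GPU. This is an illustration of the access pattern only, not CUDA code, and tree_min is a hypothetical name. In each step, "thread" t compares element t with element t + half, so no two threads write the same location, and the number of live elements halves until the minimum is left in data[0]. On the device, a barrier such as __syncthreads() would be needed between steps.

```c
#include <limits.h>
#include <stddef.h>

/* In-place tree reduction: after the call, data[0] holds the minimum.
 * The contents of data are overwritten. Returns INT_MAX for n == 0. */
int tree_min(int *data, size_t n)
{
    if (n == 0)
        return INT_MAX;
    while (n > 1) {
        size_t half = (n + 1) / 2;               /* elements kept after this step */
        for (size_t t = 0; t + half < n; ++t) {  /* one iteration == one thread   */
            if (data[t + half] < data[t])
                data[t] = data[t + half];
        }
        n = half;
    }
    return data[0];
}
```

For the 100x100 matrix, the same function could be applied to the flat array of 10000 values; on a GPU each block would typically reduce its own chunk first and a final pass would combine the per-block results.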
I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix of rows and items (so in my example of 10 rows with 30 data points, it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch);. I think the problem is that I am declaring the array parameter of my function as int *, but how else can I declare it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same thing to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas, or suggestions of where to look?
Edit:
To test this, I just want to add the value of each row together(I copied the logic from the cuda by example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy a host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to declare a static 2D array for device memory. That creates a static array in host memory, which you then try to re-create as a device array. Keep in mind that it must be a one-dimensional array, too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, size_t pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int); // needs to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
To access element (x, y) of your 2D array in the kernel, use the pattern (int*)((char*)base_addr + y * pitch_d) + x.
WARNING: the pitch is always in bytes, so you need to cast your pointer to char* before adding the row offset.
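The byte-based addressing can be tried out on the host without a GPU. The sketch below is plain C, with an ordinary int array playing the role of the pitched allocation (three used columns per row plus one padding element); pitched_elem and row_sum are hypothetical helper names used only for this illustration.

```c
#include <stddef.h>

/* Return a pointer to element (row, col) of a pitched 2D int array.
 * pitch is the distance between row starts in BYTES, so the row offset
 * is applied to a char pointer before casting back to int*. */
int *pitched_elem(void *base, size_t pitch, size_t row, size_t col)
{
    char *row_start = (char *)base + row * pitch;
    return (int *)row_start + col;
}

/* Sum one row, as each block does in the kernel above. */
int row_sum(void *base, size_t pitch, size_t row, size_t cols)
{
    int total = 0;
    for (size_t c = 0; c < cols; ++c)
        total += *pitched_elem(base, pitch, row, c);
    return total;
}
```

With a backing array {1, 2, 3, pad, 4, 5, 6, pad} and a pitch of 4 * sizeof(int), row 0 sums to 6 and row 1 sums to 15; using the row width of 3 elements as the pitch instead would read the padding and give wrong results.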
/**
* BLOCK_LOW
* Returns the offset of a local array
* with regards to block decomposition
* of a global array.
*
* #param (int) process rank
* #param (int) total number of processes
* #param (int) size of global array
* #return (int) offset of local array in global array
*/
#define BLOCK_LOW(id, p, n) ((id)*(n)/(p))
/**
* BLOCK_HIGH
* Returns the index immediately after the
* end of a local array with regards to
* block decomposition of a global array.
*
* #param (int) process rank
* #param (int) total number of processes
* #param (int) size of global array
* #return (int) offset after end of local array
*/
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id)+1, (p), (n)))
/**
* BLOCK_SIZE
* Returns the size of a local array
* with regards to block decomposition
* of a global array.
*
* #param (int) process rank
* #param (int) total number of processes
* #param (int) size of global array
* #return (int) size of local array
*/
#define BLOCK_SIZE(id, p, n) ((BLOCK_HIGH((id), (p), (n))) - (BLOCK_LOW((id), (p), (n))))
/**
* BLOCK_OWNER
* Returns the rank of the process that
* handles a certain local array with
* regards to block decomposition of a
* global array.
*
* #param (int) index in global array
* #param (int) total number of processes
* #param (int) size of global array
* #return (int) rank of process that handles index
*/
#define BLOCK_OWNER(i, p, n) (((p)*((i)+1)-1)/(n))
/*Matricefilenames:
small matrix A.bin of dimension 100 × 50
small matrix B.bin of dimension 50 × 100
large matrix A.bin of dimension 1000 × 500
large matrix B.bin of dimension 500 × 1000
An MPI program should be implemented such that it can
• accept two file names at run-time,
• let process 0 read the A and B matrices from the two data files,
• let process 0 distribute the pieces of A and B to all the other processes,
• involve all the processes to carry out the chosen parallel algorithm
for matrix multiplication C = A * B ,
• let process 0 gather, from all the other processes, the different pieces
of C ,
• let process 0 write out the entire C matrix to a data file.
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "mpi-utils.c"
void read_matrix_binaryformat (char*, double***, int*, int*);
void write_matrix_binaryformat (char*, double**, int, int);
void create_matrix (double***,int,int);
void matrix_multiplication (double ***, double ***, double ***,int,int, int);
int main(int argc, char *argv[]) {
int id,p; // Process rank and total amount of processes
int rowsA, colsA, rowsB, colsB; // Matrix dimensions
double **A; // Matrix A
double **B; // Matrix B
double **C; // Result matrix C : AB
int local_rows; // Local row dimension of the matrix A
double **local_A; // The local A matrix
double **local_C; // The local C matrix
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &id);
MPI_Comm_size (MPI_COMM_WORLD, &p);
if(argc != 3) {
if(id == 0) {
printf("Usage:\n>> %s matrix_A matrix_B\n",argv[0]);
}
MPI_Finalize();
exit(1);
}
if (id == 0) {
read_matrix_binaryformat (argv[1], &A, &rowsA, &colsA);
read_matrix_binaryformat (argv[2], &B, &rowsB, &colsB);
}
if (p == 1) {
create_matrix(&C,rowsA,colsB);
matrix_multiplication (&A,&B,&C,rowsA,colsB,colsA);
char* filename = "matrix_C.bin";
write_matrix_binaryformat (filename, C, rowsA, colsB);
free(A);
free(B);
free(C);
MPI_Finalize();
return 0;
}
// For this assignment we have chosen to bcast the whole matrix B:
MPI_Bcast (&B, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast (&colsA, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&colsB, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&rowsA, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&rowsB, 1, MPI_INT, 0, MPI_COMM_WORLD);
local_rows = BLOCK_SIZE(id, p, rowsA);
/* SCATTER VALUES */
int *proc_elements = (int*)malloc(p*sizeof(int)); // amount of elements for each processor
int *displace = (int*)malloc(p*sizeof(int)); // displacement of elements for each processor
int i;
for (i = 0; i<p; i++) {
proc_elements[i] = BLOCK_SIZE(i, p, rowsA)*colsA;
displace[i] = BLOCK_LOW(i, p, rowsA)*colsA;
}
create_matrix(&local_A,local_rows,colsA);
MPI_Scatterv(&A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,&local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
/* END SCATTER VALUES */
create_matrix (&local_C,local_rows,colsB);
matrix_multiplication (&local_A,&B,&local_C,local_rows,colsB,colsA);
/* GATHER VALUES */
MPI_Gatherv(&local_C[0], rowsA*colsB, MPI_DOUBLE,&C[0],
&proc_elements[0],&displace[0],MPI_DOUBLE,0, MPI_COMM_WORLD);
/* END GATHER VALUES */
char* filename = "matrix_C.bin";
write_matrix_binaryformat (filename, C, rowsA, colsB);
free (proc_elements);
free (displace);
free (local_A);
free (local_C);
free (A);
free (B);
free (C);
MPI_Finalize ();
return 0;
}
void create_matrix (double ***C,int rows,int cols) {
*C = (double**)malloc(rows*sizeof(double*));
(*C)[0] = (double*)malloc(rows*cols*sizeof(double));
int i;
for (i=1; i<rows; i++)
(*C)[i] = (*C)[i-1] + cols;
}
void matrix_multiplication (double ***A, double ***B, double ***C, int rowsC,int colsC,int colsA) {
double sum;
int i,j,k;
for (i = 0; i < rowsC; i++) {
for (j = 0; j < colsC; j++) {
sum = 0.0;
for (k = 0; k < colsA; k++) {
sum = sum + (*A)[i][k]*(*B)[k][j];
}
(*C)[i][j] = sum;
}
}
}
/* Reads a 2D array from a binary file*/
void read_matrix_binaryformat (char* filename, double*** matrix, int* num_rows, int* num_cols) {
int i;
FILE* fp = fopen (filename,"rb");
fread (num_rows, sizeof(int), 1, fp);
fread (num_cols, sizeof(int), 1, fp);
/* storage allocation of the matrix */
*matrix = (double**)malloc((*num_rows)*sizeof(double*));
(*matrix)[0] = (double*)malloc((*num_rows)*(*num_cols)*sizeof(double));
for (i=1; i<(*num_rows); i++)
(*matrix)[i] = (*matrix)[i-1]+(*num_cols);
/* read in the entire matrix */
fread ((*matrix)[0], sizeof(double), (*num_rows)*(*num_cols), fp);
fclose (fp);
}
/* Writes a 2D array in a binary file */
void write_matrix_binaryformat (char* filename, double** matrix, int num_rows, int num_cols) {
FILE *fp = fopen (filename,"wb");
fwrite (&num_rows, sizeof(int), 1, fp);
fwrite (&num_cols, sizeof(int), 1, fp);
fwrite (matrix[0], sizeof(double), num_rows*num_cols, fp);
fclose (fp);
}
My task is to do a parallel matrix multiplication of matrix A and B and gather the results in matrix C.
I am doing this by dividing matrix A in rowwise pieces and each process is going to use its piece to multiply matrix B, and get back its piece from the multiplication. Then I am going to gather all the pieces from the processes and put them together to matrix C.
I already posted a similar question. This code is improved and I have made progress, but I am still getting a segmentation fault after the scatterv call.
So I see a few problems right away:
MPI_Bcast (&B, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
Here, you're passing not a pointer to doubles, but a pointer to a pointer to a pointer to a double (B is defined as double **B) and you're telling MPI to follow that pointer and send 1 double from there. That is not going to work.
You might think that what you're accomplishing here is sending the pointer to the matrix, from which all tasks can read the array -- that doesn't work. The processes don't share a common memory space (that's why MPI is called distributed memory programming) and the pointer doesn't go anywhere. You're actually going to have to send the contents of the matrix,
MPI_Bcast (&(B[0][0]), rowsB*colsB, MPI_DOUBLE, 0, MPI_COMM_WORLD);
and you're going to have to make sure the other processes have correctly allocated memory for the B matrix ahead of time.
There are similar pointer problems elsewhere:
MPI_Scatterv(&A[0], ..., &local_A[0]
Again, A is a pointer to a pointer to doubles (double **A), as is local_A, and you need to give MPI a pointer to doubles for this to work, something like
MPI_Scatterv(&(A[0][0]), ..., &(local_A[0][0])
that error seems to be present in all the communications routines.
Remember that anything that looks like (buffer, count, TYPE) in MPI means that the MPI routines follow the pointer buffer and send the next count pieces of data of type TYPE there. MPI can't follow pointers within the buffer you sent because in general it doesn't know they're there. It just takes the next (count * sizeof(TYPE)) bytes from pointer buffer and does whatever communication is appropriate with them. So you have to pass it a pointer to a stream of data of type TYPE.
Having said all that, it would be a lot easier to work with you on this if you had narrowed things down a bit; right now the program you've posted includes a lot of I/O stuff that's irrelevant, and it means that no one can just run your program to see what happens without first figuring out the matrix format and then generating two matrices on their own. When posting a question about source code, you really want to post a (a) small bit of source which (b) reproduces the problem and (c) is completely self-contained.
Consider this an extended comment, as Jonathan Dursi has already given a fairly elaborate answer. Your matrices are represented in a rather unusual way, but at least you followed the advice given to your other question and allocate space for them as contiguous blocks and not separately for each row.
Given that, you should replace:
MPI_Scatterv(&A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,&local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
with
MPI_Scatterv(A[0],&proc_elements[0],&displace[0],MPI_DOUBLE,local_A[0],
local_rows*colsA,MPI_DOUBLE,0,MPI_COMM_WORLD);
A[0] already points to the beginning of the matrix data and there is no need to make a pointer to it. The same goes for local_A[0] as well as for the parameters to the MPI_Gatherv() call.
It has been said many times already - MPI doesn't do pointer chasing and only works with flat buffers.
I've also noticed another mistake in your code - memory for your matrices is not freed correctly. You are only freeing the array of pointers and not the matrix data itself:
free(A);
should really become
free(A[0]); free(A);
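Putting the allocation, the flat-buffer pointer, and the matching clean-up side by side may help. The sketch below is plain C without any MPI calls; alloc_matrix, matrix_data and free_matrix are hypothetical helper names, not functions from the question's code.

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix as one contiguous data block plus an
 * array of row pointers into it. Returns NULL on failure. */
double **alloc_matrix(int rows, int cols)
{
    double **m = malloc((size_t)rows * sizeof *m);
    if (m == NULL)
        return NULL;
    m[0] = malloc((size_t)rows * (size_t)cols * sizeof **m);
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }
    for (int i = 1; i < rows; i++)
        m[i] = m[i - 1] + cols;      /* each row starts cols doubles later */
    return m;
}

/* m[0] is the flat buffer that an MPI routine would be given. */
double *matrix_data(double **m)
{
    return m[0];
}

/* Free the data block first, then the row-pointer array. */
void free_matrix(double **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

Every rank that receives matrix data would need its own alloc_matrix call before the communication, and would pass matrix_data(m) (that is, m[0]) as the buffer argument.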