Transferring data from 2d Dynamic array in C to CUDA and back

I have a dynamically declared 2D array in my C program, the contents of which I want to transfer to a CUDA kernel for further processing. Once processed, I want to populate the dynamically declared 2D array in my C code with the CUDA processed data. I am able to do this with static 2D C arrays but not with dynamically declared C arrays. Any inputs would be welcome!
I mean the dynamic array of dynamic arrays. The test code that I have written is as below.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
const int nItt = 10;
const int nP = 5;
__device__ int d_nItt = 10;
__device__ int d_nP = 5;
__global__ void arr_chk(float *d_x_k, float *d_w_k, int row_num)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
int index1 = (row_num * d_nP) + index;
if ( (index1 >= row_num * d_nP) && (index1 < ((row_num +1)*d_nP))) //Modifying only one row data pertaining to one particular iteration
{
d_x_k[index1] = row_num * d_nP;
d_w_k[index1] = index;
}
}
float **mat_create2(int r, int c)
{
float **dynamicArray;
dynamicArray = (float **) malloc (sizeof (float *)*r); // array of row pointers, so each element is a float *
for(int i=0; i<r; i++)
{
dynamicArray[i] = (float *) malloc (sizeof (float)*c);
for(int j= 0; j<c;j++)
{
dynamicArray[i][j] = 0;
}
}
return dynamicArray;
}
/* Freeing memory - here only number of rows are passed*/
void cleanup2d(float **mat_arr, int x)
{
int i;
for(i=0; i<x; i++)
{
free(mat_arr[i]);
}
free(mat_arr);
}
int main()
{
//float w_k[nItt][nP]; //Static array declaration - works!
//float x_k[nItt][nP];
// if I uncomment this dynamic declaration and comment the static one, it does not work.....
float **w_k = mat_create2(nItt,nP);
float **x_k = mat_create2(nItt,nP);
float *d_w_k, *d_x_k; // Device variables for w_k and x_k
int nblocks, blocksize, nthreads;
for(int i=0;i<nItt;i++)
{
for(int j=0;j<nP;j++)
{
x_k[i][j] = (nP*i);
w_k[i][j] = j;
}
}
for(int i=0;i<nItt;i++)
{
for(int j=0;j<nP;j++)
{
printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
}
}
int size1 = nItt * nP * sizeof(float);
printf("\nThe array size in memory bytes is: %d\n",size1);
cudaMalloc( (void**)&d_x_k, size1 );
cudaMalloc( (void**)&d_w_k, size1 );
if((nP*nItt)<32)
{
blocksize = nP*nItt;
nblocks = 1;
}
else
{
blocksize = 32; // Defines the number of threads running per block. Taken equal to warp size
nthreads = blocksize;
nblocks = ceil(float(nP*nItt) / nthreads); // Calculated total number of blocks thus required
}
for(int i = 0; i< nItt; i++)
{
cudaMemcpy( d_x_k, x_k, size1,cudaMemcpyHostToDevice ); //copy of x_k to device
cudaMemcpy( d_w_k, w_k, size1,cudaMemcpyHostToDevice ); //copy of w_k to device
arr_chk<<<nblocks, blocksize>>>(d_x_k,d_w_k,i);
cudaMemcpy( x_k, d_x_k, size1, cudaMemcpyDeviceToHost );
cudaMemcpy( w_k, d_w_k, size1, cudaMemcpyDeviceToHost );
}
printf("\nVerification after return from gpu\n");
for(int i = 0; i<nItt; i++)
{
for(int j=0;j<nP;j++)
{
printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
}
}
cudaFree( d_x_k );
cudaFree( d_w_k );
cleanup2d(x_k,nItt);
cleanup2d(w_k,nItt);
getch();
return 0;
}

I mean the dynamic array of dynamic arrays.
Well, that's exactly where the problem lies. A dynamic array of dynamic arrays consists of a whole bunch of disjoint memory blocks, one for each row of the array (as is clearly seen from the malloc inside your for loop in mat_create2). So you can't copy such a data structure to device memory with just one call to cudaMemcpy*. Instead, you have to do one of the following:
Also use dynamic arrays of dynamic arrays on CUDA. To do this, you basically have to recreate your mat_create2 function, using cudaMalloc instead of malloc, then copy each row separately.
Use a "tight" 2d array on CUDA, like you do now (which is a good thing, at least performance-wise!). But if you keep using dyn-dyn-arrays in host memory, you still have to copy each row separately, like
for(int i=0; i<r; ++i){
cudaMemcpy(d_x_k + i*c, x_k[i], c*sizeof(float), cudaMemcpyHostToDevice);
}
You may wonder "why did it work with a static 2d array, then"? Well, static 2d arrays in C are proper, tight arrays that can be copied in one go. It's a bit confusing that these are indexed with exactly the same syntax as dyn-dyn arrays (arr[x][y]), because the indexing actually works completely differently.
But you should consider using tight arrays on host memory, too, perhaps with an object-oriented wrapper like
typedef struct {
float* data;
int n_rows, n_cols;
} tight2dFloatArray;
#define INDEX_TIGHT2DARRAY(arr, y, x)\
(arr).data[(y)*(arr).n_cols + (x)]
Such an approach can of course be implemented much more safely as a C++ class.
*You also can't copy it inside main memory with just one memcpy: that only copies the array of pointers, not the actual data.
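The row-by-row copy described above can be tried out on the host alone. The following is a minimal sketch in plain C, assuming a rectangular jagged array; memcpy stands in for cudaMemcpy, and the names flatten_rows and unflatten_rows are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy a jagged rows x cols array (an array of row pointers) into one
 * contiguous buffer, one row at a time. With CUDA, each memcpy would be
 * a cudaMemcpy(..., cudaMemcpyHostToDevice) into the device buffer. */
float *flatten_rows(float **jagged, size_t rows, size_t cols)
{
    assert(jagged != NULL);
    float *flat = malloc(rows * cols * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t i = 0; i < rows; ++i)
        memcpy(flat + i * cols, jagged[i], cols * sizeof *flat);
    return flat;
}

/* Copy the contiguous buffer back into the jagged array, row by row
 * (the counterpart of the device-to-host direction). */
void unflatten_rows(float **jagged, const float *flat, size_t rows, size_t cols)
{
    assert(jagged != NULL && flat != NULL);
    for (size_t i = 0; i < rows; ++i)
        memcpy(jagged[i], flat + i * cols, cols * sizeof *flat);
}
```

Element (i, j) of the jagged array ends up at flat[i * cols + j], which is the same layout the kernel in the question indexes with row_num * d_nP + index.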

Related

How to operate matrices of different size with one function in C?

I have code from MATLAB, where all matrix operations are written with a couple of symbols. While translating it into C I ran into the problem that I have to create a special function for every matrix size. It's a lot of code, so I will not post it all here but will try to explain how it works.
I also have a big loop in which a lot of matrix operations take place. The functions that operate on matrices should take matrices as input and store the results in temporary matrices for upcoming operations. In fact I know the sizes of the matrices, but I also want to make the functions as universal as possible, in order to reduce code size and save time.
For example, the transposition of 2x4 and 4x4 matrices:
void A_matrix_transposition (float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix);
void B_matrix_transposition (float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix);
int main() {
float transposed_matrix_A[4][2]; //temporary matrices
float transposed_matrix_B[4][4];
float input_matrix_A[2][4], input_matrix_B[4][4]; //input matrices with numbers
A_matrix_transposition (transposed_matrix_A, input_matrix_A, 2, 4);
B_matrix_transposition (transposed_matrix_B, input_matrix_B, 4, 4);
// after calling the functions i want to use temporary matrices again. How do I pass them to other functions if i dont know their size, in general?
}
void A_matrix_transposition (float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix)
{ static int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{ transposed_matrix[j][i] = matrix[i][j];
}
}
}
void B_matrix_transposition (float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix)
{ static int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{ transposed_matrix[j][i] = matrix[i][j];
}
}
}
The operation is simple, but the code is already massive because of the 2 different functions, and it will be a slow disaster if I continue like this.
How do I create one transposition function that operates on matrices of different sizes?
I suppose it can be done with pointers, but I don't know how.
I'm looking for a really general answer to understand how to set up the "communication" between functions and temporary matrices, ideally with an example. Thank you all in advance for the information and help.

There are different ways to achieve this in C, ranging from not-so-good to good solutions.
If you know the maximum size of the matrices, you can create a matrix big enough to accommodate that size and work on it. If the actual matrix is smaller, no problem: write the operations so that they only consider that smaller sub-matrix rather than the whole one.
Another solution is to create a data structure that holds the matrix. The matrix itself can be a jagged array (an array of row pointers) created from attributes stored in the structure, for example the number of rows and columns. A jagged array lets you allocate and deallocate the memory yourself, giving you better control over the shape of the matrices. This is better because you can now pass two matrices of different sizes, and the functions only see the structure that wraps the actual matrix and work on it.
By structure I mean something like
struct matrix{
int ** mat;
int row;
int col;
};

If your C implementation supports variable length arrays, then you can accomplish this with:
void matrix_transposition(size_t M, size_t N,
float Destination[M][N], const float Source[N][M])
{
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m][n] = Source[n][m];
}
If your C implementation does not support variable length arrays, but does allow pointers to arrays to be cast to pointers to elements and used to access a two-dimensional array as if it were one-dimensional (this is not standard C but may be supported by a compiler), you can use:
void matrix_transposition(size_t M, size_t N,
float *Destination, const float *Source)
{
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m*N+n] = Source[n*M+m];
}
The above requires the caller to cast the arguments to float *. We can make it more convenient for the caller with:
void matrix_transposition(size_t M, size_t N,
void *DestinationPointer, const void *SourcePointer)
{
float *Destination = DestinationPointer;
const float *Source = SourcePointer;
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m*N+n] = Source[n*M+m];
}
(Unfortunately, this prevents the compiler from checking that the argument types match the intended types, but this is a shortcoming of C.)
If you need a solution strictly in standard C without variable length arrays, then, technically, the proper way is to copy the bytes of the objects:
void matrix_transposition(size_t M, size_t N,
void *DestinationPointer, const void *SourcePointer)
{
char *Destination = DestinationPointer;
const char *Source = SourcePointer;
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
{
// Calculate locations of elements in memory.
char *D = Destination + (m*N+n) * sizeof(float);
const char *S = Source + (n*M+m) * sizeof(float);
memcpy(D, S, sizeof(float));
}
}
Notes:
Include <stdlib.h> to declare size_t and, if using the last solution, include <string.h> to declare memcpy.
Variable length arrays were required in C 1999 but made optional in C 2011. Good quality compilers for general purpose systems will support them.

If you are using a C99 compiler, you can make use of variable length arrays (VLAs), which are optional in C11. You can write a function like this:
void matrix_transposition (int rows_in_matrix, int columnes_in_matrix, float transposed_matrix[columnes_in_matrix][rows_in_matrix], float matrix[rows_in_matrix][columnes_in_matrix])
{
int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{
transposed_matrix[j][i] = matrix[i][j];
}
}
}
This one function can work for the different number of rows_in_matrix and columnes_in_matrix. Call it like this:
matrix_transposition (2, 4, transposed_matrix_A, input_matrix_A);
matrix_transposition (4, 4, transposed_matrix_B, input_matrix_B);

You probably don't want to be hard-coding array sizes in your program. I suggest a structure that contains a single flat array, which you can then interpret in two dimensions:
typedef struct {
size_t width;
size_t height;
float *elements;
} Matrix;
Initialize it with
int matrix_init(Matrix *m, size_t w, size_t h)
{
m->elements = malloc((sizeof *m->elements) * w * h);
if (!m->elements) {
m->width = m->height = 0;
return 0; /* failed */
}
m->width = w;
m->height = h;
return 1; /* success */
}
Then, to find the element at position (x,y), we can use a simple function:
float *matrix_element(Matrix *m, size_t x, size_t y)
{
/* optional: range checking here */
return m->elements + x + m->width * y;
}
This has better locality than an array of pointers (and is easier and faster to allocate and deallocate correctly), and is more flexible than an array of arrays (where, as you've found, the inner arrays need a compile-time constant size).
You might be able to use an array of arrays wrapped in a Matrix struct - it's possible you'll need a stride that is not necessarily the same as width, if the array of arrays has padding on your platform.
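To connect this back to the transposition in the question, here is a minimal sketch of a single transpose function built on the flat Matrix struct. It repeats the struct and accessors so that it compiles on its own; matrix_transpose and matrix_release are hypothetical helpers that are not part of the answer above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t width;    /* number of columns */
    size_t height;   /* number of rows */
    float *elements; /* width * height values, stored row by row */
} Matrix;

/* Allocate a w x h matrix; returns 1 on success, 0 on failure. */
int matrix_init(Matrix *m, size_t w, size_t h)
{
    m->elements = malloc((sizeof *m->elements) * w * h);
    if (!m->elements) {
        m->width = m->height = 0;
        return 0;
    }
    m->width = w;
    m->height = h;
    return 1;
}

/* Pointer to the element in column x, row y. */
float *matrix_element(Matrix *m, size_t x, size_t y)
{
    assert(x < m->width && y < m->height);
    return m->elements + x + m->width * y;
}

/* Hypothetical helper: allocate dst as the transpose of src.
 * Works for any size because the dimensions travel with the struct. */
int matrix_transpose(Matrix *dst, Matrix *src)
{
    if (!matrix_init(dst, src->height, src->width))
        return 0;
    for (size_t y = 0; y < src->height; ++y)
        for (size_t x = 0; x < src->width; ++x)
            *matrix_element(dst, y, x) = *matrix_element(src, x, y);
    return 1;
}

/* Hypothetical helper: release the storage of a matrix. */
void matrix_release(Matrix *m)
{
    free(m->elements);
    m->elements = NULL;
    m->width = m->height = 0;
}
```

The same function then serves the 2x4 and the 4x4 case, and the result can be passed on to the next operation without the callee knowing its size at compile time.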

Segmentation fault (core dumped) [Conway's game of life]

I'm working on a C implementation for Conway's game of life, I have been asked to use the following header:
#ifndef game_of_life_h
#define game_of_life_h
#include <stdio.h>
#include <stdlib.h>
// a structure containing a square board for the game and its size
typedef struct gol{
int **board;
size_t size;
} gol;
// dynamically creates a struct gol of size 20 and returns a pointer to it
gol* create_default_gol();
// creates dynamically a struct gol of a specified size and returns a pointer to it.
gol* create_gol(size_t size);
// destroy gol structures
void destroy_gol(gol* g);
// the board of 'g' is set to 'b'. You do not need to check if 'b' has a proper size and values
void set_pattern(gol* g, int** b);
// using rules of the game of life, the function sets next pattern to the g->board
void next_pattern(gol* g);
/* returns sum of all the neighbours of the cell g->board[i][j]. The function is an auxiliary
function and should be used in the following function. */
int neighbour_sum(gol* g, int i, int j);
// prints the current pattern of the g-board on the screen
void print(gol* g);
#endif
I have added the comments to help out with an explanation of what each bit is.
gol.board is a 2-level integer array, containing x and y coordinates, ie board[x][y], each coordinate can either be a 1 (alive) or 0 (dead).
This was all a bit of background information, I'm trying to write my first function create_default_gol() that will return a pointer to a gol instance, with a 20x20 board.
I then attempt to go through each coordinate through the 20x20 board and set it to 0, I am getting a Segmentation fault (core dumped) when running this program.
The below code is my c file containing the core code, and the main() function:
#include "game_of_life.h"
int main()
{
// Create a 20x20 game
gol* g_temp = create_default_gol();
int x,y;
for (x = 0; x < 20; x++)
{
for (y = 0; y < 20; y++)
{
g_temp->board[x][y] = 0;
}
}
free(g_temp);
}
// return a pointer to a 20x20 game of life
gol* create_default_gol()
{
gol* g_rtn = malloc(sizeof(*g_rtn) + (sizeof(int) * 20 * 20));
return g_rtn;
}
This is the first feature I'd like to implement, being able to generate a 20x20 board with 0's (dead) state for every coordinate.
Please feel free to criticise my code, I'm looking to determine why I'm getting the segmentation fault, and if I'm allocating memory properly in the create_default_gol() function.
Thanks!

The type int **board; means that board must contain an array of pointers, each of which points to the start of each row. Your existing allocation omits this, and just allocates *g_rtn plus the ints in the board.
The canonical way to allocate your board, supposing that you must stick to the type int **board;, is:
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn->size = size;
g_rtn->board = malloc(size * sizeof *g_rtn->board);
for (int i = 0; i < size; ++i)
g_rtn->board[i] = malloc(size * sizeof **g_rtn->board);
This code involves a lot of small malloc chunks. You could condense the board rows and columns into a single allocation, but then you also need to set up pointers to the start of each row, because board must be an array of pointers to int.
Another issue with this approach is alignment. It's guaranteed that a malloc result is aligned for any type; however it is possible that int has stricter alignment requirements than int *. My following code assumes that it doesn't; if you want to be portable then you could add in some compile-time checks (or run it and see if it aborts!).
The amount of memory required is the sum of the last two mallocs:
g_rtn->board = malloc( size * size * sizeof **g_rtn->board
+ size * sizeof *g_rtn->board );
Then the first row will start after the end of the row-pointers (a cast is necessary because we are converting int ** to int *, and using void * means we don't have to repeat the word int):
g_rtn->board[0] = (void *) (g_rtn->board + size);
And the other rows each have size ints in them:
for (int i = 1; i < size; ++i)
g_rtn->board[i] = g_rtn->board[i-1] + size;
Note that this is a whole lot more complicated than just using a 1-D array and doing arithmetic for the offsets, but it was stipulated that you must have two levels of indirection to access the board.
Also this is more complicated than the "canonical" version. In this version we are trading code complexity for the benefit of having a reduced number of mallocs. If your program typically only allocates one board, or a small number of boards, then perhaps this trade-off is not worth it and the canonical version would give you fewer headaches.
Finally - it would be possible to allocate both *g_rtn and the board in the single malloc, as you attempted to do in your question. However my advice (based on experience) is that it is simpler to keep the board separate. It makes your code clearer, and your object easier to use and make changes to, if the board is a separate allocation to the game object.
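Put together, the single-allocation variant described above might look like the following sketch. It assumes, as stated earlier, that int is not more strictly aligned than int *; the function names create_gol_compact and destroy_gol_compact are made up here, and calloc is used so that every cell starts out as 0 (dead).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct gol {
    int **board;
    size_t size;
} gol;

/* Hypothetical variant of create_gol: the row pointers and all cells
 * live in one block, so the board needs a single free(). */
gol *create_gol_compact(size_t size)
{
    assert(size > 0);
    gol *g = malloc(sizeof *g);
    if (g == NULL)
        return NULL;

    /* size row pointers followed by size * size ints, zero-filled */
    g->board = calloc(1, size * sizeof *g->board
                         + size * size * sizeof **g->board);
    if (g->board == NULL) {
        free(g);
        return NULL;
    }

    /* the first row starts right after the row-pointer table */
    g->board[0] = (void *) (g->board + size);
    for (size_t i = 1; i < size; ++i)
        g->board[i] = g->board[i - 1] + size;

    g->size = size;
    return g;
}

/* Release a game created by create_gol_compact. */
void destroy_gol_compact(gol *g)
{
    if (g != NULL) {
        free(g->board);
        free(g);
    }
}
```

A game created this way is still indexed as g->board[x][y], so the rest of the program does not need to know how the memory was laid out.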

create_default_gol() fails to initialise board, so applying the [] operator to it (in main()) makes the program access invalid memory, which provokes undefined behaviour.
Although enough memory is allocated, the code still needs to make board point to that memory by doing
g_rtn->board = (int **) (((char*) g_rtn) + sizeof(*g_rtn));
Update
As pointed out in Matt McNabb's comment, board points to an array of pointers to int, so the initialisation is more complicated:
gol * g_rtn = malloc(sizeof(*g_rtn) + 20 * sizeof(*g_rtn->board));
g_rtn->board = (int **) (((char*) g_rtn) + sizeof(*g_rtn));
for (size_t i = 0; i<20; ++i)
{
g_rtn->board[i] = malloc(20 * sizeof(*g_rtn->board[i]));
}
Also, the code fails to set gol's member size. From what you tell us it is not clear whether it should hold the number of bytes, rows/columns or fields.
Also^2: hard-coding "magic numbers" like 20 is a bad habit.
Also^3: create_default_gol does not specify any parameters, which explicitly allows any number of arguments, and not none as you might perhaps have expected.
All in all I'd code create_default_gol() like this:
gol * create_default_gol(const size_t rows, const size_t columns)
{
gol * g_rtn;
size_t size_rows = rows * sizeof(*g_rtn->board);
size_t size_columns = columns * sizeof(**g_rtn->board);
g_rtn = malloc(sizeof(*g_rtn) + size_rows); /* The struct plus the row-pointer table. */
if (NULL != g_rtn)
{
g_rtn->board = (int **) (((char*) g_rtn) + sizeof(*g_rtn)); /* The table starts right after the struct. */
for (size_t i = 0; i<rows; ++i)
{
g_rtn->board[i] = malloc(size_columns); /* TODO: Add error checking here. */
}
g_rtn->size = rows; /* Or whatever this attribute is meant for. */
}
return g_rtn;
}

gol* create_default_gol()
{
int **a,i;
a = (int**)malloc(20 * sizeof(int *));
for (i = 0; i < 20; i++)
a[i] = (int*)malloc(20 * sizeof(int));
gol* g_rtn = (gol*)malloc(sizeof(*g_rtn));
g_rtn->board = a;
return g_rtn;
}
int main()
{
// Create a 20x20 game
gol* g_temp = create_default_gol();
int x,y;
for (x = 0; x < 20; x++)
{
for (y = 0; y < 20; y++)
{
g_temp->board[x][y] = 10;
}
}
for(x=0;x<20;x++)
free(g_temp->board[x]);
free(g_temp->board);
free(g_temp);
}

int main (void)
{
gol* gameOfLife;
gameOfLife = create_default_gol();
free(gameOfLife);
}
gol* create_default_gol()
{
int size = 20;
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn->size = size;
g_rtn->board = malloc(size * sizeof *g_rtn->board);
int i, b;
for (i = 0; i < size; ++i){
g_rtn->board[i] = malloc(sizeof (int) * size);
for(b=0;b<size;b++){
g_rtn->board[i][b] = 0;
}
}
return g_rtn;
}
Alternatively, since you also need to add a create_gol(size_t new_size) of custom size, you could also write it as the following.
int main (void)
{
gol* gameOfLife;
gameOfLife = create_default_gol();
free(gameOfLife);
}
gol* create_default_gol()
{
size_t size = 20;
return create_gol(size);
}
gol* create_gol(size_t new_size)
{
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn->size = new_size;
g_rtn->board = malloc(new_size * sizeof *g_rtn->board);
size_t i, b;
for (i = 0; i < new_size; ++i){
g_rtn->board[i] = malloc(sizeof (int) * new_size);
for(b=0;b<new_size;b++){
g_rtn->board[i][b] = 0;
}
}
return g_rtn;
}
Doing this just minimizes the amount of code needed.

Sending 2D array to Cuda Kernel

I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix of rows and items (so in my example of 10 rows with 30 data points, it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is that I am trying to create my array in my function as int *, but how else can I create it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same outcome to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and have it loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas? Or suggestions of where to look?
Edit:
To test this, I just want to add the value of each row together(I copied the logic from the cuda by example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );

You have some mistakes in your code.
When you copy a host array to the device, you should pass a one-dimensional host pointer. See the cudaMemcpy2D function signature.
You don't need to declare a static 2D array for device memory. That creates a static array in host memory, which you then try to re-create as a device array. Keep in mind that the device pointer must be one-dimensional, too. See the cudaMallocPitch function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int);// needs to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((void**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
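To access element (x, y) of your 2D array in the kernel, you should use the pattern (T*)((char*)base_addr + y * pitch_d) + x, where T is the element type.
WARNING: the pitch is always in bytes. You need to cast your pointer to char* before adding the row offset.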
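The byte-based addressing can be checked without a GPU. The sketch below is a host-only analogy in plain C: padded_pitch, copy_2d and row_sum are hypothetical helpers, the alignment passed to padded_pitch is an arbitrary assumption (it is not the pitch cudaMallocPitch would actually choose), and copy_2d only imitates the argument order of cudaMemcpy2D.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Round the row width (in bytes) up to a multiple of align.
 * Stand-in for the pitch that a pitched allocation hands back. */
size_t padded_pitch(size_t width_bytes, size_t align)
{
    assert(align > 0);
    return (width_bytes + align - 1) / align * align;
}

/* Row-by-row copy between two buffers with different pitches.
 * All sizes except height are in bytes. */
void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
             size_t width_bytes, size_t height)
{
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    for (size_t r = 0; r < height; ++r)
        memcpy((char *) dst + r * dpitch,
               (const char *) src + r * spitch, width_bytes);
}

/* Sum one row, using the same addressing as the kernel:
 * step to the row in bytes, then index the elements as int. */
int row_sum(const void *base, size_t pitch, size_t row, size_t cols)
{
    assert(pitch % sizeof(int) == 0); /* keeps every row int-aligned */
    const int *p = (const int *) ((const char *) base + row * pitch);
    int total = 0;
    for (size_t c = 0; c < cols; ++c)
        total += p[c];
    return total;
}
```

For the 10 x 30 list from the question, the source pitch is simply 30 * sizeof(int) because a plain C array has no padding, while the destination pitch is whatever the padded allocation reports.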

2d array in C with negative indices

I am writing a C-program where I need 2D-arrays (dynamically allocated) with negative indices or where the index does not start at zero. So for an array[i][j] the row-index i should take values from e.g. 1 to 3 and the column-index j should take values from e.g. -1 to 9.
For this purpose I created the following program, here the variable columns_start is set to zero, so just the row-index is shifted and this works really fine.
But when I assign other values than zero to the variable columns_start, I get the message (from valgrind) that the command "free(array[i]);" is invalid.
So my questions are:
Why it is invalid to free the memory that I allocated just before?
How do I have to modify my program to shift the column-index?
Thank you for your help.
#include <stdio.h>
#include <stdlib.h>
main()
{
int **array, **array2;
int rows_end, rows_start, columns_end, columns_start, i, j;
rows_start = 1;
rows_end = 3;
columns_start = 0;
columns_end = 9;
array = malloc((rows_end-rows_start+1) * sizeof(int *));
for(i = 0; i <= (rows_end-rows_start); i++) {
array[i] = malloc((columns_end-columns_start+1) * sizeof(int));
}
array2 = array-rows_start; //shifting row-index
for(i = rows_start; i <= rows_end; i++) {
array2[i] = array[i-rows_start]-columns_start; //shifting column-index
}
for(i = rows_start; i <= rows_end; i++) {
for(j = columns_start; j <= columns_end; j++) {
array2[i][j] = i+j; //writing stuff into array
printf("%i %i %d\n",i, j, array2[i][j]);
}
}
for(i = 0; i <= (rows_end-rows_start); i++) {
free(array[i]);
}
free(array);
}

When you shift the column indexes, you assign new values to the original array of row pointers: in
array2[i] = array[i-rows_start]-columns_start;
array2[i] and array[i-rows_start] are the same memory cell, because array2 is initialized with array-rows_start.
So deallocating the memory requires the reverse shift. Try the following:
free(array[i] + columns_start);
IMHO, shifting array indexes like this gives no benefit, while it complicates the program logic and leads to errors. Try to adjust the indexes on the fly inside a single loop instead.

#include <stdio.h>
#include <stdlib.h>
int main(void) {
int a[] = { -1, 41, 42, 43 };
int *b;//you will always read the data via this pointer
b = &a[1];// 1 is becoming the "zero pivot"
printf("zero: %d\n", b[0]);
printf("-1: %d\n", b[-1]);
return EXIT_SUCCESS;
}
If you don't need just a contiguous block, then you may be better off with hash tables instead.

As far as I can see, your free and malloc look good. But your shifting doesn't make sense. Why don't you just add an offset to your index instead of using array2:
int maxNegValue = 10;
int myNegValue = -6;
array[x][myNegValue+maxNegValue] = ...;
This way, you're always in the non-negative range.
For malloc: you allocate (maxNegValue + maxPosValue) * sizeof(...)
OK, I understand now that you need free(array.. + offset); when using your shifting approach, which is probably not what you want. If you don't need a very fast implementation, I'd suggest using a struct containing the offset and an array. Then create a function that takes this struct and x/y as arguments to access the array.

I don't know why valgrind would complain about that free statement, but there seems to be a lot of pointer juggling going on so it doesn't surprise me that you get this problem in the first place. For instance, one thing which caught my eye is:
array2 = array-rows_start;
This will make array2[0] dereference memory which you didn't allocate. I fear it's just a matter of time until you get the offset calculations wrong and run into this problem.
In one comment you wrote
but im my program I need a lot of these arrays with all different beginning indices, so I hope to find a more elegant solution instead of defining two offsets for every array.
I think I'd hide all this in a matrix helper struct (+ functions) so that you don't have to clutter your code with all the offsets. Consider this in some matrix.h header:
struct matrix; /* opaque type */
/* Allocates a matrix with the given dimensions, sample invocation might be:
*
* struct matrix *m;
* matrix_alloc( &m, -2, 14, -9, 33 );
*/
void matrix_alloc( struct matrix **m, int minRow, int maxRow, int minCol, int maxCol );
/* Releases resources allocated by the given matrix, e.g.:
*
* struct matrix *m;
* ...
* matrix_free( m );
*/
void matrix_free( struct matrix *m );
/* Get/Set the value of some element in the matrix; takes logical (potentially negative)
* coordinates and translates them to zero-based coordinates internally, e.g.:
*
* struct matrix *m;
* ...
* int val = matrix_get( m, 9, -7 );
*/
int matrix_get( struct matrix *m, int row, int col );
void matrix_set( struct matrix *m, int row, int col, int val );
And here's what an implementation might look like (this would be matrix.c):
struct matrix {
int minRow, maxRow, minCol, maxCol;
int **elem;
};
void matrix_alloc( struct matrix **m, int minRow, int maxRow, int minCol, int maxCol ) {
int numRows = maxRow - minRow + 1; /* assuming both bounds are inclusive, as in the question */
int numCols = maxCol - minCol + 1;
*m = malloc( sizeof( struct matrix ) );
(*m)->elem = malloc( numRows * sizeof( *(*m)->elem ) );
for ( int i = 0; i < numRows; ++i )
(*m)->elem[i] = malloc( numCols * sizeof( int ) );
/* setting other fields of the matrix omitted for brevity */
}
void matrix_free( struct matrix *m ) {
/* omitted for brevity */
}
int matrix_get( struct matrix *m, int row, int col ) {
return m->elem[row - m->minRow][col - m->minCol];
}
void matrix_set( struct matrix *m, int row, int col, int val ) {
m->elem[row - m->minRow][col - m->minCol] = val;
}
This way you only need to get this stuff right once, in a central place. The rest of your program doesn't have to deal with raw arrays but rather the struct matrix type.
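For reference, a complete version of this idea could look like the sketch below. It deviates from the outline above in two ways that are assumptions of this example: the elements live in one flat, zero-initialised block instead of an array of row pointers, and the constructor is a hypothetical matrix_create that returns the pointer. All four bounds are treated as inclusive.

```c
#include <assert.h>
#include <stdlib.h>

struct matrix {
    int minRow, maxRow, minCol, maxCol; /* inclusive logical bounds */
    int *elem;                          /* rows * cols values, row by row */
};

/* Hypothetical constructor: returns NULL if allocation fails. */
struct matrix *matrix_create(int minRow, int maxRow, int minCol, int maxCol)
{
    assert(minRow <= maxRow && minCol <= maxCol);
    size_t rows = (size_t) (maxRow - minRow) + 1;
    size_t cols = (size_t) (maxCol - minCol) + 1;

    struct matrix *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->elem = calloc(rows * cols, sizeof *m->elem);
    if (m->elem == NULL) {
        free(m);
        return NULL;
    }
    m->minRow = minRow; m->maxRow = maxRow;
    m->minCol = minCol; m->maxCol = maxCol;
    return m;
}

/* Translate logical (possibly negative) coordinates to a flat offset. */
static size_t matrix_offset(const struct matrix *m, int row, int col)
{
    assert(row >= m->minRow && row <= m->maxRow);
    assert(col >= m->minCol && col <= m->maxCol);
    size_t cols = (size_t) (m->maxCol - m->minCol) + 1;
    return (size_t) (row - m->minRow) * cols + (size_t) (col - m->minCol);
}

int matrix_get(struct matrix *m, int row, int col)
{
    return m->elem[matrix_offset(m, row, col)];
}

void matrix_set(struct matrix *m, int row, int col, int val)
{
    m->elem[matrix_offset(m, row, col)] = val;
}

void matrix_free(struct matrix *m)
{
    if (m != NULL) {
        free(m->elem);
        free(m);
    }
}
```

Because no pointer is ever shifted, the pointers handed to free are exactly the ones returned by malloc and calloc, which avoids the invalid-free report from the question.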

Using malloc for allocation of multi-dimensional arrays with different row lengths

I have the following C code :
int *a;
size_t size = 2000*sizeof(int);
a = malloc(size);
which works fine. But if I have the following :
char **b = malloc(2000*sizeof *b);
where every element of b has different length.
How is it possible to do the same thing for b as I did for a, i.e. would the following code be correct?
char *c;
size_t size = 2000*sizeof(char *);
c = malloc(size);

First, you need to allocate array of pointers like char **c = malloc( N * sizeof( char* )), then allocate each row with a separate call to malloc, probably in the loop:
/* N is the number of rows */
/* note: c is char** */
if (( c = malloc( N*sizeof( char* ))) == NULL )
{ /* error */ }
for ( i = 0; i < N; i++ )
{
/* x_i here is the size of given row, no need to
* multiply by sizeof( char ), it's always 1
*/
if (( c[i] = malloc( x_i )) == NULL )
{ /* error */ }
/* probably init the row here */
}
/* access matrix elements: c[i] give you a pointer
* to the row array, c[i][j] indexes an element
*/
c[i][j] = 'a';
If you know the total number of elements (e.g. N*M) you can do this in a single allocation.

The typical form for dynamically allocating an NxM array of type T is
T **a = malloc(sizeof *a * N);
if (a)
{
for (i = 0; i < N; i++)
{
a[i] = malloc(sizeof *a[i] * M);
}
}
If each element of the array has a different length, then replace M with the appropriate length for that element; for example
T **a = malloc(sizeof *a * N);
if (a)
{
for (i = 0; i < N; i++)
{
a[i] = malloc(sizeof *a[i] * length_for_this_element);
}
}
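For the char ** case from the question, where every row is a string of a different length, the pattern above could be filled in roughly as follows. This is a sketch only: copy_strings and free_strings are hypothetical names, and each row is sized to hold its string plus the terminating '\0'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Free the first n rows and then the array of row pointers. */
void free_strings(char **rows, size_t n)
{
    if (rows == NULL)
        return;
    for (size_t i = 0; i < n; ++i)
        free(rows[i]);
    free(rows);
}

/* Duplicate n strings into a jagged array: one allocation for the
 * row pointers, then one allocation per row, each of its own length. */
char **copy_strings(const char *const *src, size_t n)
{
    assert(src != NULL);
    char **rows = malloc(n * sizeof *rows);
    if (rows == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i) {
        size_t len = strlen(src[i]) + 1; /* +1 for the terminator */
        rows[i] = malloc(len);
        if (rows[i] == NULL) {
            free_strings(rows, i);       /* undo the rows done so far */
            return NULL;
        }
        memcpy(rows[i], src[i], len);
    }
    return rows;
}
```

Every malloc needs a matching free, so the cleanup walks the rows first and releases the pointer array last.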

A dynamically allocated counterpart of char a[10][20] would be as follows (note that it is an array of 10 row pointers, not one contiguous block).
char **a;
a=malloc(10*sizeof(char *));
for(i=0;i<10;i++)
a[i]=malloc(20*sizeof(char));
I hope this looks simple to understand.

The other approach would be to allocate one contiguous chunk of memory comprising a header block for the pointers to rows as well as a body block to store the actual data of the rows. Then just mark up the memory by assigning addresses inside the body to the pointers in the header on a per-row basis. It would look as follows:
#include <stdio.h>
#include <stdlib.h>
int** alloc2d(int rows, const int* columns) {
size_t header = rows * sizeof(int*);
size_t body = 0;
for(int i=0; i<rows; ++i) {
body += columns[i];
}
body*=sizeof(int);
// zero-filled, so the elements can be printed before they are assigned
int** rowptr = calloc(1, header + body);
if(rowptr == NULL) return NULL;
int* buf = (int*)(rowptr + rows);
rowptr[0] = buf;
int k;
for(k = 1; k < rows; ++k) {
rowptr[k] = rowptr[k-1] + columns[k-1];
}
return rowptr;
}
int main() {
// specifying column amount on per-row basis
int columns[] = {1,2,3};
int rows = sizeof(columns)/sizeof(int);
int** matrix = alloc2d(rows, columns);
// using allocated array
for(int i = 0; i<rows; ++i) {
for(int j = 0; j<columns[i]; ++j) {
printf("%d, ", matrix[i][j]);
}
printf("\n");
}
// now it is time to get rid of allocated
// memory in only one call to "free"
free(matrix);
}
The advantage of this approach is elegant freeing of memory and ability to use array-like notation to access elements of the resulting 2D array.

If every element in b has different lengths, then you need to do something like:
int totalLength = 0;
for_every_element_in_b {
totalLength += length_of_this_b_in_bytes;
}
return malloc(totalLength);

I think a 2-step approach is best, because C 2-D arrays are just an array of arrays. The first step is to allocate a single array, then loop through it, allocating an array for each row as you go. This article gives good detail.

2-D Array Dynamic Memory Allocation
int **a,i;
// for any number of rows & columns this will work
a = malloc(rows*sizeof(int *));
for(i=0;i<rows;i++)
*(a+i) = malloc(cols*sizeof(int));

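malloc returns memory that is suitably aligned for any object type with a fundamental alignment requirement, so you do not have to assume that it only allocates on a byte boundary.
The returned pointer can therefore be converted to int *, char ** or any other object pointer type and used directly. Alignment only becomes a concern when you carve several differently typed arrays out of one block yourself, because an address at an arbitrary offset inside the block is not necessarily aligned for every type.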