Sending 2D array to Cuda Kernel

Sending 2D array to Cuda Kernel - arrays

I'm having a bit of trouble understanding how to send a 2D array to Cuda. I have a program that parses a large file with a 30 data points on each line. I read about 10 rows at a time and then create a matrix for each line and items(so in my example of 10 rows with 30 data points, it would be int list[10][30]; My goal is to send this array to my kernal and have each block process a row(I have gotten this to work perfectly in normal C, but Cuda has been a bit more challenging).
Here's what I'm doing so far but no luck(note: sizeofbucket = rows, and sizeOfBucketsHoldings = items in row...I know I should win a award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is I am trying to create my array in my function as int * but how else can I create it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row] which works but I can't get the same outcome to work in Cuda.
My end goal is simple I just want to get each block to process each row(sizeOfBuckets) and then have it loop through all items in that row(sizeOfBucketHoldings). I orginally just did a normal cudamalloc and cudaMemcpy but it wasn't working so I looked around and found out about MallocPitch and 2dcopy(both of which were not in my cuda by example book) and I have been trying to study examples but they seem to be giving me the same error(I'm currently reading the CUDA_C programming guide found this idea on page22 but still no luck). Any ideas? or suggestions of where to look?
Edit:
To test this, I just want to add the value of each row together(I copied the logic from the cuda by example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );

You have some mistakes in your code.
Then you copy host array to device you should pass one dimensional host pointer.See the function signature.
You don't need to allocate static 2D array for device memory. It creates static array in host memory then you recreate it as device array. Keep in mind it must be one dimensional array, too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int);//ned to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
To access your 2D array in kernel you should use pattern base_addr + y * pitch_d + x.
WARNING: the pitvh allways in bytes. You need to cast your pointer to byte*.

Related

C pointer changes without modifying it

I am getting some weird pointer-related errors that I do not understand how can happen and how to solve it. Since I am new to C any good coding practice advice is also appreciated.
typedef struct matrix {
int n;
int m;
int **data;
int *inverses;
} Matrix;
Matrix m_new(int n, int m, int *inverses) {
int **data = malloc(n);
for (int k = 0; k < n; k++) {
int *row = malloc(m + 1);
*row++ = k;
*(data+k) = row;
// print_arr(*data); - introduced for debugging purposes
}
Matrix mat = {
.n = n,
.m = m,
.data = data,
.inverses = inverses,
};
return mat;
}
void print_arr(int *arr) {
int len = *(arr-1);
printf("len: %d\n", len);
}
Here, I am going to use this matrix for Gaussian Elimination over the integers mod some number, that's why I have the inverses array, and I use int **, so that I can swap rows easily later. For arrays represented as int * I use the -1th entry to store the length of the array. I got some weird bugs, so I wrote a small function to get the length of an array (print_arr) and check the first array in data. I got the following(for n=m=5):
len: 0
len: 0
len: 0
len: 0
len: -1219680380
Which I find really weird, because it seems like that something is changing in the last iteration, but I don't know why.

Your allocations are too small. You need to multiply the number of elements by the size of the elements to compute how much memory to allocate:
int **data = malloc(n * sizeof (int *));
int *row = malloc((m + 1) * sizeof (int));
With your current code, you overflow your too-small buffers and cause undefined behavior.

Problem with free() function in C and memory-leaks

I've a problem about deallocating memory using free() in C.
My program generates a random genealogic tree using a matrix. This matrix can be very huge depending on the number of family members. The program seemed to work fine until I decided to generate more than one tree. I noticed that generating about 100 trees causes my 8GB RAM to fill! I'm sure I can make a better code to reduce the demand of memory, but my problem remains.
I use free() to deallocate memory and there's no error. I installed Valgrind to se what's happening and it says that about 100 million byte per tree are definitely lost. This means that free() doesn't work fine. I don't now where is the problem. I link some functions that I think are correlated to the problem.
typedef struct{
int f_id;
char f_name[L_NAMES];
int generations;
int n_members;
type_people *members;
int_mtx *mtx;
}type_family;
The struct above is for the family.
typedef struct temp{
int p_id;
char name[L_NAMES];
char f_name[L_NAMES];
int generation;
int n_sons;
struct temp **sons;
int f_id;
int sex;
int age;
}type_people;
This is for the members.
typedef struct{
int i;
int j;
int **val;
}int_mtx;
And the matrix.
In the main i call the function to initialize the tree:
type_family *family_a;
family_a = malloc(sizeof(type_family));
family_a = init_family_n_gen(family_a, 6);
This is the frist part of init_family_n_gen():
type_family *init_family_n_gen(type_family *family, int n){
...
family->members = malloc(max_people * sizeof(type_people));
family->mtx = mtxcalloc(family->mtx, max_people, max_people - 1);
...
This code is for mtxcalloc that initializes the matrix:
int_mtx *mtxcalloc(int_mtx *mtx, int i, int j){
mtx = malloc(sizeof(int_mtx));
mtx->i = i;
mtx->j = j;
mtx->val = malloc(i * sizeof(int *));
for(int a = 0; a < i; a++){
mtx->val[a] = malloc(j * sizeof(int));
for(int b = 0; b < j; b++){
mtx->val[a][b] = 0;
}
}
return mtx;
}
And to conclude the code to deallocate the family:
void free_family(type_family *family){
for(int m = 0; m < family->n_members; m++){
if(family->members[m].n_sons != 0){
free(family->members[m].sons);
}
}
mtxfree(family->mtx);
free(family->members);
}
And the one to deallocate the matrix:
void mtxfree(int_mtx *mtx){
for(int i = 0; i < mtx->i; i++){
free(mtx->val[i]);
}
free(mtx->val);
free(mtx);
}
Screen capture of Valgrind output
So I call the free_family(family_a) every time i need to regenerate the family but the memory still increases. (In the photo above the number of byte become 1 billion if i regenerate the family for 50 times).
Thanks for the support!
EDITED
I made a minimal reproducible example that emulates my original code. The structs and variables are the same but I changed the functions according to Weather Vane: they are all void and I pass them the double **.
The init_family_n_gen becomes:
void init_family(type_family **f){
type_family *family = malloc(sizeof(type_family));
family->members = malloc(100 * sizeof(type_people));
for(int m = 0; m < 100; m++){
family->members[m].n_sons = 0;
}
mtxcalloc(&family->mtx, 100, 99);
family->mtx->val[0][1] = 7;
family->mtx->val[9][8] = 1;
mtxrealloc(&family->mtx, 5, 4);
*f = family;
}
The main is:
type_family *family_a;
init_family(&family_a);
free_family(&family_a);
The only thing I added is this function(Is the code right?):
void mtxrealloc(int_mtx **mtx, int i, int j){
(*mtx)->i = i;
(*mtx)->j = j;
(*mtx)->val = realloc((*mtx)->val, (*mtx)->i * sizeof(int *));
for(int a = 0; a < (*mtx)->i; a++){
(*mtx)->val[a] = realloc((*mtx)->val[a], (*mtx)->j * sizeof(int));
}
}
I noticed that the problem occours when i use the realloc function and i can't figure why. I link the images of Valgrind with and without the function mtxrealloc. (I see that there is aslo a 48 byte leak...).
Valgrind with realloc
Valgrind without realloc
Thanks again for your support!

This:
init_family(&family_a);
Causes this code from mtxcalloc to execute:
mtx->val = malloc(i * sizeof(int *));
for(int a = 0; a < i; a++){
mtx->val[a] = malloc(j * sizeof(int));
for(int b = 0; b < j; b++){
mtx->val[a][b] = 0;
}
}
, with i, j = 100, 99. That is, you allocate space for 100 pointers, and for each one, you allocate space for 99 ints. These are then accessible via family_a->mtx.
Very shortly thereafter, you make this call:
mtxrealloc(&family->mtx, 5, 4);
, which does this, among other things:
(*mtx)->val = realloc((*mtx)->val, (*mtx)->i * sizeof(int *));
That loses all the pointers (*mtx)->val[5] through (*mtx)->val[99], each of which is the sole pointer to allocated space sufficient for 99 ints. Overall, sufficient space for 9405 ints is leaked before you even perform any computations with the object you are preparing.
It is unclear why you overallocate, just to immediately (attempt to) free the excess, but perhaps that's an artifact of your code simplification. It would be much better to come up with a way to determine how much space you need in advance, and then allocate only that much in the first place. But if you do need to reallocate this particular data, then you need to first free each of the (*mtx)->val[x] that will be lost. Of course, if you were going to reallocate larger, then you would need to allocate / reallocate all of the (*mtx)->val[x].

Segmentation fault (core dumped) [Conway's game of life]

I'm working on a C implementation for Conway's game of life, I have been asked to use the following header:
#ifndef game_of_life_h
#define game_of_life_h
#include <stdio.h>
#include <stdlib.h>
// a structure containing a square board for the game and its size
typedef struct gol{
int **board;
size_t size;
} gol;
// dynamically creates a struct gol of size 20 and returns a pointer to it
gol* create_default_gol();
// creates dynamically a struct gol of a specified size and returns a pointer to it.
gol* create_gol(size_t size);
// destroy gol structures
void destroy_gol(gol* g);
// the board of 'g' is set to 'b'. You do not need to check if 'b' has a proper size and values
void set_pattern(gol* g, int** b);
// using rules of the game of life, the function sets next pattern to the g->board
void next_pattern(gol* g);
/* returns sum of all the neighbours of the cell g->board[i][j]. The function is an auxiliary
function and should be used in the following function. */
int neighbour_sum(gol* g, int i, int j);
// prints the current pattern of the g-board on the screen
void print(gol* g);
#endif
I have added the comments to help out with an explanation of what each bit is.
gol.board is a 2-level integer array, containing x and y coordinates, ie board[x][y], each coordinate can either be a 1 (alive) or 0 (dead).
This was all a bit of background information, I'm trying to write my first function create_default_gol() that will return a pointer to a gol instance, with a 20x20 board.
I then attempt to go through each coordinate through the 20x20 board and set it to 0, I am getting a Segmentation fault (core dumped) when running this program.
The below code is my c file containing the core code, and the main() function:
#include "game_of_life.h"
int main()
{
// Create a 20x20 game
gol* g_temp = create_default_gol();
int x,y;
for (x = 0; x < 20; x++)
{
for (y = 0; y < 20; y++)
{
g_temp->board[x][y] = 0;
}
}
free(g_temp);
}
// return a pointer to a 20x20 game of life
gol* create_default_gol()
{
gol* g_rtn = malloc(sizeof(*g_rtn) + (sizeof(int) * 20 * 20));
return g_rtn;
}
This is the first feature I'd like to implement, being able to generate a 20x20 board with 0's (dead) state for every coordinate.
Please feel free to criticise my code, I'm looking to determine why I'm getting the segmentation fault, and if I'm allocating memory properly in the create_default_gol() function.
Thanks!

The type int **board; means that board must contain an array of pointers, each of which points to the start of each row. Your existing allocation omits this, and just allocates *g_rtn plus the ints in the board.
The canonical way to allocate your board, supposing that you must stick to the type int **board;, is:
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn->size = size;
g_rtn->board = malloc(size * sizeof *g_rtn->board);
for (int i = 0; i < size; ++i)
g_rtn->board[i] = malloc(size * sizeof **g_rtn->board);
This code involves a lot of small malloc chunks. You could condense the board rows and columns into a single allocation, but then you also need to set up pointers to the start of each row, because board must be an array of pointers to int.
Another issue with this approach is alignment. It's guaranteed that a malloc result is aligned for any type; however it is possible that int has stricter alignment requirements than int *. My following code assumes that it doesn't; if you want to be portable then you could add in some compile-time checks (or run it and see if it aborts!).
The amount of memory required is the sum of the last two mallocs:
g_rtn->board = malloc( size * size * sizeof **g_rtn->board
+ size * sizeof *g_rtn->board );
Then the first row will start after the end of the row-pointers (a cast is necessary because we are converting int ** to int *, and using void * means we don't have to repeat the word int):
g_rtn->board[0] = (void *) (g_rtn->board + size);
And the other rows each have size ints in them:
for (int i = 1; i < size; ++i)
g_rtn->board[i] = g_rtn->board[i-1] + size;
Note that this is a whole lot more complicated than just using a 1-D array and doing arithmetic for the offsets, but it was stipulated that you must have two levels of indirection to access the board.
Also this is more complicated than the "canonical" version. In this version we are trading code complexity for the benefit of having a reduced number of mallocs. If your program typically only allocates one board, or a small number of boards, then perhaps this trade-off is not worth it and the canonical version would give you fewer headaches.
Finally - it would be possible to allocate both *g_rtn and the board in the single malloc, as you attempted to do in your question. However my advice (based on experience) is that it is simpler to keep the board separate. It makes your code clearer, and your object easier to use and make changes to, if the board is a separate allocation to the game object.

create_default_gol() misses to initialise board, so applying the [] operator to it (in main() ) the program accesses "invaid" memory and with ethis provokes undefined behaviour.
Although enough memory is allocated, the code still needs to make board point to the memory by doing
gol->board = ((char*) gol) + sizeof(*gol);
Update
As pointed out by Matt McNabb's comment board points to an array of pointers to int, so initialisation is more complicate:
gol * g_rtn = malloc(sizeof(*g_rtn) + 20 * sizeof(*gol->board));
g_rtn->board = ((char*) gol) + sizeof(*gol);
for (size_t i = 0; i<20; ++i)
{
g_rtn->board[i] = malloc(20 * sizeof(*g_rtn->board[i])
}
Also the code misses to set gol's member size. From what you tell us it is not clear whether it shall hold the nuber of bytes, rows/columns or fields.
Also^2 coding "magic numbers" like 20 is bad habit.
Also^3 create_default_gol does not specify any parameters, which explictily allows any numberm and not none as you might perhaps have expected.
All in all I'd code create_default_gol() like this:
gol * create_default_gol(const size_t rows, const size_t columns)
{
size_t size_rows = rows * sizeof(*g_rtn->board));
size_t size_column = columns * sizeof(**g_rtn->board));
gol * g_rtn = malloc(sizeof(*g_rtn) + size_rows);
g_rtn->board = ((char*) gol) + sizeof(*gol);
if (NULL ! = g_rtn)
{
for (size_t i = 0; i<columns; ++i)
{
g_rtn->board[i] = malloc(size_columns); /* TODO: Add error checking here. */
}
g_rtn->size = size_rows * size_columns; /* Or what ever this attribute is meant for. */
}
return g_rtn;
}

gol* create_default_gol()
{
int **a,i;
a = (int**)malloc(20 * sizeof(int *));
for (i = 0; i < 20; i++)
a[i] = (int*)malloc(20 * sizeof(int));
gol* g_rtn = (gol*)malloc(sizeof(*g_rtn));
g_rtn->board = a;
return g_rtn;
}
int main()
{
// Create a 20x20 game
gol* g_temp = create_default_gol();
int x,y;
for (x = 0; x < 20; x++)
{
for (y = 0; y < 20; y++)
{
g_temp->board[x][y] = 10;
}
}
for(x=0;x<20;x++)
free(g_temp->board[x]);
free(g_temp->board);
free(g_temp);
}

main (void)
{
gol* gameOfLife;
gameOfLife = create_default_gol();
free(gameOfLife);
}
gol* create_default_gol()
{
int size = 20;
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn = malloc(sizeof g_rtn);
g_rtn->size = size;
g_rtn->board = malloc(size * sizeof *g_rtn->board);
int i, b;
for (i = 0; i < size; ++i){
g_rtn->board[i] = malloc(sizeof (int) * size);
for(b=0;b<size;b++){
g_rtn->board[i][b] = 0;
}
}
return g_rtn;
}
Alternatively, since you also need to add a create_gol(size_t new_size) of custom size, you could also write it as the following.
main (void)
{
gol* gameOfLife;
gameOfLife = create_default_gol();
free(gameOfLife);
}
gol* create_default_gol()
{
size_t size = 20;
return create_gol(size);
}
gol* create_gol(size_t new_size)
{
gol* g_rtn = malloc(sizeof *g_rtn);
g_rtn = malloc(sizeof g_rtn);
g_rtn->size = new_size;
g_rtn->board = malloc(size * sizeof *g_rtn->board);
int i, b;
for (i = 0; i < size; ++i){
g_rtn->board[i] = malloc(sizeof (int) * size);
for(b=0;b<size;b++){
g_rtn->board[i][b] = 0;
}
}
return g_rtn;
}
Doing this just minimizes the amount of code needed.

Transferring data from 2d Dynamic array in C to CUDA and back

I have a dynamically declared 2D array in my C program, the contents of which I want to transfer to a CUDA kernel for further processing. Once processed, I want to populate the dynamically declared 2D array in my C code with the CUDA processed data. I am able to do this with static 2D C arrays but not with dynamically declared C arrays. Any inputs would be welcome!
I mean the dynamic array of dynamic arrays. The test code that I have written is as below.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
const int nItt = 10;
const int nP = 5;
__device__ int d_nItt = 10;
__device__ int d_nP = 5;
__global__ void arr_chk(float *d_x_k, float *d_w_k, int row_num)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
int index1 = (row_num * d_nP) + index;
if ( (index1 >= row_num * d_nP) && (index1 < ((row_num +1)*d_nP))) //Modifying only one row data pertaining to one particular iteration
{
d_x_k[index1] = row_num * d_nP;
d_w_k[index1] = index;
}
}
float **mat_create2(int r, int c)
{
float **dynamicArray;
dynamicArray = (float **) malloc (sizeof (float)*r);
for(int i=0; i<r; i++)
{
dynamicArray[i] = (float *) malloc (sizeof (float)*c);
for(int j= 0; j<c;j++)
{
dynamicArray[i][j] = 0;
}
}
return dynamicArray;
}
/* Freeing memory - here only number of rows are passed*/
void cleanup2d(float **mat_arr, int x)
{
int i;
for(i=0; i<x; i++)
{
free(mat_arr[i]);
}
free(mat_arr);
}
int main()
{
//float w_k[nItt][nP]; //Static array declaration - works!
//float x_k[nItt][nP];
// if I uncomment this dynamic declaration and comment the static one, it does not work.....
float **w_k = mat_create2(nItt,nP);
float **x_k = mat_create2(nItt,nP);
float *d_w_k, *d_x_k; // Device variables for w_k and x_k
int nblocks, blocksize, nthreads;
for(int i=0;i<nItt;i++)
{
for(int j=0;j<nP;j++)
{
x_k[i][j] = (nP*i);
w_k[i][j] = j;
}
}
for(int i=0;i<nItt;i++)
{
for(int j=0;j<nP;j++)
{
printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
}
}
int size1 = nItt * nP * sizeof(float);
printf("\nThe array size in memory bytes is: %d\n",size1);
cudaMalloc( (void**)&d_x_k, size1 );
cudaMalloc( (void**)&d_w_k, size1 );
if((nP*nItt)<32)
{
blocksize = nP*nItt;
nblocks = 1;
}
else
{
blocksize = 32; // Defines the number of threads running per block. Taken equal to warp size
nthreads = blocksize;
nblocks = ceil(float(nP*nItt) / nthreads); // Calculated total number of blocks thus required
}
for(int i = 0; i< nItt; i++)
{
cudaMemcpy( d_x_k, x_k, size1,cudaMemcpyHostToDevice ); //copy of x_k to device
cudaMemcpy( d_w_k, w_k, size1,cudaMemcpyHostToDevice ); //copy of w_k to device
arr_chk<<<nblocks, blocksize>>>(d_x_k,d_w_k,i);
cudaMemcpy( x_k, d_x_k, size1, cudaMemcpyDeviceToHost );
cudaMemcpy( w_k, d_w_k, size1, cudaMemcpyDeviceToHost );
}
printf("\nVerification after return from gpu\n");
for(int i = 0; i<nItt; i++)
{
for(int j=0;j<nP;j++)
{
printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
}
}
cudaFree( d_x_k );
cudaFree( d_w_k );
cleanup2d(x_k,nItt);
cleanup2d(w_k,nItt);
getch();
return 0;

I mean the dynamic array of dynamic arrays.
Well, that's exactly where the problem lies. A dynamic array of dynamic arrays consists of a whole bunch of disjoint memory blocks, one for each line in the array (as is clearly seen from the malloc inside you for loop in mat_create2). So you can't copy such a data structure to device memory with just one call to cudaMemcpy*. Instead, you have to do either
Also use dynamic arrays of dynamic arrays on CUDA. To do this, you have to basically recreate your mat_create2 function, using cudaMalloc instead of malloc, then copy each row seperately.
Use a "tight" 2d array on CUDA, like you do now (which is a good thing, at least performance-wise!). But if you keep using dyn-dyn-arrays on host memory, you still have copy each row seperately, like
for(int i=0; i<r; ++i){
cudaMemcpy(d_x_k + i*c, x_k[i], c*sizeof(float), cudaMemcpyHostToDevice)
}
You may wonder "why did it work with a static 2d array, then"? Well, static 2d arrays in C are proper, tight arrays that can be copied in one go. It's a bit confusing that these are indexed with exactly the same syntax as dyn-dyn arrays (arr[x][y]), because it actually works completely different.
But you should consider using tight arrays on host memory, too, perhaps with an object-oriented wrapper like
typedef struct {
float* data;
int n_rows, n_cols;
} tight2dFloatArray;
#define INDEX_TIGHT2DARRAY(arr, y, x)\
(arr).data[(y)*(arr).n_cols + (x)]
such an approach of course can be implemented much safer as a C++ class.
*You also can't copy it inside main memory with just one memcpy: that only copies the array of pointers, not the actual data.

2d array in C with negative indices

I am writing a C-program where I need 2D-arrays (dynamically allocated) with negative indices or where the index does not start at zero. So for an array[i][j] the row-index i should take values from e.g. 1 to 3 and the column-index j should take values from e.g. -1 to 9.
For this purpose I created the following program, here the variable columns_start is set to zero, so just the row-index is shifted and this works really fine.
But when I assign other values than zero to the variable columns_start, I get the message (from valgrind) that the command "free(array[i]);" is invalid.
So my questions are:
Why it is invalid to free the memory that I allocated just before?
How do I have to modify my program to shift the column-index?
Thank you for your help.
#include <stdio.h>
#include <stdlib.h>
main()
{
int **array, **array2;
int rows_end, rows_start, columns_end, columns_start, i, j;
rows_start = 1;
rows_end = 3;
columns_start = 0;
columns_end = 9;
array = malloc((rows_end-rows_start+1) * sizeof(int *));
for(i = 0; i <= (rows_end-rows_start); i++) {
array[i] = malloc((columns_end-columns_start+1) * sizeof(int));
}
array2 = array-rows_start; //shifting row-index
for(i = rows_start; i <= rows_end; i++) {
array2[i] = array[i-rows_start]-columns_start; //shifting column-index
}
for(i = rows_start; i <= rows_end; i++) {
for(j = columns_start; j <= columns_end; j++) {
array2[i][j] = i+j; //writing stuff into array
printf("%i %i %d\n",i, j, array2[i][j]);
}
}
for(i = 0; i <= (rows_end-rows_start); i++) {
free(array[i]);
}
free(array);
}

When you shift column indexes, you assign new values to original array of columns: in
array2[i] = array[i-rows_start]-columns_start;
array2[i] and array[i=rows_start] are the same memory cell as array2 is initialized with array-rows_start.
So deallocation of memory requires reverse shift. Try the following:
free(array[i] + columns_start);
IMHO, such modification of array indexes gives no benefit, while complicating program logic and leading to errors. Try to modify indexes on the fly in single loop.

#include <stdio.h>
#include <stdlib.h>
int main(void) {
int a[] = { -1, 41, 42, 43 };
int *b;//you will always read the data via this pointer
b = &a[1];// 1 is becoming the "zero pivot"
printf("zero: %d\n", b[0]);
printf("-1: %d\n", b[-1]);
return EXIT_SUCCESS;
}
If you don't need just a contiguous block, then you may be better off with hash tables instead.

As far as I can see, your free and malloc looks good. But your shifting doesn't make sense. Why don't you just add an offset in your array instead of using array2:
int maxNegValue = 10;
int myNegValue = -6;
array[x][myNegValue+maxNegValue] = ...;
this way, you're always in the positive range.
For malloc: you acquire (maxNegValue + maxPosValue) * sizeof(...)
Ok I understand now, that you need free(array.. + offset); even using your shifting stuff.. that's probably not what you want. If you don't need a very fast implementation I'd suggest to use a struct containing the offset and an array. Then create a function having this struct and x/y as arguments to allow access to the array.

I don't know why valgrind would complain about that free statement, but there seems to be a lot of pointer juggling going on so it doesn't surprise me that you get this problem in the first place. For instance, one thing which caught my eye is:
array2 = array-rows_start;
This will make array2[0] dereference memory which you didn't allocate. I fear it's just a matter of time until you get the offset calcuations wrong and run into this problem.
One one comment you wrote
but im my program I need a lot of these arrays with all different beginning indices, so I hope to find a more elegant solution instead of defining two offsets for every array.
I think I'd hide all this in a matrix helper struct (+ functions) so that you don't have to clutter your code with all the offsets. Consider this in some matrix.h header:
struct matrix; /* opaque type */
/* Allocates a matrix with the given dimensions, sample invocation might be:
*
* struct matrix *m;
* matrix_alloc( &m, -2, 14, -9, 33 );
*/
void matrix_alloc( struct matrix **m, int minRow, int maxRow, int minCol, int maxCol );
/* Releases resources allocated by the given matrix, e.g.:
*
* struct matrix *m;
* ...
* matrix_free( m );
*/
void matrix_free( struct matrix *m );
/* Get/Set the value of some elment in the matrix; takes logicaly (potentially negative)
* coordinates and translates them to zero-based coordinates internally, e.g.:
*
* struct matrix *m;
* ...
* int val = matrix_get( m, 9, -7 );
*/
int matrix_get( struct matrix *m, int row, int col );
void matrix_set( struct matrix *m, int row, int col, int val );
And here's how an implementation might look like (this would be matrix.c):
struct matrix {
int minRow, maxRow, minCol, maxCol;
int **elem;
};
void matrix_alloc( struct matrix **m, int minCol, int maxCol, int minRow, int maxRow ) {
int numRows = maxRow - minRow;
int numCols = maxCol - minCol;
*m = malloc( sizeof( struct matrix ) );
*elem = malloc( numRows * sizeof( *elem ) );
for ( int i = 0; i < numRows; ++i )
*elem = malloc( numCols * sizeof( int ) );
/* setting other fields of the matrix omitted for brevity */
}
void matrix_free( struct matrix *m ) {
/* omitted for brevity */
}
int matrix_get( struct matrix *m, int col, int row ) {
return m->elem[row - m->minRow][col - m->minCol];
}
void matrix_set( struct matrix *m, int col, int row, int val ) {
m->elem[row - m->minRow][col - m->minCol] = val;
}
This way you only need to get this stuff right once, in a central place. The rest of your program doesn't have to deal with raw arrays but rather the struct matrix type.