I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>
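void int1DExportCSV(int *ptr, int n){
    FILE *f;
    f = fopen("1D IntOutput.CSV", "w");
    int i = 0;
    // write the first n-1 values comma-separated, then the last one without a trailing comma
    for (i = 0; i < n-1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n-1]);
    fclose(f);
}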
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main function allocates an array of zeros called a, allocates an uninitialized array b, and allocates a device copy of a called dev_a:
#define DIM 64
int main(void){
int *a;
a = (int*)malloc(DIM*DIM*sizeof(int));
int i;
for(i = 0; i < DIM*DIM; i++){
a[i] = 0;
}
int *b;
b = (int*)malloc(DIM*DIM*sizeof(int));
int *dev_a;
cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
dim3 blocks(DIM,DIM,DIM);
kernel<<<blocks,DIM>>>(dev_a);
cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file has DIM*DIM entries, all equal to DIM. The length is correct, but every entry should be DIM*DIM: I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, and for each (x, y) pair the last two dimensions (blockIdx.z and threadIdx.x) are all devoted to incrementing the same element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be a culprit, since multiple threads are potentially executing this step at the exact same time, and so each thread might be updating an old copy of ptr while unaware that there are a bunch of other threads doing it at the same time. However, I don't know enough about the "taboo's of CUDA" to tell if this is a reasonable guess or not.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.

No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block (and for every block that shares the same blockIdx.x and blockIdx.y):
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.


How to avoid errors with AVX2 when the matrix dimension isn't a multiple of 4?

I wrote a matrix-vector multiplication program in C using AVX2 and FMA. I compiled it with GCC 7 using -mfma and -mavx.
However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."
I think the error occurs when the matrix dimension isn't a multiple of 4.
I know AVX2 uses ymm registers, each of which holds 4 double-precision floating-point numbers, so I can use AVX2 without error when the matrix dimension is a multiple of 4.
But here is my question:
How can I use AVX2 efficiently if the matrix dimension isn't a multiple of 4?
Here is my code.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"
void mv(double *a,double *b,double *c, int m, int n, int l)
{
    __m256d va,vb,vc;
    int k;
    int i;
    for (k = 0; k < l; k++) {
        vb = _mm256_broadcast_sd(&b[k]);
        for (i = 0; i < m; i+=4) {
            va = _mm256_loadu_pd(&a[m*k+i]);
            vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(vc, va, vb);
            _mm256_storeu_pd( &c[i], vc );
        }
    }
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
int temp=0;
struct timespec startTime, endTime;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
/* ... */
}
for (i=m;i<m*m;i++) {
/* ... */
}
// check start time
clock_gettime(CLOCK_REALTIME, &startTime);
mv(a, b, c, m, 1, m);
// check end time
clock_gettime(CLOCK_REALTIME, &endTime);
return 0;
}
You load and store vectors of 4 double, but your loop condition only checks that the first vector element is in-bounds, so you can write outside objects by up to 3x8 = 24 bytes when m is not a multiple of 4.
You need something like i < (m-3) in main loop, and a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check that it's ok to do multiple future elements in the loop condition.
A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. up to 1), before going scalar.
In many cases (e.g. write-only destination) an unaligned vector load that ends at the end of your arrays is very good (when m>=4). It can overlap with your main loop if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.
But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would make it wrong.
// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested
void mv(double *a,double *b,double *c, int m, int n, int l)
{
    /* the loop below should actually work for m=1..3, but a separate strategy might be good.
    if (m < 4) {
        // maybe check m >= 2 and use __m128 vectors?
        // or vectorize differently?
    }
    */
    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < (m-3); i+=4) {
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);   // c += a * b
            _mm256_storeu_pd( &c[i], vc );
        }
        if (i<(m-1)) {
            __m128d lasta = _mm_loadu_pd(&a[m*k+i]);
            __m128d lastc = _mm_loadu_pd(&c[i]);
            lastc = _mm_fmadd_pd(lasta, _mm256_castpd256_pd128(vb), lastc);
            _mm_storeu_pd( &c[i], lastc );
            // i+=2; // last element only checks m odd/even, doesn't use i
        }
        // if (i<m)
        if (m&1) {
            // odd number of elements, do the last non-vector one
            c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
        }
    }
}
I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).
Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as masked store.
You could also look at aligning your rows for efficiency, or a row stride that's separate from the logical length of rows. (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write padding.
Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.

Merge sort using CUDA: efficient implementation for small input arrays

I have the following problem: given two sorted arrays A and B, I have to produce a sorted array C with the elements of A and B.
I found some solutions for this problem using CUDA, for example Merge Path.
However, their difficulty comes from the size of the original arrays, at least 10k elements. I have a different problem.
The arrays I have to merge are much smaller (1000 elements at most), and the complexity comes from the number of merges to be done (on the order of 10^10: 10^5 arrays of size ~1000 to be merged with each other).
Part of their problem is to split the original arrays into equally sized parts that are processed in parallel. The arrays I have to merge are small enough to entirely fit in the GPU shared memory.
Thrust is not my first choice because the output of my procedure is not the sorted array itself, but a calculation with its elements, so I think that a specialized kernel should be faster than first sorting the element indices and then using them for the calculation.
A serial version of the algorithm looks like:
while i<N and j<M:
if A[i]<B[j]:
start_i = max(0,i-T)
start_j = max(0,j-T)
while i<N:
start_i = max(0,i-T)
while j<M:
start_j = max(0,j-T)
How can I exploit CUDA capabilities to solve this problem?
The two most important optimization goals for any CUDA program should be to:
expose (sufficient) parallelism
make efficient use of memory
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same as a merge-sort) is, at first glance, an inherently serial operation. We cannot make a proper decision about which item to choose, from either A or B input array, to place next in the output array C, until we have made all previous selections in C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and keep all the SP's in an SM busy. Once a sufficient decomposition of work is performed, each SP is ultimately performing a sequential merge (as mentioned in the paper):
Merging stage - Each core merges the two sub arrays
that it has been given using the same algorithm as a
simple sequential merge.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each to a GPU SP (i.e. thread). This means that we can then, just as in the paper, use a simple sequential merge on each core/SP/thread. So the problem of exposing parallelism is, in your case, already solved (to perhaps a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However what we discover is the performance is not very good, and this is due to the fact that a merge algorithm fundamentally has a data-dependent access sequence, and so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is less.
I'll propose a different approach:
arrange the sequential merge algorithm so that each element of A and B need only be read once
arrange the storage of A, B, and C in column-major form as opposed to the more "natural" row-major storage that one might consider. This is effectively transposing the storage matrices for A, B, and C vectors. This allows for an improvement in coalesced access, as the GPU threads navigate their way through the merging operation on their individual A and B vectors. It's far from perfect, but the improvement is substantial.
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
$ cat
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#define NUM_SETS 100000
#define DSIZE 100
typedef int mytype;
// for ascending sorted data
#define cmp(A,B) ((A)<(B))
#define nTPB 512
#define nBLK 128
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__host__ __device__ void smerge(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, const unsigned len_a, const unsigned len_b, const unsigned stride_a = 1, const unsigned stride_b = 1, const unsigned stride_c = 1){
unsigned len_c = len_a+len_b;
unsigned nc = 0;
unsigned na = 0;
unsigned nb = 0;
unsigned fa = (len_b == 0);
unsigned fb = (len_a == 0);
T nxta = a[0];
T nxtb = b[0];
while (nc < len_c){
if (fa) {c[stride_c*nc++] = nxta; na++; nxta = a[stride_a*na];}
else if (fb) {c[stride_c*nc++] = nxtb; nb++; nxtb = b[stride_b*nb];}
else if (cmp(nxta,nxtb)){
c[stride_c*nc++] = nxta;
na++;
if (na == len_a) fb++;
else nxta = a[stride_a*na];}
else {
c[stride_c*nc++] = nxtb;
nb++;
if (nb == len_b) fa++;
else nxtb = b[stride_b*nb];}}
}
template <typename T>
__global__ void rmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
while (idx < num_arr){
int sel=idx*len;
smerge(a+sel, b+sel, c+(2*sel), len, len);
idx += blockDim.x*gridDim.x;}
}
template <typename T>
__global__ void cmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len, int stride_a, int stride_b, int stride_c){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
while (idx < num_arr){
smerge(a+idx, b+idx, c+idx, len, len, stride_a, stride_b, stride_c);
idx += blockDim.x*gridDim.x;}
}
template <typename T>
int rmvalidate(T *a, T *b, T *c, int num_arr, int len){
T *vc = (T *)malloc(2*len*sizeof(T));
for (int i = 0; i < num_arr; i++){
thrust::merge(a+(i*len), a+((i+1)*len), b+(i*len), b+((i+1)*len), vc);
#ifndef TIMING
for (int j = 0; j < len*2; j++)
if (vc[j] != c[(i*2*len)+j]) {printf("rm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c[(i*2*len)+j], vc[j]); return 0;}
#endif
}
return 1;
}
template <typename T>
int cmvalidate(const T *c1, const T *c2, int num_arr, int len){
for (int i = 0; i < num_arr; i++)
for (int j = 0; j < 2*len; j++)
if (c1[i*(2*len)+j] != c2[j*(num_arr)+i]) {printf("cm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c2[j*(num_arr)+i], c1[i*(2*len)+j]); return 0;}
return 1;
}
int main(){
mytype *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
h_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
h_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
h_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);
cudaMalloc(&d_a, (DSIZE*NUM_SETS+1)*sizeof(mytype));
cudaMalloc(&d_b, (DSIZE*NUM_SETS+1)*sizeof(mytype));
cudaMalloc(&d_c, DSIZE*NUM_SETS*sizeof(mytype)*2);
// test "row-major" storage
for (int i =0; i<DSIZE*NUM_SETS; i++){
h_a[i] = rand();
h_b[i] = rand();}
thrust::sort(h_a, h_a+DSIZE*NUM_SETS);
thrust::sort(h_b, h_b+DSIZE*NUM_SETS);
cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long gtime = dtime_usec(0);
rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
cudaDeviceSynchronize(); // kernel launches are asynchronous; wait before stopping the timer
gtime = dtime_usec(gtime);
cudaMemcpy(h_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
unsigned long ctime = dtime_usec(0);
if (!rmvalidate(h_a, h_b, h_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
ctime = dtime_usec(ctime);
printf("CPU time: %f, GPU RM time: %f\n", ctime/(float)USECPSEC, gtime/(float)USECPSEC);
// test "col-major" storage
mytype *ch_a, *ch_b, *ch_c;
ch_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2); // merged output is twice the input size
for (int i = 0; i < NUM_SETS; i++)
for (int j = 0; j < DSIZE; j++){
ch_a[j*NUM_SETS+i] = h_a[i*DSIZE+j];
ch_b[j*NUM_SETS+i] = h_b[i*DSIZE+j];}
cudaMemcpy(d_a, ch_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, ch_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
gtime = dtime_usec(0);
cmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE, NUM_SETS, NUM_SETS, NUM_SETS );
cudaDeviceSynchronize(); // kernel launches are asynchronous; wait before stopping the timer
gtime = dtime_usec(gtime);
cudaMemcpy(ch_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
if (!cmvalidate(h_c, ch_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
printf("GPU CM time: %f\n", gtime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -DTIMING -o t784
$ ./t784
CPU time: 0.030691, GPU RM time: 0.045814
GPU CM time: 0.002784
The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row major. But for the column-major organization (which tends to improve opportunities for coalesced access) the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup factor is roughly in the range (~10-20x) of the speedup factors shown in the paper for a GPU MergePath 32-bit integer speedup vs. x86 serial merge.
Using int vs. float datatypes makes a significant difference in the CPU timing. int seems to be faster (on the CPU) so I'm showing that version here. (This disparity is mentioned in the paper as well.)
The -DTIMING switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
The basic merge code is templated to be able to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper CUDA error checking.
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.

Problems with very simple tutorial in Cuda by Example

I'm studying Cuda C on the "Cuda by Example" book. At chapter 4 there's a very simple tutorial about how to sum 2 vectors.
I basically copied the tutorial:
#include <stdio.h>
#include <stdlib.h>
#define N 5
__global__ void Add(int *a, int*b, int *c){
int i = blockIdx.x;
c[i] = a[i] + b[i];
}
int main(){
int a[N] = {1,2,3,4,5}, b[N] = {5,6,7,8,9};
int c[N];
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, N*sizeof(int));
cudaMalloc((void**)&dev_b, N*sizeof(int));
cudaMalloc((void**)&dev_c, N*sizeof(int));
cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
Add<<<2,1>>>(dev_a, dev_b, dev_c); // HERE IS THE CRITICAL LINE !!!!!!
cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
int i; printf("c[i] = ");
for(i = 0; i < N; i++){
printf("%d ", c[i]);
}
return 0;
}
So according to the book, the parameter N in the line Add<<<N,1>>> is the one that tells the device to split the operations (contained in the Add function) into N blocks; the index i in each block takes a value from 0 to N-1, so that each block runs a single operation and the blocks run simultaneously (parallel computing).
Here's the problem: if I type some other number (1 or 2 or 3 or 0 and so on) instead of N (for example Add<<<2,1>>>), the program keeps giving me the sum of all the elements of the vector, while it should stop at the first, second, or third element according to the number I typed instead of N. Why do I keep getting the same result every time? Shouldn't the number of elements computed vary depending on the number of blocks I request?
Hopefully I made myself clear; if you don't understand, let me know.
You may want to initialize dev_c to a known state, e.g. to all zeros. If you ran your kernel with N threads at some point, the global memory can still contain the previous results, and the same physical region can be allocated as dev_c over and over again.
For example, add the following lines:
int c[N] = {0,0,0,0,0};
cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
One more thing to try is to add printf to the kernel, and observe the output.

Allocate contiguous memory

I'm trying to allocate a large block of contiguous memory in C and print its addresses out to the user. My strategy is to create two pointers (one a pointer to double, one a pointer to pointer to double) and malloc one of them to the entire size (m * n), in this case the pointer to pointer to double. Then I malloc the second one to the size of m. The last step is to iterate up to m and perform pointer arithmetic to ensure that the doubles in the large array are stored in contiguous memory. Here is my code. When I print out the addresses, they don't seem to be contiguous (or in any sort of order). How do I print out the memory addresses of the doubles (all of them have the value 0.0) correctly?
/* correct solution, with correct formatting */
/*The total number of bytes allocated was: 4
0x7fd5e1c038c0 - 1
0x7fd5e1c038c8 - 2
0x7fd5e1c038d0 - 3
0x7fd5e1c038d8 - 4*/
double **dmatrix(size_t m, size_t n);
int main(int argc, char const *argv[])
{
int m,n,i;
double ** f;
m = n = 2;
i = 0;
f = dmatrix(sizeof(m), sizeof(n));
printf("%s %d\n", "The total number of bytes allocated was: ", m * n);
for (i=0;i<n*m;++i) {
printf("%p - %d\n ", &f[i], i + 1);
}
return 0;
}
double **dmatrix(size_t m, size_t n) {
double ** ptr1 = (double **)malloc(sizeof(double *) * m * n);
double * ptr2 = (double *)malloc(sizeof(double) * m);
int i;
for (i = 0; i < n; i++){
ptr1[i] = ptr2+m*i;
}
return ptr1;
}
Remember that memory is just memory. Sounds trite, but so many people seem to think of memory allocation and memory management in C as being some magic-voodoo. It isn't. At the end of the day you allocate whatever memory you need, and free it when you're done.
So start with the most basic question: If you had a need for 'n' double values, how would you allocate them?
double *d1d = calloc(n, sizeof(double));
// ... use d1d like an array (d1d[0] = 100.00, etc.) ...
Simple enough. Next question, in two parts, where the first part has nothing to do with memory allocation (yet):
How many double values are in a 2D array that is m*n in size?
How can we allocate enough memory to hold them all?
There are m*n doubles in a m*n 2D-matrix of doubles
Allocate enough memory to hold (m*n) doubles.
Seems simple enough:
size_t m=10;
size_t n=20;
double *d2d = calloc(m*n, sizeof(double));
But how do we access the actual elements? A little math is in order. Knowing m and n, you can simply do this:
size_t i = 3; // value you want in the major index (0..(m-1)).
size_t j = 4; // value you want in the minor index (0..(n-1)).
d2d[i*n+j] = 100.0;
Is there a simpler way to do this? In standard C, yes; in C++ no. Standard C supports a very handy capability that generates the proper code to declare dynamically-sized indexable arrays:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
Can't stress this enough: Standard C supports this, C++ does NOT. If you're using C++ you may want to write an object class to do this all for you anyway, so it won't be mentioned beyond that.
So what does the above actually do? Well first, it should be obvious we are still allocating the same amount of memory we were allocating before. That is, m*n elements, each sizeof(double) large. But you're probably asking yourself, "What is with that variable declaration?" That needs a little explaining.
There is a clear and present difference between this:
double *ptrs[n]; // declares an array of `n` pointers to doubles.
and this:
double (*ptr)[n]; // declares a pointer to an array of `n` doubles.
The compiler is now aware of how wide each row is (n doubles in each row), so we can now reference elements in the array using two indexes:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
d2d[2][5] = 100.0; // does the 2*n+5 math for you.
Can we extend this to 3D? Of course, the math starts looking a little weird, but it is still just offset calculations into a big'ol'block'o'ram. First the "do-your-own-math" way, indexing with [i,j,k]:
size_t l=10;
size_t m=20;
size_t n=30;
double *d3d = calloc(l*m*n, sizeof(double));
size_t i=3;
size_t j=4;
size_t k=5;
d3d[i*m*n + j*n + k] = 100.0;
You need to stare at the math in that for a minute to really gel on how it computes where the double value in that big block of ram actually is. Using the above dimensions and desired indexes, the "raw" index is:
i*m*n = 3*20*30 = 1800
j*n = 4*30 = 120
k = 5 = 5
i*m*n+j*n+k = 1925
So we're hitting index 1925 in that big linear block. Let's do another. What about [0,1,2]?
i*m*n = 0*20*30 = 0
j*n = 1*30 = 30
k = 2 = 2
i*m*n+j*n+k = 32
I.e. index 32 in the linear array.
It should be obvious by now that so long as you stay within the self-prescribed bounds of your array, i:[0..(l-1)], j:[0..(m-1)], and k:[0..(n-1)] any valid index trio will locate a unique value in the linear array that no other valid trio will also locate.
Finally, we use the same array pointer declaration like we did before with a 2D array, but extend it to 3D:
size_t l=10;
size_t m=20;
size_t n=30;
double (*d3d)[m][n] = calloc(l, sizeof(*d3d));
d3d[3][4][5] = 100.0;
Again, all this really does is the same math we were doing before by hand, but letting the compiler do it for us.
I realize it may be a bit much to wrap your head around, but it is important. If it is paramount that you have contiguous memory matrices (like feeding a matrix to a graphics rendering library like OpenGL, etc.), you can do it relatively painlessly using the above techniques.
Finally, you might wonder why anyone would do the whole pointer-arrays-to-pointer-arrays-to-pointer-arrays-to-values thing in the first place if you can do it like this. There are a lot of reasons. Suppose you're replacing rows: swapping a pointer is easy; copying an entire row is expensive. Suppose you're replacing an entire table dimension (m*n) in your 3D array (l*m*n); even more so: swapping a pointer is easy; copying an entire m*n table is expensive. And the not-so-obvious answer: what if the row widths need to be independent from row to row (i.e. row0 can be 5 elements, row1 can be 6 elements)? A fixed l*m*n allocation simply doesn't work then.
Best of luck.
Never mind, I figured it out.
/* The total number of bytes allocated was: 8
0x7fb35ac038c0 - 1
0x7fb35ac038c8 - 2
0x7fb35ac038d0 - 3
0x7fb35ac038d8 - 4
0x7fb35ac038e0 - 5
0x7fb35ac038e8 - 6
0x7fb35ac038f0 - 7
0x7fb35ac038f8 - 8 */
double ***d3darr(size_t l, size_t m, size_t n);
int main(int argc, char const *argv[])
{
int m,n,l,i;
double *** f;
m = n = l = 10; i = 0;
f = d3darr(sizeof(l), sizeof(m), sizeof(n));
printf("%s %d\n", "The total number of bytes allocated was: ", m * n * l);
for (i=0;i<n*m*l;++i) {
printf("%p - %d\n ", &f[i], i + 1);
}
return 0;
}
double ***d3darr(size_t l, size_t m, size_t n){
double *** ptr1 = (double ***)malloc(sizeof(double **) * m * n * l);
double ** ptr2 = (double **)malloc(sizeof(double *) * m * n);
double * ptr3 = (double *)malloc(sizeof(double) * m);
int i, j;
for (i = 0; i < l; ++i) {
ptr1[i] = ptr2+m*n*i;
for (j = 0; j < l; ++j){
ptr2[i] = ptr3+j*n;
}
}
return ptr1;
}

Strange behaviour of an elementary CUDA code.

I am having trouble understanding the output of the following simple CUDA code. All that the code does is allocate two integer arrays: one on the host and one on the device each of size 16. It then sets the device array elements to the integer value 3 and then copies these values into the host_array where all the elements are then printed out.
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
int num_elements = 16;
int num_bytes = num_elements * sizeof(int);
int *device_array = 0;
int *host_array = 0;
// malloc host memory
host_array = (int*)malloc(num_bytes);
// cudaMalloc device memory
cudaMalloc((void**)&device_array, num_bytes);
// Constant out the device array with cudaMemset
cudaMemset(device_array, 3, num_bytes);
// copy the contents of the device array to the host
cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
// print out the result element by element
for(int i = 0; i < num_elements; ++i)
printf("%i\n", *(host_array+i));
// use free to deallocate the host array
free(host_array);
// use cudaFree to deallocate the device array
cudaFree(device_array);
return 0;
}
The output of this program is 50529027 printed line by line 16 times.
Where did this number come from? When I replace 3 with 0 in the cudaMemset call then I get correct behaviour. i.e.
0 printed line by line 16 times.
I compiled the code with nvcc on Ubuntu 10.10 with CUDA 4.0
I'm no cuda expert but 50529027 is 0x03030303 in hex. This means cudaMemset sets each byte in the array to 3 and not each int. This is not surprising given the signature of cuda memset (to pass in the number of bytes to set) and the general semantics of memset operations.
Edit: As to your (I guess) implicit question of how to achieve what you intended I think you have to write a loop and initialize each array element.
As others have pointed out, cudaMemset works like the standard C memset: it sets byte values. From the CUDA documentation:
cudaError_t cudaMemset( void * devPtr, int value, size_t count)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
If you want to set word size values, the best solution is to use your own memset kernel, perhaps something like this:
template<typename T>
__global__ void myMemset(T * x, T value, size_t count )
{
size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
size_t stride = blockDim.x * gridDim.x;
// grid-stride loop: each thread fills every stride-th element
for(size_t i=tid; i<count; i+=stride) {
x[i] = value;
}
}
which could be launched with enough blocks to cover the number of MP in your GPU, and each thread will do as many iterations as required to fill the memory allocation. Writes will be coalesced, so performance shouldn't be too bad. This could also be adapted to CUDA's vector types, if you so desired.
memset sets bytes, and integer is 4 bytes.. so what you get is 50529027 decimal, which is 0x3030303 in hex... In other words - you are using it wrong, and it has nothing to do with CUDA.
This is a classic memset limitation: it sets individual bytes, so it only produces the value you passed for 8-bit data types such as char. Here it writes the value 3 into every byte of the memory region. You can confirm this with a simple C++ program:
int main ()
{
int x=16;
size_t bytes = x*sizeof(int);
int *M = (int*)malloc(bytes);
memset(M,3,bytes);
for (int i = 0; i < x; ++i) {
printf("%d\n", M[i]);
}
return 0;
}
The usual case in which memset works for all data types is when you set the value to 0 (it sets every byte to 0 and hence all data to 0). If you change the data type to char, you'll see the desired output. cudaMemset behaves exactly like memset, the only difference being that it takes a GPU pointer as input.
So memset or cudaMemset probably sets every byte to the integer value (in your case 3) of whole memory space defined by the third argument regardless of the datatype.
Google: 50529027 in binary and you'll get the answer :)
