Unspecified launch failure - parallel scan in CUDA - c

I am using a GeForce GT 520 (compute capability v2.1) to run a program that performs the scan operation on an array of int elements. Here's the code:
/*
 * This is an implementation of the parallel scan algorithm.
 * Only a single block of threads is used. Maximum array size = 2048
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define errorCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s, file: %s line: %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void blelloch_scan(int* d_in, int* d_out, int n)
{
    extern __shared__ int temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;

    temp[2*thid] = d_in[2*thid]; // load input into shared memory
    temp[2*thid+1] = d_in[2*thid+1];

    // build sum in place up the tree
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads(); // wait for the previous level to finish
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // clear the last element
    if (thid == 0)
        temp[n - 1] = 0;

    // traverse down tree & build scan
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads(); // wait for the previous level to finish
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    d_out[2*thid] = temp[2*thid]; // write results to device memory
    d_out[2*thid+1] = temp[2*thid+1];
}

int main(int argc, char **argv)
{
    if(argc != 2)
    {
        printf("Input Syntax: ./a.out <number-of-elements>\nProgram terminated.\n");
        exit (1);
    }

    int ARRAY_SIZE = (int) atoi(*(argv+1));
    int *h_in, *h_out, *d_in, *d_out, i;
    h_in = (int *) malloc(sizeof(int) * ARRAY_SIZE);
    h_out = (int *) malloc(sizeof(int) * ARRAY_SIZE);

    cudaDeviceProp devProps;
    if (cudaGetDeviceProperties(&devProps, 0) == 0)
    {
        printf("Using device %d:\n", 0);
        printf("%s; global mem: %dB; compute v%d.%d; clock: %d kHz\n",
               devProps.name, (int)devProps.totalGlobalMem,
               (int)devProps.major, (int)devProps.minor,
               (int)devProps.clockRate);
    }

    for(i = 0; i < ARRAY_SIZE; i++)
    {
        h_in[i] = i;
    }

    errorCheck(cudaMalloc((void **) &d_in, sizeof(int) * ARRAY_SIZE));
    errorCheck(cudaMalloc((void **) &d_out, sizeof(int) * ARRAY_SIZE));
    errorCheck(cudaMemcpy(d_in, h_in, ARRAY_SIZE * sizeof(int), cudaMemcpyHostToDevice));

    blelloch_scan <<<1, ARRAY_SIZE / 2, sizeof(int) * ARRAY_SIZE>>> (d_in, d_out, ARRAY_SIZE);

    errorCheck(cudaMemcpy(h_out, d_out, ARRAY_SIZE * sizeof(int), cudaMemcpyDeviceToHost));

    for(i = 0; i < ARRAY_SIZE; i++)
    {
        printf("h_in[%d] = %d, h_out[%d] = %d\n", i, h_in[i], i, h_out[i]);
    }
    return 0;
}
After compiling with nvcc -arch=sm_21 parallel-scan.cu -o parallel-scan and running the program, I get an error:
GPUassert: unspecified launch failure, file: parallel-scan-single-block.cu line: 106
Line 106 is the line after the kernel launch, where errors are checked using errorCheck.
This is what I am planning to implement:
From the kernel, it can be seen that if a block has 1000 threads, it can operate on 2000 elements. Therefore, blockSize = ARRAY_SIZE / 2.
And, shared memory = sizeof(int) * ARRAY_SIZE
Everything is loaded into shared mem. Then, up sweep is done, with last element being set to 0. Finally, down sweep is done to give an exclusive scan of the elements.
I have used this file as the reference to write this code. I do not understand what's the mistake in my code. Any help would be greatly appreciated.

You are launching the kernel like so
blelloch_scan <<<1, ARRAY_SIZE / 2, sizeof(int) * ARRAY_SIZE>>>
meaning that within the kernel 0 <= thid < int(ARRAY_SIZE/2).
However, your kernel requires a minimum of (2 * int(ARRAY_SIZE/2)) + 1 words of available shared memory to work correctly, otherwise this:
temp[2*thid+1] = d_in[2*thid+1];
will produce an out-of-bounds shared memory access.
If my integer mathematical skillz are not too rusty, this should mean that the code will be safe if ARRAY_SIZE is odd, because ARRAY_SIZE == (2 * int(ARRAY_SIZE/2)) + 1 for any odd integer. However, if ARRAY_SIZE is even, then ARRAY_SIZE < (2 * int(ARRAY_SIZE/2)) + 1 and you have a problem.
It might be that shared memory page size granularity saves you for some even values of ARRAY_SIZE which should theoretically fail, because the hardware will always round up the dynamic shared memory allocation to the next page size larger than the request size. But there should be a number of even values of ARRAY_SIZE for which this fails.
I can't comment on whether the rest of the kernel is correct or not, but using a shared memory size of sizeof(int) * size_t(1 + ARRAY_SIZE) should make this particular problem go away.


Segmentation Fault 11 in C caused by larger operation numbers

I know that segmentation fault 11 means the program has attempted to access an area of memory that it is not allowed to access.
Here I am trying to calculate a Fourier transform, using the following code.
It works well when nPoints = 2^15 (or with fewer points); however, it crashes when I increase the number of points to 2^16. Is that caused by using too much memory? I did not notice much memory use while it ran, and although it uses recursion, it transforms in place, so I thought it would not need much memory. Where is the problem?
Thanks in advance
PS: one thing I forgot to say is that the result above was on macOS (8 GB memory).
When I run the code on Windows (16 GB memory), it crashes when nPoints = 2^14. So I am not sure whether it is caused by memory allocation, as the Windows PC has more memory (but it's hard to say, because the two operating systems use different memory strategies).
#include <stdio.h>
#include <tgmath.h>
#include <string.h>

// in place FFT with O(n) memory usage
long double PI;
typedef long double complex cplx;

void _fft(cplx buf[], cplx out[], int n, int step)
{
    if (step < n) {
        _fft(out, buf, n, step * 2);
        _fft(out + step, buf + step, n, step * 2);

        for (int i = 0; i < n; i += 2 * step) {
            cplx t = exp(-I * PI * i / n) * out[i + step];
            buf[i / 2] = out[i] + t;
            buf[(i + n)/2] = out[i] - t;
        }
    }
}

void fft(cplx buf[], int n)
{
    cplx out[n];
    for (int i = 0; i < n; i++) out[i] = buf[i];
    _fft(buf, out, n, 1);
}

int main()
{
    const int nPoints = pow(2, 15);
    PI = atan2(1.0l, 1) * 4;
    double tau = 0.1;
    double tSpan = 12.5;
    long double dt = tSpan / (nPoints-1);
    long double T[nPoints];
    cplx At[nPoints];

    for (int i = 0; i < nPoints; ++i)
    {
        T[i] = dt * (i - nPoints / 2);
        At[i] = exp( - T[i]*T[i] / (2*tau*tau));
    }

    fft(At, nPoints);
    return 0;
}
You cannot allocate very large arrays on the stack. The default stack size on macOS is 8 MiB. The size of your cplx type is 32 bytes, so an array of 2^16 cplx elements is 2 MiB, and you have two of them (one in main and one in fft), so that is 4 MiB. That fits on the stack, and, at that size, the program runs to completion when I try it. At 2^17, it fails, which makes sense because then the program has two arrays taking 8 MiB on the stack.
The proper way to allocate such large arrays is to include <stdlib.h> and use cplx *At = malloc(nPoints * sizeof *At); followed by if (!At) { /* Print some error message about being unable to allocate memory and terminate the program. */ }. You should do that for At, T, and out. Also, when you are done with each array, you should free it, as with free(At);.
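As a minimal sketch of that allocation pattern, the function below builds the T array from the question on the heap instead of the stack. make_time_axis is a hypothetical name introduced only for illustration; the formula for each sample is the one from the question's main, and the same malloc, check, and free steps would apply to At and out.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a heap-allocated time axis of n_points samples, filled with the
   same formula the question uses for T[], or NULL if allocation fails.
   Hypothetical helper; the caller is responsible for calling free(). */
long double *make_time_axis(int n_points, long double t_span)
{
    assert(n_points >= 2);

    long double *t = malloc((size_t)n_points * sizeof *t);
    if (t == NULL)
        return NULL;            /* let the caller report the failure */

    long double dt = t_span / (n_points - 1);
    for (int i = 0; i < n_points; ++i)
        t[i] = dt * (i - n_points / 2);

    return t;
}
```

A caller would check the result for NULL, use the array, and then release it with free(). Because the storage comes from the heap, its size is not limited by the stack size.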
To calculate an integer power of two, use the integer operation 1 << power, not the floating-point operation pow(2, 16). We have designed pow well on macOS, but, on other systems, it may return approximations even when exact results are possible. An approximate result may be slightly less than the exact integer value, so converting it to an integer truncates to the wrong result. If it may be a power of two larger than suitable for an int, then use (type) 1 << power, where type is a suitably large integer type.
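A short sketch of the integer shift described above, using a 64-bit unsigned type as the "suitably large integer type". The name pow2_u64 is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 2^power as an exact 64-bit integer. The shift is only defined
   for power < 64, which the assertion checks. Hypothetical helper name. */
uint64_t pow2_u64(unsigned power)
{
    assert(power < 64);
    return (uint64_t)1 << power;
}
```

For example, pow2_u64(15) is 32768, and pow2_u64(40) is 1099511627776, which would not fit in a 32-bit int. No floating-point rounding is involved.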
The following instrumented code shows that the OP's code repeatedly updates the same locations in the out[] array and does not update most of the locations in that array.
#include <stdio.h>
#include <tgmath.h>
#include <assert.h>

// in place FFT with O(n) memory usage
#define N_POINTS (1<<15)

double T[N_POINTS];
double At[N_POINTS];
double PI;

// prototypes
void _fft(double buf[], double out[], int step);
void fft( void );

int main( void )
{
    PI = 3.14159;
    double tau = 0.1;
    double tSpan = 12.5;
    double dt = tSpan / (N_POINTS-1);

    for (int i = 0; i < N_POINTS; ++i)
    {
        T[i] = dt * (i - (N_POINTS / 2));
        At[i] = exp( - T[i]*T[i] / (2*tau*tau));
    }

    fft();
    return 0;
}

void fft()
{
    double out[ N_POINTS ];
    for (int i = 0; i < N_POINTS; i++)
        out[i] = At[i];
    _fft(At, out, 1);
}

void _fft(double buf[], double out[], int step)
{
    printf( "step: %d\n", step );
    if (step < N_POINTS)
    {
        _fft(out, buf, step * 2);
        _fft(out + step, buf + step, step * 2);

        for (int i = 0; i < N_POINTS; i += 2 * step)
        {
            double t = exp(-I * PI * i / N_POINTS) * out[i + step];
            buf[i / 2] = out[i] + t;
            buf[(i + N_POINTS)/2] = out[i] - t;
            printf( "index: %d buf update: %d, %d\n", i, i/2, (i+N_POINTS)/2 );
        }
    }
}
Suggest running it like this (on Linux, where untitled1 is the name of the executable):
./untitled1 > out.txt
less out.txt
The out.txt file is 8630880 bytes.
An examination of that file shows the lack of coverage and shows that any one entry is NOT the sum of the prior two entries, so I suspect this is not a valid Fourier transform.

Negative array indexing in shared memory based 1d stencil CUDA implementation

I'm currently working with CUDA programming and I'm trying to learn off of slides from a workshop I found online, which can be found here. The problem I am having is on slide 48. The following code can be found there:
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // ...
}
To add a bit of context: we have an array called in, which has length, say, N. We then have another array out, which has length N+(2*RADIUS), where RADIUS has a value of 3 for this particular example. The idea is to copy array in into array out, but to place in at position 3 from the beginning of out, i.e. out = [RADIUS][in][RADIUS]; see the slide for a graphical representation.
The confusion comes in on the following line:
temp[lindex - RADIUS] = in[gindex - RADIUS];
If gindex is 0 then we have in[-3]. How can we read from a negative index in an array? Any help would really be appreciated.
The answer by pQB is correct. You are supposed to offset the input array pointer by RADIUS.
To show this, I'm providing below a full worked example. Hope it would be beneficial to other users.
(I would say you would need a __syncthreads() after the shared memory loads. I have added it in the below example).
#include <thrust/device_vector.h>

#define RADIUS 3
#define BLOCKSIZE 32

/* iDivUp FUNCTION */
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

/* KERNEL */
__global__ void moving_average(unsigned int *in, unsigned int *out, unsigned int N) {
    __shared__ unsigned int temp[BLOCKSIZE + 2 * RADIUS];
    unsigned int gindexx = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int lindexx = threadIdx.x + RADIUS;

    // --- Read input elements into shared memory
    temp[lindexx] = (gindexx < N)? in[gindexx] : 0;
    if (threadIdx.x < RADIUS) {
        temp[threadIdx.x] = (((gindexx - RADIUS) >= 0)&&(gindexx <= N)) ? in[gindexx - RADIUS] : 0;
        temp[threadIdx.x + (RADIUS + BLOCKSIZE)] = ((gindexx + BLOCKSIZE) < N)? in[gindexx + BLOCKSIZE] : 0;
    }

    __syncthreads();

    // --- Apply the stencil
    unsigned int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++) {
        result += temp[lindexx + offset];
    }

    // --- Store the result
    out[gindexx] = result;
}

/* MAIN */
int main() {
    const unsigned int N = 55 + 2 * RADIUS;
    const unsigned int constant = 4;

    thrust::device_vector<unsigned int> d_in(N, constant);
    thrust::device_vector<unsigned int> d_out(N);

    moving_average<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_in.data()), thrust::raw_pointer_cast(d_out.data()), N);

    thrust::host_vector<unsigned int> h_out = d_out;
    for (int i=0; i<N; i++)
        printf("Element i = %i; h_out = %i\n", i, h_out[i]);

    return 0;
}
You are assuming that the in array points to the first position of the memory that has been allocated for this array. However, if you look at slide 47, the in array has a halo (orange boxes) of three elements before and after the data (represented as green cubes).
My assumption is (I have not done the workshop) that the input array is first initialized with a halo and then the pointer is moved in the kernel call. Something like:
stencil_1d<<<dimGrid, dimBlock>>>(in + RADIUS, out);
So, in the kernel, it's safe to do in[-3] because the pointer is not at the beginning of the array.
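The same idea can be sketched on the host in plain C, without CUDA. The names stencil_1d_host and stencil_demo are hypothetical, and the buffer layout (RADIUS halo elements on each side of the data) is the assumption described above, not something taken from the workshop code.

```c
#include <assert.h>
#include <stdlib.h>

#define RADIUS 3

/* Sums the 2*RADIUS+1 neighbours of each of the n interior elements.
   `in` must point RADIUS elements past the start of a buffer holding
   n + 2*RADIUS ints, so in[-RADIUS] .. in[n-1+RADIUS] are all valid. */
void stencil_1d_host(const int *in, int *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; ++offset)
            result += in[i + offset];   /* negative offsets read the halo */
        out[i] = result;
    }
}

/* Allocates a buffer with a halo, fills every element with `fill`, and
   runs the stencil on the interior. Returns 0 on success, -1 on failure.
   Hypothetical driver for illustration. */
int stencil_demo(int n, int fill, int *out)
{
    int total = n + 2 * RADIUS;
    int *buf = malloc((size_t)total * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (int i = 0; i < total; ++i)
        buf[i] = fill;

    /* Pass the pointer moved past the halo, as in the kernel call above. */
    stencil_1d_host(buf + RADIUS, out, n);

    free(buf);
    return 0;
}
```

Inside stencil_1d_host, in[-3] refers to buf[0], which is inside the allocation, so the negative index is well defined.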
There are already good answers, but to focus on the actual point that caused the confusion:
In C (not only in CUDA, but in C in general), when you access an "array" by using the [ brackets ], you are actually doing pointer arithmetic.
For example, consider a pointer like this:
int* data= ... // Points to some memory
When you then write a statement like
data[3] = 42;
you are just accessing a memory location that is "three entries behind the original data pointer". So you could also have written
int* data= ... // Points to some memory
int* dataWithOffset = data+3;
dataWithOffset[0] = 42; // This will write into data[3]
and consequently,
dataWithOffset[-3] = 123; // This will write into data[0]
(In fact, you can say that data[i] is the same as *(data+i), which is the same as *(i+data), which in turn is the same as i[data], but you should not use this in real programs...)
I can compile @JackOLantern's code, but there is a warning: "pointless comparison of unsigned integer with zero", and the program aborts when run.
I have modified the code as follows; the warning disappears and it gives the right result:
#include <thrust/device_vector.h>

#define RADIUS 3
#define BLOCKSIZE 32

/* iDivUp FUNCTION */
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

/* KERNEL */
__global__ void moving_average(unsigned int *in, unsigned int *out, int N) {
    __shared__ unsigned int temp[BLOCKSIZE + 2 * RADIUS];
    int gindexx = threadIdx.x + blockIdx.x * blockDim.x;
    int lindexx = threadIdx.x + RADIUS;

    // --- Read input elements into shared memory
    temp[lindexx] = (gindexx < N)? in[gindexx] : 0;
    if (threadIdx.x < RADIUS) {
        temp[threadIdx.x] = (((gindexx - RADIUS) >= 0)&&(gindexx <= N)) ? in[gindexx - RADIUS] : 0;
        temp[threadIdx.x + (RADIUS + BLOCKSIZE)] = ((gindexx + BLOCKSIZE) < N)? in[gindexx + BLOCKSIZE] : 0;
    }

    __syncthreads();

    // --- Apply the stencil
    unsigned int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++) {
        result += temp[lindexx + offset];
    }

    // --- Store the result (threads past the end of the array write nothing)
    if (gindexx < N)
        out[gindexx] = result;
}

/* MAIN */
int main() {
    const int N = 55 + 2 * RADIUS;
    const unsigned int constant = 4;

    thrust::device_vector<unsigned int> d_in(N, constant);
    thrust::device_vector<unsigned int> d_out(N);

    moving_average<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_in.data()), thrust::raw_pointer_cast(d_out.data()), N);

    thrust::host_vector<unsigned int> h_out = d_out;
    for (int i=0; i<N; i++)
        printf("Element i = %i; h_out = %i\n", i, h_out[i]);

    return 0;
}

Dynamically allocating, and filling up a variable, via a method in C

I'm having a great deal of difficulty with pointers. What I am trying to accomplish sounds, to my ears, rather simple: I want to define a multi-dimensional char array, but not its size, and then have a second method allocate the necessary memory, and fill it up with the requested data.
Now, I've tried for countless hours to accomplish this, searched with Google until my eyes were dry and I still haven't been able to fix it. As such I was hoping any of you had any insight how this would be possible.
What I am imagining is to define a pointer char** files and a counter int total_files that will be used by the method print_files(). print_files() will then calloc and malloc the variable, and then we'll fill it up with relevant data.
Now in the code below, I have attempted this; however, at runtime I just get the magnificently detailed message: "Segmentation fault (core dumped)". Upon debugging with GDB it points at:
13 *files[i] = malloc(sizeof(char) * 100);
Now, this is for an introductory course to C programming (for Linux), and you might see numerous errors here. I do, however, thank you for your time.
I had no issues getting the code to work without the method / pointers, so I'm sure I might just be mixing up the syntax somehow.
#define _SVID_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int print_files(char*** files, int* total_files) {
    int size = 10;
    **files = calloc(size, sizeof(char *));
    for(int i = 0; i < size; i++) {
        *files[i] = malloc(sizeof(char) * 100);
    }
    *total_files = size;
}

int main() {
    char** files;
    int num_files;
    num_files = 0;

    printf("-- Start print_files\n");
    print_files(&files, &num_files);
    printf("-- end print_files, number of files: %d\n", num_files);

    for(int i = 0; i < num_files; i++)
        printf("Out: %s\n", files[i]);
    printf("total_files=%d\n", num_files);
    return 0;
}
**files = calloc(size, sizeof(char *));
This assumes that not only files points to valid memory, but the value at that memory is also a valid pointer pointing to a pointer, which will be changed.
The problem is
char** files;
print_files(&files, &num_files);
&files is a valid pointer, but (**(&files)) (as dereferenced by print_files) is an illegal dereference because files has not been initialized.
That print_files line should probably read
*files = calloc(size, sizeof(char *));
There is also a problem with
*files[i] = malloc(sizeof(char) * 100);
which is equivalent to
*(files[i]) = malloc(sizeof(char) * 100);
I think you probably mean
(*files)[i] = malloc(sizeof(char) * 100);
Here's a working version:
#include <stdio.h>
#include <stdlib.h>

static void print_files(char ***files, int *total_files)
{
    int size = 10;
    *files = calloc(size, sizeof(char *));
    for (int i = 0; i < size; i++)
    {
        (*files)[i] = malloc(sizeof(char) * 100);
        sprintf((*files)[i], "Line %d\n", i);
    }
    *total_files = size;
}

int main(void)
{
    char **files;
    int num_files;
    num_files = 0;

    printf("-- Start print_files\n");
    print_files(&files, &num_files);
    printf("-- end print_files, number of files: %d\n", num_files);

    for(int i = 0; i < num_files; i++)
        printf("Out: %s\n", files[i]);
    printf("total_files=%d\n", num_files);
    return 0;
}
The output is:
-- Start print_files
-- end print_files, number of files: 10
Out: Line 0
Out: Line 1
Out: Line 2
Out: Line 3
Out: Line 4
Out: Line 5
Out: Line 6
Out: Line 7
Out: Line 8
Out: Line 9
valgrind says "leaks like a sieve" but doesn't abuse memory while it is allocated.
What changed?
Triple pointers are scary. However, you want to use only one level of indirection in the assignment with calloc(). (With the double *, GCC warned that files in main() was used uninitialized!) Then, inside the loop, the parentheses around (*files) are critical too. The sprintf() simply serves to initialize the newly allocated string.
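Since valgrind reports leaks for the version above, here is a hedged sketch of how the allocation could be paired with a cleanup routine. The names make_files and free_files are hypothetical; make_files follows the same calloc and (*files)[i] pattern as print_files, with error checks added, and writes "Line N" without the trailing newline.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocates 10 strings of 100 bytes each through the triple pointer.
   Returns 0 on success, -1 if any allocation fails (nothing is leaked). */
int make_files(char ***files, int *total_files)
{
    assert(files != NULL && total_files != NULL);

    int size = 10;
    *files = calloc(size, sizeof(char *));      /* one level of indirection */
    if (*files == NULL)
        return -1;

    for (int i = 0; i < size; i++) {
        (*files)[i] = malloc(100);              /* parentheses are required */
        if ((*files)[i] == NULL) {
            while (i-- > 0)                     /* undo earlier allocations */
                free((*files)[i]);
            free(*files);
            *files = NULL;
            return -1;
        }
        snprintf((*files)[i], 100, "Line %d", i);
    }

    *total_files = size;
    return 0;
}

/* Releases everything make_files allocated. */
void free_files(char **files, int total_files)
{
    for (int i = 0; i < total_files; i++)
        free(files[i]);
    free(files);
}
```

In main, calling free_files(files, num_files) after the last printf would release each string and then the pointer array itself.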
I didn't change the main() code significantly.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    char **ppchar;
    int x;
    int y;
    ppchar = (char**)malloc(sizeof(char*) * 100);
    for (x = 0; x < 100; x++) {
        ppchar[x] = (char*)malloc(sizeof(char) * 100);
    }
    for(x = 0; x < 100; x++) {
        for(y = 0; y < 100; y++) {
            ppchar[x][y] = rand() % 255; // ascii range
        }
    }
    for(x = 0; x < 100; x++) {
        for(y = 0; y < 100; y++) {
            // char -128 to 127 or 0 to 255 - it's mostly machine
            // dependent. This will tell you.
            printf("%d\n", ppchar[x][y]);
        }
    }
    //make sure to clean up the memory
    for (x = 0; x < 100; x++) {
        free(ppchar[x]);
    }
    free(ppchar);
    return 0;
}

Sending 2D array to Cuda Kernel

I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix of rows and items (so in my example of 10 rows with 30 data points, it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
    int tid = blockIdx.x;
    for (int r = 0; r < sizeOfBuckets; ++r) {
        int* row = (int*)((char*)current_list + r * pitch);
        for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
            int element = row[c];
        }
    }
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch);. I think the problem is that I am trying to declare the array parameter in my function as int *, but how else can I declare it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same approach to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas or suggestions of where to look?
To test this, I just want to add the value of each row together(I copied the logic from the cuda by example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
    //TODO: we need to flip the list as well
    int tid = blockIdx.x;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
        total[tid] = total + current_list[tid][c];
    }
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy the host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to declare a static 2D array for device memory. That creates a static array in host memory, which you then try to recreate as a device array. Keep in mind that it must be a one-dimensional pointer, too. See this function signature.
This example should help you with memory allocation:
#include <stdio.h>

__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, size_t pitch)
{
    int tid = blockIdx.x;
    total[tid] = 0;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c)
    {
        // pitch is in bytes, so step to row tid through a char pointer
        total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
    }
}

int main()
{
    size_t sizeOfBuckets = 10;
    size_t sizeOfBucketsHoldings = 30;
    size_t width = sizeOfBucketsHoldings * sizeof(int); // needs to be in bytes
    size_t height = sizeOfBuckets;

    int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings]; // one dimensional
    for (int i = 0; i < sizeOfBuckets; i++)
        for (int j = 0; j < sizeOfBucketsHoldings; j++)
            list[i *sizeOfBucketsHoldings + j] = i;

    size_t pitch_h = sizeOfBucketsHoldings * sizeof(int); // always in bytes
    int* dev_current_list;
    size_t pitch_d;
    cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);

    int *test;
    cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
    int* h_test = new int[sizeOfBuckets];

    cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
    process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
    cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < sizeOfBuckets; i++)
        printf("%d %d\n", i , h_test[i]);

    // release device and host memory
    cudaFree(dev_current_list);
    cudaFree(test);
    delete[] list;
    delete[] h_test;
    return 0;
}
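To access your 2D array in the kernel, you should use the pattern base_addr + y * pitch_d + x, that is, (int*)((char*)base_addr + y * pitch_d) + x.
WARNING: the pitch is always in bytes. You need to cast your pointer to a byte pointer (char*) before adding y * pitch_d.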
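The addressing rule can be tried out on the host in plain C. In this sketch pitched_row and pitched_row_sum are hypothetical names, and the pitch is produced by rounding the row width up to a multiple of 512 bytes. That rounding is only an assumption for illustration; on the device, the pitch is whatever cudaMallocPitch returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns a pointer to row y of a pitched array of int; pitch is in bytes. */
int *pitched_row(void *base, size_t pitch, size_t y)
{
    return (int *)((char *)base + y * pitch);
}

/* Builds a rows x cols array in which every element of row y holds y,
   using a padded pitch, and returns the sum of the requested row
   (or -1 if allocation fails). Hypothetical host-side demonstration. */
long pitched_row_sum(size_t rows, size_t cols, size_t row)
{
    assert(row < rows);

    /* Assumed padding rule, chosen only for this example. */
    size_t pitch = (cols * sizeof(int) + 511) / 512 * 512;
    char *base = malloc(rows * pitch);
    if (base == NULL)
        return -1;

    for (size_t y = 0; y < rows; ++y) {
        int *r = pitched_row(base, pitch, y);
        for (size_t x = 0; x < cols; ++x)
            r[x] = (int)y;
    }

    long sum = 0;
    int *r = pitched_row(base, pitch, row);
    for (size_t x = 0; x < cols; ++x)
        sum += r[x];

    free(base);
    return sum;
}
```

With 10 rows of 30 items, row 7 sums to 210, matching what the kernel above computes for tid = 7 when each row is filled with its own index.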

C pthread Segmentation fault

So I was trying to make a GPGPU emulator with C and pthreads, but ran into a rather strange problem, and I have no idea why it's occurring. The code is below:
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <assert.h>

// simplifies malloc
#define MALLOC(a) (a *)malloc(sizeof(a))

// Index of x/y coordinate
#define x (0)
#define y (1)

// Defines size of a block
#define BLOCK_DIM_X (3)
#define BLOCK_DIM_Y (2)

// Defines size of the grid, i.e., how many blocks
#define GRID_DIM_X (5)
#define GRID_DIM_Y (7)

// Defines the number of threads in the grid
#define GRID_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y * GRID_DIM_X * GRID_DIM_Y)

// execution environment for the kernel
typedef struct exec_env {
    int threadIdx[2]; // thread location
    int blockIdx[2];
    int blockDim[2];
    int gridDim[2];
    float *A,*B; // parameters for the thread
    float *C;
} exec_env;

// kernel
void *kernel(void *arg)
{
    exec_env *env = (exec_env *) arg;

    // compute number of threads in a block
    int sz = env->blockDim[x] * env->blockDim[y];

    // compute the index of the first thread in the block
    int k = sz * (env->blockIdx[y]*env->gridDim[x] + env->blockIdx[x]);

    // compute the index of a thread inside a block
    k = k + env->threadIdx[y]*env->blockDim[x] + env->threadIdx[x];

    // check whether it is in range
    assert(k >= 0 && k < GRID_SIZE && "Wrong index computation");

    // print coordinates in block and grid and computed index
    /*printf("tx:%d ty:%d bx:%d by:%d idx:%d\n",env->threadIdx[x],
                                               env->threadIdx[y],
                                               env->blockIdx[x],
                                               env->blockIdx[y], k);*/

    // retrieve two operands
    float *A = &env->A[k];
    float *B = &env->B[k];
    printf("%f %f \n",*A, *B);

    // retrieve pointer to result
    float *C = &env->C[k];

    // do actual computation here !!!
    // For assignment replace the following line with
    // the code to do matrix addition and multiplication.
    *C = *A + *B;

    // free execution environment (not needed anymore)
    free(env);
    return NULL;
}

// main function
int main(int argc, char **argv)
{
    float A[GRID_SIZE] = {-1};
    float B[GRID_SIZE] = {-1};
    float C[GRID_SIZE] = {-1};
    pthread_t threads[GRID_SIZE];
    int i=0, bx, by, tx, ty;

    //Error location
    /*for (i = 0; i < GRID_SIZE;i++){
        A[i] = i;
        B[i] = i+1;
        printf("%f %f\n ", A[i], B[i]);
    }*/

    // Step 1: create execution environment for threads and create thread
    for (bx=0;bx<GRID_DIM_X;bx++) {
        for (by=0;by<GRID_DIM_Y;by++) {
            for (tx=0;tx<BLOCK_DIM_X;tx++) {
                for (ty=0;ty<BLOCK_DIM_Y;ty++) {
                    exec_env *e = MALLOC(exec_env);
                    assert(e != NULL && "memory exhausted");

                    // set parameters
                    e->threadIdx[x] = tx;
                    e->threadIdx[y] = ty;
                    e->blockIdx[x] = bx;
                    e->blockIdx[y] = by;
                    e->blockDim[x] = BLOCK_DIM_X;
                    e->blockDim[y] = BLOCK_DIM_Y;
                    e->gridDim[x] = GRID_DIM_X;
                    e->gridDim[y] = GRID_DIM_Y;
                    e->A = A;
                    e->B = B;
                    e->C = C;

                    // create thread
                    pthread_create(&threads[i++],NULL,kernel,(void *)e);
                }
            }
        }
    }

    // Step 2: wait for completion of all threads
    for (i=0;i<GRID_SIZE;i++) {
        pthread_join(threads[i], NULL);
    }

    // Step 3: print result
    for (i=0;i<GRID_SIZE;i++) {
        printf("%f ",C[i]);
    }
    return 0;
}
OK, this code runs fine, but as soon as I uncomment the "Error location" block (the for loop which assigns A[i] = i and B[i] = i + 1), I get a segmentation fault on Unix, and random 0s within C on Cygwin. I must admit my fundamentals in C are pretty poor, so it is highly likely that I missed something. If someone can give an idea of what's going wrong, it would be greatly appreciated. Thanks.
It works when you comment that out, because i is still 0 when the 4 nested loops start.
You have this:
for (i = 0; i < GRID_SIZE;i++){
    A[i] = i;
    B[i] = i+1;
    printf("%f %f\n ", A[i], B[i]);
}
/* What value is `i` now ? */
And then
pthread_create(&threads[i++],NULL,kernel,(void *)e);
So pthread_create will try to access some interesting indexes indeed.
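To make the effect concrete without creating any threads, the sketch below only counts which threads[] slots the creation loops would touch. last_thread_slot is a hypothetical function, and GRID_SIZE is assumed here to be the product of the four dimension macros, which is what the "number of threads in the grid" comment in the question suggests.

```c
#include <assert.h>

#define BLOCK_DIM_X 3
#define BLOCK_DIM_Y 2
#define GRID_DIM_X 5
#define GRID_DIM_Y 7
/* Assumed definition: total number of threads in the grid. */
#define GRID_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y * GRID_DIM_X * GRID_DIM_Y)

/* Returns the highest threads[] index the creation loops would use.
   With separate_counter == 0 the loops continue from i, as in the
   question; otherwise they use their own counter starting at 0. */
int last_thread_slot(int separate_counter)
{
    int i, bx, by, tx, ty, slot, last = -1;

    for (i = 0; i < GRID_SIZE; i++) {
        /* A[i] = i; B[i] = i + 1; in the original program */
    }
    /* At this point i == GRID_SIZE. */

    slot = separate_counter ? 0 : i;
    for (bx = 0; bx < GRID_DIM_X; bx++)
        for (by = 0; by < GRID_DIM_Y; by++)
            for (tx = 0; tx < BLOCK_DIM_X; tx++)
                for (ty = 0; ty < BLOCK_DIM_Y; ty++)
                    last = slot++;   /* stands in for &threads[i++] */

    assert(last >= 0);
    return last;
}
```

With the counter reused, the last slot is 2 * GRID_SIZE - 1, far past the end of threads[]. With a fresh counter (or i reset to 0 before the nested loops), it is GRID_SIZE - 1, the last valid index.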
