How to dynamically allocate arrays inside a kernel?

I need to dynamically allocate some arrays inside the kernel function. How can I do that?
My code is something like this:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float x[n], y[nn];
    // Do some really cool and heavy computations here that take hours.
}
But that will not work. If this were host code I could use malloc, but cudaMalloc needs a pointer on the host and another on the device, and inside the kernel function I don't have the host pointer.
So, what should I do?
If it takes a while (a few seconds) to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.

Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float *x = new float[n], *y = new float[nn];
    // ... computations ...
    delete [] y;
    delete [] x;
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
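A supplementary note: the device heap used by in-kernel new/malloc defaults to 8 MB, so large per-thread allocations can fail (returning nullptr) unless the heap is enlarged first. A minimal sketch, assuming n, nn and the launch configuration are known on the host (the launch() helper and its parameter names are mine, purely for illustration):
// Hedged sketch: size the device malloc/new heap before launching a kernel
// that allocates per thread. The "4 arrays of size n and 5 of size nn"
// figures come from the question; launch_blocks/launch_threads are
// illustrative names, not part of the original code.
#include <cuda_runtime.h>

__global__ void func(float *grid_d, int n, int nn);   // kernel as shown above

void launch(float *grid_d, int n, int nn, int launch_blocks, int launch_threads)
{
    size_t per_thread = (size_t)(4 * n + 5 * nn) * sizeof(float);
    size_t heap_bytes = per_thread * (size_t)launch_blocks * launch_threads;
    // Must be set before the first kernel that uses the device heap is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    func<<<launch_blocks, launch_threads>>>(grid_d, n, nn);
    cudaDeviceSynchronize();
}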

@talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing the performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self-contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(new int[N_ITEMS]);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
delete[] s;
}
__global__ void test_malloc_2(int* items, int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(items + tx * N_ITEMS);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
}
int main()
{
cudaError_t cuda_status;
cudaSetDevice(0);
int blocks_per_launch(1024 * 10);
int threads_per_block(256);
int threads_per_launch(blocks_per_launch * threads_per_block);
int* totals_d;
cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaDeviceSynchronize();
cudaEventRecord(start, 0);
#ifdef USE_DYNAMIC_MALLOC
cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
int* items_d;
cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
cudaFree(items_d);
#endif
cuda_status = cudaDeviceSynchronize();
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed: %f\n", elapsedTime);
int* totals_h(new int[threads_per_launch]);
cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
for (int i(0); i < 10; ++i) {
printf("%d ", totals_h[i]);
}
printf("\n");
cudaFree(totals_d);
delete[] totals_h;
return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080

If the values of n and nn are known before the kernel is called, then why not cudaMalloc the memory on the host side and pass the device memory pointer to the kernel?
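A hedged sketch of that idea, using the question's example (the total_threads, blocks and threads_per_block names are illustrative, not from the question):
// Each thread indexes its own slice of big buffers that were cudaMalloc'd
// on the host, instead of allocating inside the kernel.
__global__ void func(float *grid_d, float *x_all, float *y_all, int n, int nn)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *x = x_all + (size_t)tid * n;    // this thread's n floats
    float *y = y_all + (size_t)tid * nn;   // this thread's nn floats
    // ... heavy computations using x and y ...
}

// Host side, before the launch:
//   float *x_d, *y_d;
//   cudaMalloc(&x_d, (size_t)total_threads * n  * sizeof(float));
//   cudaMalloc(&y_d, (size_t)total_threads * nn * sizeof(float));
//   func<<<blocks, threads_per_block>>>(grid_d, x_d, y_d, n, nn);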

I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:
4 MB of memory allocated in 64 B chunks.
1 GPU block and 32 threads (one warp) in that block.
Run on a P100.
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
If you find mistakes, please let me know in the comments, and I'll try to fix them.
And I ran them again with larger everything:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s

Maybe you should test
    cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE * ITERATIONS);
    cudaFree(foo);
instead of
    for (int i = 0; i < ITERATIONS; ++i) {
        if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
        int * foo;
        cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
        cudaFree(foo);
    }

Related

Why does gcc-10 fail to link with error "error: array section is not contiguous in ‘map’ clause" in 2D array openacc application?

I am trying to compile a basic openacc program in C, using gcc-10. It works fine for one-dimensional arrays and for arrays declared as "A[N_x][N_y]", but when I try a 2D array allocated using malloc, either contiguous or not, I get an error message when compiling. The example below fails:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int N_x = 1000;
    int N_y = 1000;
    int i_x;

    // allocate
    double **A;
    A = malloc(N_x * sizeof(double*));
    A[0] = malloc(N_x * N_y * sizeof(double));
    for (i_x = 0; i_x < N_x; i_x++)
    {
        A[i_x] = A[0] + i_x * N_y;               // contiguous allocation
        // A[i_x] = malloc(sizeof(double) * N_y); // non-contiguous allocation
    }

    // another example of same error
    // get onto the GPU
    //#pragma acc enter data create (A[0:N_x][0:N_y])
    // get out of the GPU
    //#pragma acc exit data copyout (A[0:N_x][0:N_y])

    // following pragma triggers the "error: array section is not contiguous in ‘map’ clause" error
    #pragma acc parallel loop copy(A[0:N_x][0:N_y])
    for (i_x = 0; i_x < N_x; i_x++)
        A[i_x][i_x] = (double) i_x;

    // free
    free(A[0]);
    free(A);
    return 0;
}
Am I missing something obvious here? Thank you for your help. Btw, I compile with
gcc-10 test2.c -fopenacc
on a 64-bit Ubuntu 18.04 LTS system with this GPU card: GeForce GTX 1050 Ti/PCIe/SSE2
The code is fine, but I don't believe GNU supports non-contiguous data segments. I'll need to defer to the GNU folks, but I do believe they are developing this support for future versions of the compilers.
For now, you'll need to either switch to using the NVIDIA HPC Compiler (https://developer.nvidia.com/hpc-sdk) or refactor the code to use a single dimension array of size N_x*N_y with a computed index. Something like:
#include <stdio.h>
#include <stdlib.h>

#define IDX(n,m,s) (((n)*(s))+(m))

int main()
{
    int N_x = 1000;
    int N_y = 1000;
    int i_x;

    // allocate
    double *A;
    A = malloc(N_x * N_y * sizeof(double));

    #pragma acc enter data create(A[:N_x*N_y])

    #pragma acc parallel loop present(A)
    for (i_x = 0; i_x < N_x; i_x++)
        A[IDX(i_x,i_x,N_x)] = (double) i_x;

    // free
    free(A);
    return 0;
}
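For reference, a possible compile line with the NVIDIA HPC SDK compilers (assuming nvc is on your PATH; -Minfo=accel just reports what was offloaded) would be something like:
nvc -acc -Minfo=accel test2.c -o test2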

Memcpy takes the same time as memset

I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.
More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.
operation time(s)
-----------------------------
memset(a,0xff,LEN) 3.7
memcpy(a,b,LEN) 3.9
a[j] += b[j] 9.4
memcpy(a,b,LEN) 3.8
Notice that memcpy is only slightly slower than memset. The operations a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memcpy because it operates on three times as much data. However it's only about 2.5 times as slow as memset.
Then I initialized b to zero with memset(b,0,LEN) and tested again:
operation time(s)
-----------------------------
memcpy(a,b,LEN) 8.2
a[j] += b[j] 11.5
Now we see that memcpy is about twice as slow as memset and a[j] += b[j] is about thrice as slow as memset, as I expect.
At the very least I would have expected that, before memset(b,0,LEN), memcpy would be slower on the first of the 100 iterations because of lazy allocation (first touch).
Why do I only get the times I expect after memset(b,0,LEN)?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>
void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;

    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}
Compile (gcc 6.2) with gcc -O3 test.c main.c. Clang 3.8 gives essentially the same result.
Test system: i7-6700HQ @ 2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN), i.e. I only see a problem on my Skylake system.
I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.
I came up with a simpler test
#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}
This outputs.
9.472976
12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs
12.5
12.5
This leads me to think this is an OS allocation issue and not a compiler issue.
#include <stdlib.h>
#include <string.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}
The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.
malloc etc. work by:
if the request can be fulfilled from the freelist, carve a chunk out of it
(in the case of calloc: the equivalent of memset(ptr, 0, size) is issued)
if not: ask the OS to extend the address space.
For systems with demand paging (COW) (an MMU could help here), the second option boils down to:
create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
add these PTEs to the address space of the process
This will consume no physical memory, except for the page tables.
Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
but, if the new page is written, the COW logic kicks in (via a page fault):
physical memory is allocated
the /dev/zero page is copied to the new page
the new page is detached from the mother page
and the calling process can finally do the update which started all this
Your b array probably was not written after mmap-ing (huge allocation requests with malloc/calloc are usually converted into mmap). And the whole array was mapped to a single read-only "zero page" (part of the COW mechanism). Reading zeroes from a single page is faster than reading from many pages, as the single page will be kept in the cache and in the TLB. This explains why the test before memset(0) was faster:
This outputs. 9.472976 12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs: 12.5 12.5
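To make the zero-page effect concrete, here is a small self-contained sketch (my illustration, assuming Linux/glibc where a 1 GB calloc is served by mmap). It times a read of the buffer before and after the pages are first written, mirroring the two cases in the question:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Sum the buffer so the reads cannot be optimized away.
static double time_read(const char *buf, size_t len) {
    clock_t t0 = clock();
    long s = 0;
    for (size_t i = 0; i < len; i++) s += buf[i];
    volatile long sink = s;   // keep the loop alive
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    size_t len = (size_t)1 << 30;                 // 1 GB
    char *b = (char*)calloc(len, 1);
    if (!b) return 1;
    printf("read of untouched calloc'd buffer: %f s\n", time_read(b, len));
    memset(b, 1, len);                            // fault in real, distinct pages
    printf("read after first touch:            %f s\n", time_read(b, len));
    free(b);
    return 0;
}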
And more about gcc's malloc+memset / calloc+memset optimization into calloc (expanded from my comment)
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) on 2013-06-27 by Marc Glisse (https://stackoverflow.com/users/1918193?) as planned for the 4.9/5.0 versions of GCC:
memset(malloc(n),0,n) -> calloc(n,1)
calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).
Implemented on 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) - https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (also https://patchwork.ozlabs.org/patch/325357/)
tree-ssa-strlen.c ...
(handle_builtin_malloc, handle_builtin_memset): New functions.
The current code in gcc/tree-ssa-strlen.c (https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889): if memset(0) gets its pointer from malloc or calloc, it will convert the malloc into calloc and then remove the memset(0):
/* Handle a call to memset.
After a call to calloc, memset(,0,) is unnecessary.
memset(malloc(n),0,n) is calloc(n,1). */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
...
if (code1 == BUILT_IN_CALLOC)
/* Not touching stmt1 */ ;
else if (code1 == BUILT_IN_MALLOC
&& operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
{
gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
size, build_one_cst (size_type_node));
si1->length = build_int_cst (size_type_node, 0);
si1->stmt = gsi_stmt (gsi1);
}
This was discussed on the gcc-patches mailing list from Mar 1, 2014 to Jul 15, 2014 under the subject "calloc = malloc + memset":
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01693.html
https://gcc.gnu.org/ml/gcc-patches/2014-03/threads.html#00009
https://gcc.gnu.org/ml/gcc-patches/2014-04/threads.html#00817
https://gcc.gnu.org/ml/gcc-patches/2014-05/msg01392.html
https://gcc.gnu.org/ml/gcc-patches/2014-06/threads.html#00234
https://gcc.gnu.org/ml/gcc-patches/2014-07/threads.html#01059
with a notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html
FWIW I believe the transformation will break a large variety of micro benchmarks.
calloc internally knows that memory fresh from the OS is zeroed. But the memory may not be faulted in yet.
memset always faults in the memory.
So if you have some test like
buf = malloc(...)
memset(buf, ...)
start = get_time();
... do something with buf
end = get_time()
Now the times will be completely off because the measured times include the page faults.
Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."
Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html
Any of these flags works:
-fdisable-tree-strlen
-fno-builtin-malloc
-fno-builtin-memset (assuming you wrote 'memset' explicitly in your code)
-fno-builtin
-ffreestanding
-O1
-Os
In the code, you can hide that the pointer passed to memset is the one returned by malloc by storing it in a volatile variable, or any other trick to hide from the compiler that we are doing memset(malloc(n),0,n).
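A minimal sketch of that volatile-variable trick (my illustration, not code from the thread); the point is only that the compiler can no longer prove the pointer handed to memset is the one returned by the adjacent malloc, so the pair is not folded into calloc:
#include <stdlib.h>
#include <string.h>

char *alloc_and_touch(size_t n) {
    char *p = (char *)malloc(n);
    char *volatile hidden = p;           // breaks the malloc -> memset provenance chain
    if (hidden) memset(hidden, 0, n);    // stays a real memset: pages get faulted in
    return p;
}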

A simple reduction program in CUDA

In the code below, I am trying to implement a simple parallel reduction with the number of blocks and the number of threads per block both being 1024. However, after implementing the partial reduction, I wish to see whether my implementation is going right or not, and to that end I make the program print the first element of the host memory (after the data has been copied from device memory to host memory).
My host memory is initialized with '1' and is copied to device memory for reduction. The printf statement after the reduction process still gives me '1' at the first element of the array.
Is there a problem in what I am printing, or is it something logical in the implementation of the reduction?
In addition, printf statements in the kernel do not print anything. Is there something wrong in my syntax or the call to the printf statement?
My code is as below:
ifndef CUDACC
define CUDACC
endif
include "cuda_runtime.h"
include "device_launch_parameters.h"
include
include
ifndef THREADSPERBLOCK
define THREADSPERBLOCK 1024
endif
ifndef NUMBLOCKS
define NUMBLOCKS 1024
endif
global void reduceKernel(int *c)
{
extern shared int sh_arr[];
int index = blockDim.x*blockIdx.x + threadIdx.x;
int sh_index = threadIdx.x;
// Storing data from Global memory to shared Memory
sh_arr[sh_index] = c[index];
__syncthreads();
for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
{
if(sh_index < i){
sh_arr[sh_index] += sh_arr[i+sh_index];
}
__syncthreads();
}
if(sh_index ==0)
c[blockIdx.x]=sh_arr[sh_index];
printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
return;
}
int main()
{
int *h_a;
int *d_a;
int share_memSize, h_memSize;
size_t d_memSize;
share_memSize = THREADSPERBLOCK*sizeof(int);
h_memSize = THREADSPERBLOCK*NUMBLOCKS;
h_a = (int*)malloc(sizeof(int)*h_memSize);
d_memSize=THREADSPERBLOCK*NUMBLOCKS;
cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));
for(int i=0; i<h_memSize; i++)
{
h_a[i]=1;
};
//printf("last element of array %d \n", h_a[h_memSize-1]);
cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
cudaMemcpy((void**)&h_a, (void**)&d_a, d_memSize, cudaMemcpyDeviceToHost);
printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
printf("sum after reduction %d \n", h_a[0]);
}
There are a number of problems with this code.
1. Much of what you've posted is not valid code. As just a few examples, your global and shared keywords are supposed to have double underscores before and after, like this: __global__ and __shared__. I assume this is some sort of copy-paste or formatting error. There are problems with your define statements as well. You should endeavor to post code that doesn't have these sorts of problems.
2. Any time you are having trouble with a CUDA code, you should use proper cuda error checking and run your code with cuda-memcheck before asking for help. If you had done this, it would have focused your attention on item 3 below.
3. Your cudaMemcpy operations are broken in a couple of ways:
cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
First, unlike cudaMalloc, but like memcpy, cudaMemcpy just takes ordinary pointer arguments. Second, the size of the transfer (like memcpy) is in bytes, so your sizes need to be scaled up by sizeof(int):
cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
and similarly for the one after the kernel.
4. printf from every thread in a large kernel (like this one, which has 1048576 threads) is probably not a good idea. You won't actually get all the output you expect, and on Windows (it appears you are running on Windows) you may run into a WDDM watchdog timeout due to kernel execution taking too long. If you need to printf from a large kernel, be selective and condition your printf on threadIdx.x and blockIdx.x (see the short sketch after this list).
5. The above things are probably enough to get some sensible printout, and as you point out you're not finished yet anyway: "I wish to see whether my implementation is going right or not". However, this kernel, as crafted, overwrites its input data with output data:
__global__ void reduceKernel(int *c)
...
c[blockIdx.x]=sh_arr[sh_index];
This will lead to a race condition. Rather than trying to sort this out for you, I'd suggest separating your output data from your input data. Even better, you should study the cuda reduction sample code which also has an associated presentation.
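Expanding on item 4, a tiny self-contained sketch of what "be selective" can look like (my illustration, not part of the fix below): only one thread of one block prints.
#include <stdio.h>

__global__ void report_first_block(const int *c)
{
    // Print from a single thread instead of from all ~1M threads.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("block 0 result: %d\n", c[0]);
}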
Here is a modified version of your code which has most of the above issues fixed. It's still not correct. It still has defect 5 above in it. Rather than completely rewrite your code to fix defect 5, I would direct you to the cuda sample code mentioned above.
$ cat t820.cu
#include <stdio.h>
#ifndef THREADSPERBLOCK
#define THREADSPERBLOCK 1024
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 1024
#endif

__global__ void reduceKernel(int *c)
{
    extern __shared__ int sh_arr[];
    int index = blockDim.x*blockIdx.x + threadIdx.x;
    int sh_index = threadIdx.x;
    // Storing data from Global memory to shared Memory
    sh_arr[sh_index] = c[index];
    __syncthreads();
    for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
    {
        if(sh_index < i){
            sh_arr[sh_index] += sh_arr[i+sh_index];
        }
        __syncthreads();
    }
    if(sh_index ==0)
        c[blockIdx.x]=sh_arr[sh_index];
    // printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
    return;
}

int main()
{
    int *h_a;
    int *d_a;
    int share_memSize, h_memSize;
    size_t d_memSize;

    share_memSize = THREADSPERBLOCK*sizeof(int);
    h_memSize = THREADSPERBLOCK*NUMBLOCKS;
    h_a = (int*)malloc(sizeof(int)*h_memSize);
    d_memSize=THREADSPERBLOCK*NUMBLOCKS;
    cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));

    for(int i=0; i<h_memSize; i++)
    {
        h_a[i]=1;
    };
    //printf("last element of array %d \n", h_a[h_memSize-1]);

    cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
    reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
    cudaMemcpy(h_a, d_a, d_memSize*sizeof(int), cudaMemcpyDeviceToHost);
    printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
    printf("first block sum after reduction %d \n", h_a[0]);
}
$ nvcc -o t820 t820.cu
$ cuda-memcheck ./t820
========= CUDA-MEMCHECK
sizeof host memory 1048576
first block sum after reduction 1024
========= ERROR SUMMARY: 0 errors
$

CUDA Array Reduction

I'm aware that there are multiple questions similar to this one already answered, but I've been unable to piece together anything very helpful from them other than that I'm probably indexing something incorrectly.
I'm trying to perform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
    extern __shared__ int S[];
    // Each thread loads one element from global to shared memory
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        S[tid] = A[i];
        __syncthreads();
        // Reduce in shared memory
        for (int t = blockDim.x/2; t > 0; t>>=1) {
            if (tid < t) {
                S[tid] += S[tid + t];
            }
            __syncthreads();
        }
        if (tid == 0) B[blockIdx.x] = S[0];
    }
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting an unspecified launch error, which I've read is similar to a segfault. I've been following the nvidia reduction documentation closely and have tried to keep my kernel within the bounds of numElements, but I seem to be missing something key considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is an out-of-bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.

program fails for array 30 x 30

This is a program for matrix multiplication on the CUDA architecture.
This code works fine when the size of the array is 30 x 30, but gives output as a series of 0's when the size is greater.
I am using a standard ec2 instance for CUDA hosted on a linux machine. Can anybody figure out the reason?
#include <stdio.h>

#define SIZE 30

__global__ void matrix_multiply(float *input1,float *input2,float *output,int dimension){
    int input1_index = threadIdx.x / dimension * dimension;
    int input2_index = threadIdx.x % dimension;
    int i=0;
    for( i =0; i <dimension; i++){
        output[threadIdx.x] += input1[input1_index + i] * input2[input2_index + i * dimension];
    }
}

int main(){
    int i,j,natural_number=1;
    float input1[SIZE][SIZE],input2[SIZE][SIZE],result[SIZE][SIZE]={0};
    float *c_input1,*c_input2,*c_result;

    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            input1[i][j]=input2[i][j]=natural_number++;
        }
    }

    cudaMalloc((void**)&c_input1,sizeof(input1));
    cudaMalloc((void**)&c_input2,sizeof(input2));
    cudaMalloc((void**)&c_result,sizeof(result));

    cudaMemcpy(c_input1,input1,sizeof(input1),cudaMemcpyHostToDevice);
    cudaMemcpy(c_input2,input2,sizeof(input2),cudaMemcpyHostToDevice);
    cudaMemcpy(c_result,result,sizeof(result),cudaMemcpyHostToDevice);

    matrix_multiply<<<1,SIZE * SIZE>>>(c_input1,c_input2,c_result,SIZE);

    if(cudaGetLastError()!=cudaSuccess){
        printf("%s\n",cudaGetErrorString(cudaGetLastError()));
    }

    cudaMemcpy(result,c_result,sizeof(result),cudaMemcpyDeviceToHost);

    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            printf("%.2f ",result[i][j]);
        }
        printf("\n");
    }

    cudaFree(c_input1);
    cudaFree(c_input2);
    cudaFree(c_result);
    return 0;
}
You probably have a max of 1024 threads per block on your GPU. 30 x 30 = 900, so that should be OK, but e.g. 40 x 40 would result in a kernel launch failure (take-home message: always check for errors!).
You probably want to consider organizing your data differently, e.g. SIZE blocks of SIZE threads and then call the kernel as:
matrix_multiply<<<SIZE, SIZE>>>(c_input1,c_input2,c_result,SIZE);
(Obviously you'll need to modify your array indexing within the kernel code, e.g. use the block index as the row and the thread index as the column.)
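A minimal sketch of that reorganization (my illustration, untested): one block per output row and one thread per output column, so the total element count is no longer limited by the per-block thread maximum.
// Computes the same product as the original kernel, but indexed by
// block (row) and thread (column) instead of a single flat thread index.
__global__ void matrix_multiply(float *input1, float *input2, float *output, int dimension){
    int row = blockIdx.x;     // one block per output row
    int col = threadIdx.x;    // one thread per output column
    float sum = 0.0f;
    for (int i = 0; i < dimension; i++) {
        sum += input1[row * dimension + i] * input2[i * dimension + col];
    }
    output[row * dimension + col] = sum;
}
// launched as: matrix_multiply<<<SIZE, SIZE>>>(c_input1, c_input2, c_result, SIZE);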
You are invoking the kernel with a configuration of a single block of SIZE * SIZE threads:
matrix_multiply<<<1, SIZE * SIZE>>>(c_input1,c_input2,c_result,SIZE);
There are not enough threads to process more.
